2023-10-09 11:41:15,353 INFO [train.py:1099] (0/4) Training started
2023-10-09 11:41:15,358 INFO [train.py:1109] (0/4) Device: cuda:0
2023-10-09 11:41:15,398 INFO [train.py:1121] (0/4) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 500, 'reset_interval': 2000, 'valid_interval': 20000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.23.4', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '0d7ef1a7867f70354ab5c59f2feb98c45558dcc7', 'k2-git-date': 'Sat Mar 18 12:59:04 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.0', 'torch-cuda-available': True, 'torch-cuda-version': '11.8', 'python-version': '3.1', 'icefall-git-branch': 'master', 'icefall-git-sha1': '9a94348-dirty', 'icefall-git-date': 'Wed Sep 20 16:11:36 2023', 'icefall-path': '/mnt/lustre/sjtu/home/yfy62/icefall-phone2', 'k2-path': '/home/yfy62/anaconda3/envs/icefall/lib/python3.10/site-packages/k2-1.23.4.dev20230319+cuda11.8.torch2.0.0-py3.10-linux-x86_64.egg/k2/__init__.py', 'lhotse-path': '/home/yfy62/anaconda3/envs/icefall/lib/python3.10/site-packages/lhotse/__init__.py', 'hostname': 'd3-hpc-sjtu-test-004', 'IP address': '10.11.11.11'}, 'world_size': 4, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 30, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp_XL_bpe'), 'bpe_model': 'data/lang_bpe_500/bpe.model', 'base_lr': 0.045, 'lr_batches': 7500, 'lr_epochs': 1.0, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'ctc_loss_scale': 0.2, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 8000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 700, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'input_strategy': 'PrecomputedFeatures', 'subset': 'XL', 'small_dev': False, 'blank_id': 0, 'vocab_size': 500}
2023-10-09 11:41:15,398 INFO [train.py:1123] (0/4) About to create model
2023-10-09 11:41:16,048 INFO [train.py:1127] (0/4) Number of model parameters: 65549011
2023-10-09 11:41:18,429 INFO [train.py:1142] (0/4) Using DDP
2023-10-09 11:41:18,774 INFO [asr_datamodule.py:396] (0/4) About to get train XL cuts
2023-10-09 11:41:18,778 INFO [asr_datamodule.py:405] (0/4) Loading GigaSpeech 1000 splits in lazy mode
2023-10-09 11:42:05,507 INFO [asr_datamodule.py:230] (0/4) Enable MUSAN
2023-10-09 11:42:05,507 INFO [asr_datamodule.py:231] (0/4) About to get Musan cuts
2023-10-09 11:42:07,921 INFO [asr_datamodule.py:255] (0/4) Enable SpecAugment
2023-10-09 11:42:07,922 INFO [asr_datamodule.py:256] (0/4) Time warp factor: 80
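The sampler the datamodule reports a few lines below comes from lhotse. A minimal sketch of constructing it with the values from the config dict above ('max_duration': 700, 'num_buckets': 30, 'shuffle': True, 'drop_last': True); the manifest path is hypothetical, since the run actually loads 1000 GigaSpeech XL splits lazily:

    from lhotse import CutSet
    from lhotse.dataset import DynamicBucketingSampler

    # Hypothetical single-manifest path, for illustration only.
    cuts = CutSet.from_file("data/fbank/cuts_XL.jsonl.gz")
    sampler = DynamicBucketingSampler(
        cuts,
        max_duration=700.0,  # seconds of audio per batch ('max_duration': 700)
        num_buckets=30,      # 'num_buckets': 30
        shuffle=True,        # 'shuffle': True
        drop_last=True,      # 'drop_last': True
    )

Bucketing by duration keeps each batch's padding small while max_duration caps per-batch audio seconds, which is why the logged "batch size" varies (50, 220, 93, ...) across the entries below.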
2023-10-09 11:42:07,922 INFO [asr_datamodule.py:266] (0/4) Num frame mask: 10
2023-10-09 11:42:07,922 INFO [asr_datamodule.py:279] (0/4) About to create train dataset
2023-10-09 11:42:07,922 INFO [asr_datamodule.py:306] (0/4) Using DynamicBucketingSampler.
2023-10-09 11:42:19,149 INFO [asr_datamodule.py:321] (0/4) About to create train dataloader
2023-10-09 11:42:19,150 INFO [asr_datamodule.py:420] (0/4) About to get dev cuts
2023-10-09 11:42:19,152 INFO [asr_datamodule.py:352] (0/4) About to create dev dataset
2023-10-09 11:42:19,638 INFO [asr_datamodule.py:366] (0/4) About to create dev dataloader
2023-10-09 11:42:47,053 INFO [train.py:1031] (0/4) Epoch 1, batch 0, loss[loss=7.748, simple_loss=7.05, pruned_loss=6.967, over 16448.00 frames. ], tot_loss[loss=7.748, simple_loss=7.05, pruned_loss=6.967, over 16448.00 frames. ], batch size: 50, lr: 2.25e-02, grad_scale: 1.0
2023-10-09 11:42:47,055 INFO [train.py:1054] (0/4) Computing validation loss
2023-10-09 11:42:54,840 INFO [train.py:1063] (0/4) Epoch 1, validation: loss=7.75, simple_loss=7.06, pruned_loss=6.883, over 1020973.00 frames.
2023-10-09 11:42:54,841 INFO [train.py:1064] (0/4) Maximum memory allocated so far is 12750MB
2023-10-09 11:42:59,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=0.0, ans=0.2
2023-10-09 11:43:01,311 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=0.0, ans=0.2
2023-10-09 11:43:05,399 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=0.0, ans=0.5
2023-10-09 11:43:07,728 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=46.666666666666664, ans=0.2995333333333333
2023-10-09 11:43:14,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=46.666666666666664, ans=0.19825
2023-10-09 11:43:17,093 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=10.05 vs. limit=5.011666666666667
2023-10-09 11:43:19,486 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=10.21 vs. limit=4.037333333333334
2023-10-09 11:43:21,606 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=93.33333333333333, ans=7.535
2023-10-09 11:43:26,228 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=106.76 vs. limit=7.57
2023-10-09 11:43:35,706 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=23.36 vs. limit=7.5525
2023-10-09 11:43:40,383 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.45 vs. limit=3.021
2023-10-09 11:43:40,487 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=249.42 vs. limit=7.5525
2023-10-09 11:43:41,813 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=479.21 vs. limit=7.5525
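The learning rates that train.py logs (lr: 2.25e-02 at batch 0 above, 4.49e-02 at batch 500 further down, with 'base_lr': 0.045, 'lr_batches': 7500, 'lr_epochs': 1.0) are consistent with icefall's Eden schedule. A minimal standalone sketch, assuming the formula from icefall's optim.py with its 500-batch linear warm-up:

    def eden_lr(base_lr: float, batch: int, epoch: float,
                lr_batches: float = 7500.0, lr_epochs: float = 1.0,
                warmup_batches: float = 500.0) -> float:
        """Eden schedule (sketch after icefall optim.py; formula assumed)."""
        factor = (((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
                  * ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25)
        warmup = 1.0 if batch >= warmup_batches else 0.5 + 0.5 * batch / warmup_batches
        return base_lr * factor * warmup

    print(eden_lr(0.045, batch=0, epoch=0.0))     # ~2.25e-02, matches the batch-0 entry
    print(eden_lr(0.045, batch=500, epoch=0.02))  # ~4.49e-02, matches the batch-500 entry

Both logged values are reproduced: the warm-up factor starts at 0.5 (half of base_lr) and reaches 1.0 by batch 500, after which the inverse-quartic decay in batches and epochs takes over.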
2023-10-09 11:43:47,165 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=140.92 vs. limit=5.093333333333334
2023-10-09 11:43:47,809 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=186.66666666666666, ans=0.49125
2023-10-09 11:43:49,536 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=96.75 vs. limit=7.57
2023-10-09 11:43:50,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=186.66666666666666, ans=0.8934666666666667
2023-10-09 11:44:02,350 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.51 vs. limit=3.035
2023-10-09 11:44:05,946 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=233.33333333333334, ans=0.8918333333333334
2023-10-09 11:44:06,536 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=27.64 vs. limit=7.675
2023-10-09 11:44:10,130 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=50.26 vs. limit=7.605
2023-10-09 11:44:19,622 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.37 vs. limit=3.042
2023-10-09 11:44:19,792 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=187.52 vs. limit=7.605
2023-10-09 11:44:23,291 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=45.97 vs. limit=7.745
2023-10-09 11:44:31,875 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=36.11 vs. limit=7.6225
2023-10-09 11:44:32,790 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=11.63 vs. limit=4.1306666666666665
2023-10-09 11:44:36,335 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=276.31 vs. limit=7.64
2023-10-09 11:44:38,808 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=373.3333333333333, ans=0.4825
2023-10-09 11:44:40,406 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=12.86 vs. limit=4.149333333333333
2023-10-09 11:44:53,898 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=251.91 vs. limit=7.6575
2023-10-09 11:44:54,483 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=420.0, ans=0.097375
2023-10-09 11:45:01,441 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=251.28 vs. limit=7.675
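The scaling.py:199 entries print the current value ("ans") of a ScheduledFloat: a regularization constant (skip rate, dropout, balancer probability, whitening limit, ...) that is a piecewise-linear function of batch_count, which is why the ans values drift smoothly as batch_count grows. A sketch of the interpolation; the breakpoints below are invented for illustration, the real ones live in zipformer.py:

    class ScheduledFloat:
        """Piecewise-linear schedule over batch_count (sketch of the
        icefall scaling.py behaviour; breakpoints are assumptions)."""
        def __init__(self, *points):  # (batch_count, value) pairs, sorted by batch_count
            self.points = list(points)

        def value(self, batch_count: float) -> float:
            (x0, y0) = self.points[0]
            if batch_count <= x0:
                return y0
            for (x1, y1) in self.points[1:]:
                if batch_count <= x1:
                    # linear interpolation between the two surrounding breakpoints
                    return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)
                (x0, y0) = (x1, y1)
            return y0  # clamp after the last breakpoint

    # Hypothetical breakpoints for a skip rate annealed toward zero.
    conv_skip_rate = ScheduledFloat((0.0, 0.2), (4000.0, 0.05), (16000.0, 0.0))
    print(conv_skip_rate.value(0.0))  # 0.2, like the batch_count=0.0, ans=0.2 entries above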
2023-10-09 11:45:05,961 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 8.077e+01 1.311e+02 3.462e+02 3.055e+03 2.464e+04, threshold=6.925e+02, percent-clipped=0.0
2023-10-09 11:45:07,883 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=258.93 vs. limit=5.233333333333333
2023-10-09 11:45:13,609 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=513.3333333333334, ans=7.6925
2023-10-09 11:45:16,290 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=513.3333333333334, ans=0.4759375
2023-10-09 11:45:20,804 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=24.61 vs. limit=7.6925
2023-10-09 11:45:26,930 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=320.16 vs. limit=7.71
2023-10-09 11:45:31,186 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=52.67 vs. limit=7.92
2023-10-09 11:45:46,111 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=47.35 vs. limit=5.303333333333334
2023-10-09 11:45:56,110 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=653.3333333333334, ans=0.469375
2023-10-09 11:45:56,520 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.80 vs. limit=5.163333333333333
2023-10-09 11:46:01,784 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=36.58 vs. limit=7.745
2023-10-09 11:46:04,511 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=700.0, ans=0.4671875
2023-10-09 11:46:11,630 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=329.49 vs. limit=7.7625
2023-10-09 11:46:12,825 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.27 vs. limit=8.025
2023-10-09 11:46:29,669 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=75.41 vs. limit=7.7975
2023-10-09 11:46:41,615 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=27.81 vs. limit=5.21
2023-10-09 11:46:42,121 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=840.0, ans=5.525
2023-10-09 11:46:54,914 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=13.02 vs. limit=7.8325
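The optim.py:471 entries report gradient clipping inside the optimizer (Clipping_scale=2.0 is the configured scale). The five "grad-norm quartiles" read as min/25%/median/75%/max of recent gradient norms, and in the entry above the threshold equals clipping_scale times the median: 2.0 x 3.462e+02 = 6.925e+02. A sketch of that bookkeeping, an assumption about how the statistic is computed rather than the optimizer's actual code:

    import torch

    def grad_norm_stats(recent_norms: torch.Tensor, new_norm: float,
                        clipping_scale: float = 2.0):
        """Quartiles of recent grad norms plus the clip decision
        (sketch; threshold = clipping_scale * median, as the log implies)."""
        qs = torch.quantile(recent_norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = clipping_scale * qs[2].item()
        clipped = new_norm > threshold                       # counts toward percent-clipped
        scale = min(1.0, threshold / (new_norm + 1e-20))     # factor applied to the gradient
        return qs, threshold, clipped, scale

"percent-clipped" is then the fraction of recent steps whose norm exceeded the threshold; early in training it rises sharply (0.0 -> 35.0 within the first 500 batches below) as gradients grow.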
2023-10-09 11:46:57,525 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=886.6666666666666, ans=0.4584375
2023-10-09 11:47:01,054 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=886.6666666666666, ans=0.4584375
2023-10-09 11:47:08,714 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=34.28 vs. limit=7.85
2023-10-09 11:47:08,905 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.05 vs. limit=8.2
2023-10-09 11:47:09,210 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.523e+01 8.467e+01 9.386e+01 1.115e+02 3.085e+03, threshold=1.877e+02, percent-clipped=1.0
2023-10-09 11:47:12,729 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=933.3333333333334, ans=0.8673333333333334
2023-10-09 11:47:26,115 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=980.0, ans=0.4540625
2023-10-09 11:47:28,487 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=238.70 vs. limit=7.8675
2023-10-09 11:47:28,687 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=15.63 vs. limit=7.8675
2023-10-09 11:47:33,711 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1026.6666666666667, ans=0.451875
2023-10-09 11:47:34,694 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1026.6666666666667, ans=0.28973333333333334
2023-10-09 11:47:37,525 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1026.6666666666667, ans=0.09358333333333334
2023-10-09 11:47:39,048 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.43 vs. limit=5.513333333333334
2023-10-09 11:48:00,369 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=61.81 vs. limit=7.92
2023-10-09 11:48:10,537 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=18.29 vs. limit=5.5600000000000005
2023-10-09 11:48:35,958 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1260.0, ans=5.7875
2023-10-09 11:48:37,641 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.83 vs. limit=8.445
2023-10-09 11:48:42,395 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=32.85 vs. limit=7.9725
2023-10-09 11:48:51,212 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1306.6666666666667, ans=0.8542666666666667
2023-10-09 11:48:51,416 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.67 vs. limit=8.48
2023-10-09 11:48:54,728 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.50 vs. limit=8.48
2023-10-09 11:48:58,259 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.06 vs. limit=5.653333333333333
2023-10-09 11:48:59,258 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=21.02 vs. limit=7.99
2023-10-09 11:49:05,140 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=14.27 vs. limit=5.338333333333333
2023-10-09 11:49:11,406 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1353.3333333333333, ans=0.28646666666666665
2023-10-09 11:49:13,054 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.69 vs. limit=8.515
2023-10-09 11:49:17,925 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1400.0, ans=0.045625
2023-10-09 11:49:18,753 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 7.179e+01 9.899e+01 1.175e+02 1.431e+02 2.633e+02, threshold=2.350e+02, percent-clipped=8.0
2023-10-09 11:49:19,070 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, batch_count=1400.0, ans=0.236
2023-10-09 11:49:19,589 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=29.61 vs. limit=8.025
2023-10-09 11:49:23,803 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-10-09 11:49:23,836 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1400.0, ans=0.0685
2023-10-09 11:49:25,929 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1446.6666666666667, ans=0.4321875
2023-10-09 11:49:29,196 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1446.6666666666667, ans=0.4321875
2023-10-09 11:49:33,939 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=12.69 vs. limit=5.723333333333334
2023-10-09 11:49:34,814 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=37.93 vs. limit=8.0425
2023-10-09 11:49:36,696 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=73.26 vs. limit=8.0425
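The scaling.py:979 "Whitening" entries fire when a module's activations are far from whitened: the metric compares the channel covariance with a scaled identity and is 1.0 for perfectly whitened features, so "metric=106.76 vs. limit=7.57" means the whitening penalty is active there. A sketch of that statistic, following the general form of the whitening metric in icefall's scaling.py; the exact normalization is assumed here:

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
        """~1.0 when the per-group channel covariance is proportional to identity,
        larger otherwise (sketch; normalization details assumed)."""
        x = x.reshape(-1, x.shape[-1])                  # (frames, channels)
        num_frames, num_channels = x.shape
        cpg = num_channels // num_groups                # channels per group
        x = x.reshape(num_frames, num_groups, cpg).transpose(0, 1)
        covar = torch.matmul(x.transpose(1, 2), x) / num_frames  # (groups, cpg, cpg)
        mean_diag = covar.diagonal(dim1=1, dim2=2).mean()
        mean_sq = (covar ** 2).sum() / (num_groups * cpg)
        return (mean_sq / (mean_diag ** 2 + 1e-20)).item()

    x = torch.randn(1000, 192)
    print(whitening_metric(x))  # close to 1.0 for white Gaussian features

The limits themselves are ScheduledFloats (note the *.whiten.whitening_limit entries), which is why they creep upward over the course of the log.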
2023-10-09 11:49:48,077 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1493.3333333333333, ans=8.620000000000001
2023-10-09 11:49:49,747 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.78 vs. limit=8.620000000000001
2023-10-09 11:49:52,023 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.64 vs. limit=5.77
2023-10-09 11:49:57,523 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1540.0, ans=0.06534999999999999
2023-10-09 11:50:14,352 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=13.81 vs. limit=8.095
2023-10-09 11:50:16,752 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.90 vs. limit=8.1125
2023-10-09 11:50:24,938 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1633.3333333333333, ans=0.17713985685619182
2023-10-09 11:50:28,545 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.15 vs. limit=5.42
2023-10-09 11:50:29,380 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=11.63 vs. limit=5.84
2023-10-09 11:50:30,540 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=82.96 vs. limit=8.13
2023-10-09 11:50:31,566 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.60 vs. limit=5.84
2023-10-09 11:50:37,725 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1680.0, ans=0.29000000000000004
2023-10-09 11:50:39,154 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=75.83 vs. limit=5.84
2023-10-09 11:50:39,818 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1726.6666666666667, ans=0.28273333333333334
2023-10-09 11:50:50,005 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=17.39 vs. limit=8.1475
2023-10-09 11:51:00,159 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.99 vs. limit=5.443333333333333
2023-10-09 11:51:05,780 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=37.32 vs. limit=8.1825
2023-10-09 11:51:08,695 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1820.0, ans=0.13174999999999998
2023-10-09 11:51:10,207 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=11.19 vs. limit=8.1825
2023-10-09 11:51:18,702 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 8.514e+01 1.131e+02 1.381e+02 1.799e+02 3.070e+02, threshold=2.763e+02, percent-clipped=7.0
2023-10-09 11:51:27,373 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.21 vs. limit=5.933333333333334
2023-10-09 11:51:36,875 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1913.3333333333333, ans=0.26083333333333336
2023-10-09 11:51:45,886 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=34.53 vs. limit=8.235
2023-10-09 11:51:52,066 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1960.0, ans=0.408125
2023-10-09 11:52:05,208 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=6.83 vs. limit=4.8213333333333335
2023-10-09 11:52:06,674 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.70 vs. limit=9.040000000000001
2023-10-09 11:52:19,592 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.61 vs. limit=6.026666666666667
2023-10-09 11:52:25,642 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=27.31 vs. limit=8.2875
2023-10-09 11:52:25,814 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.61 vs. limit=9.075
2023-10-09 11:52:26,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2100.0, ans=0.5
2023-10-09 11:52:26,738 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.51 vs. limit=9.075
2023-10-09 11:52:26,966 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.85 vs. limit=8.2875
2023-10-09 11:52:28,144 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.22 vs. limit=6.05
2023-10-09 11:52:30,144 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.19 vs. limit=8.2875
2023-10-09 11:52:34,346 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.20 vs. limit=8.305
2023-10-09 11:52:45,026 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=34.48 vs. limit=8.3225
2023-10-09 11:52:45,121 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=10.94 vs. limit=9.145
2023-10-09 11:52:47,120 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.07 vs. limit=8.3225
2023-10-09 11:52:51,006 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.16 vs. limit=9.145
2023-10-09 11:52:55,998 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.04 vs. limit=5.548333333333334
2023-10-09 11:52:58,567 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.26 vs. limit=5.5600000000000005
2023-10-09 11:53:03,077 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=2240.0, ans=9.18
2023-10-09 11:53:05,065 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.66 vs. limit=8.34
2023-10-09 11:53:05,827 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=2240.0, ans=0.0496
2023-10-09 11:53:07,022 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.10 vs. limit=9.18
2023-10-09 11:53:20,776 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2286.6666666666665, ans=0.04855
2023-10-09 11:53:23,671 INFO [train.py:1031] (0/4) Epoch 1, batch 500, loss[loss=0.8601, simple_loss=0.7321, pruned_loss=0.6749, over 16689.00 frames. ], tot_loss[loss=1.276, simple_loss=1.11, pruned_loss=1.149, over 7272230.02 frames. ], batch size: 220, lr: 4.49e-02, grad_scale: 8.0
2023-10-09 11:53:26,972 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.209e+02 1.842e+02 2.459e+02 3.238e+02 5.482e+02, threshold=4.917e+02, percent-clipped=35.0
2023-10-09 11:53:27,712 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=9.28 vs. limit=9.25
2023-10-09 11:53:46,463 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.29 vs. limit=9.285
2023-10-09 11:53:52,106 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.51 vs. limit=8.41
2023-10-09 11:53:55,856 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2426.6666666666665, ans=6.516666666666667
2023-10-09 11:53:57,327 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.95 vs. limit=6.213333333333333
2023-10-09 11:54:05,792 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=10.52 vs. limit=9.355
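The train.py:1031 entries break the transducer loss into its parts, and with 'simple_loss_scale': 0.5 and 'warm_step': 2000 from the config the logged totals check out against the usual icefall warm-up weighting: the simple-loss weight decays from 1.0 to 0.5 over warm_step batches while the pruned-loss weight grows from 0.1 to 1.0. A sketch of that combination, assuming this weighting scheme:

    def combined_loss(simple_loss: float, pruned_loss: float, batch_idx_train: int,
                      warm_step: int = 2000, s: float = 0.5) -> float:
        """Warm-up weighting of simple vs. pruned RNN-T loss (sketch;
        s = 'simple_loss_scale' from the config above)."""
        t = min(batch_idx_train / warm_step, 1.0)
        simple_scale = 1.0 - t * (1.0 - s)   # 1.0 -> s over warm_step batches
        pruned_scale = 0.1 + 0.9 * t         # 0.1 -> 1.0 over warm_step batches
        return simple_scale * simple_loss + pruned_scale * pruned_loss

    print(combined_loss(7.05, 6.967, batch_idx_train=0))      # ~7.747, cf. loss=7.748 at batch 0
    print(combined_loss(0.7321, 0.6749, batch_idx_train=500)) # ~0.860, cf. loss=0.8601 at batch 500

Both checks match the logged loss values to three decimal places, which is why the total loss can sit below the sum of its two components early in training. tot_loss is the running average over all frames seen so far (note the growing "over N frames" count).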
2023-10-09 11:54:10,339 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=4.550e+01
2023-10-09 11:54:10,676 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.95 vs. limit=5.63
2023-10-09 11:54:13,947 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.49 vs. limit=8.445
2023-10-09 11:54:26,676 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=2566.6666666666665, ans=0.3796875
2023-10-09 11:54:53,153 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=9.99 vs. limit=9.495000000000001
2023-10-09 11:54:55,026 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=9.68 vs. limit=9.495000000000001
2023-10-09 11:54:57,433 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2660.0, ans=0.8069000000000001
2023-10-09 11:55:13,329 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=9.55 vs. limit=9.565
2023-10-09 11:55:16,662 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.34 vs. limit=6.376666666666667
2023-10-09 11:55:17,466 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2753.3333333333335, ans=0.37093750000000003
2023-10-09 11:55:27,751 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 2.382e+02 3.435e+02 4.686e+02 1.303e+03, threshold=6.869e+02, percent-clipped=20.0
2023-10-09 11:55:34,567 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2800.0, ans=0.36875
2023-10-09 11:55:51,175 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2893.3333333333335, ans=0.04095833333333333
2023-10-09 11:55:52,093 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2893.3333333333335, ans=0.2710666666666667
2023-10-09 11:56:02,868 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2940.0, ans=0.3621875
2023-10-09 11:56:09,164 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=5.25 vs. limit=5.0
2023-10-09 11:56:10,941 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2986.6666666666665, ans=0.7954666666666667
2023-10-09 11:56:12,898 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=2986.6666666666665, ans=0.08133333333333334
2023-10-09 11:56:13,484 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.25 vs. limit=9.74
2023-10-09 11:56:21,997 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.92 vs. limit=5.746666666666666
2023-10-09 11:56:32,750 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=3033.3333333333335, ans=0.7938333333333334
2023-10-09 11:56:33,932 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=10.93 vs. limit=9.81
2023-10-09 11:56:33,976 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.49 vs. limit=8.655
2023-10-09 11:56:56,263 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.79 vs. limit=9.845
2023-10-09 11:57:13,845 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=3220.0, ans=0.3490625
2023-10-09 11:57:15,971 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=3220.0, ans=0.3490625
2023-10-09 11:57:23,642 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.562e+02 2.664e+02 4.137e+02 6.547e+02 1.066e+03, threshold=8.274e+02, percent-clipped=23.0
2023-10-09 11:57:46,451 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=3360.0, ans=0.02439999999999999
2023-10-09 11:57:47,647 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=3360.0, ans=0.2504
2023-10-09 11:58:03,428 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=3406.6666666666665, ans=0.03935416666666667
2023-10-09 11:58:10,179 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=3453.3333333333335, ans=0.07049999999999998
2023-10-09 11:58:16,685 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=3453.3333333333335, ans=0.338125
2023-10-09 11:58:36,251 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=3546.6666666666665, ans=0.020199999999999996
2023-10-09 11:58:46,268 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.35 vs. limit=5.437333333333333
2023-10-09 11:58:52,508 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=3593.3333333333335, ans=0.7742333333333333
2023-10-09 11:59:07,934 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=3686.6666666666665, ans=0.06174999999999997
2023-10-09 11:59:17,909 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.81 vs. limit=8.9
2023-10-09 11:59:21,470 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.758e+02 2.771e+02 4.569e+02 6.440e+02 2.053e+03, threshold=9.137e+02, percent-clipped=16.0
2023-10-09 11:59:33,858 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=3780.0, ans=0.027500000000000024
2023-10-09 11:59:36,032 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=3780.0, ans=0.3228125
2023-10-09 11:59:41,587 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=5.92 vs. limit=5.530666666666667
2023-10-09 11:59:51,226 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=3826.6666666666665, ans=0.320625
2023-10-09 11:59:57,887 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=3873.3333333333335, ans=0.31843750000000004
2023-10-09 12:00:12,275 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.96 vs. limit=3.588
2023-10-09 12:00:18,069 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=3920.0, ans=0.7628
2023-10-09 12:00:36,739 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.94 vs. limit=5.605333333333333
2023-10-09 12:00:38,530 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=4013.3333333333335, ans=0.07491666666666667
2023-10-09 12:01:00,151 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=5.35 vs. limit=5.642666666666667
2023-10-09 12:01:03,737 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=4106.666666666667, ans=0.3075
2023-10-09 12:01:15,661 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=4153.333333333333, ans=0.3053125
2023-10-09 12:01:24,194 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.869e+02 2.785e+02 4.442e+02 8.351e+02 2.552e+03, threshold=8.884e+02, percent-clipped=21.0
2023-10-09 12:01:42,218 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=4293.333333333333, ans=0.04877777777777778
2023-10-09 12:01:43,281 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=4293.333333333333, ans=0.29874999999999996
2023-10-09 12:02:04,843 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.05 vs. limit=3.651
2023-10-09 12:02:44,935 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=4526.666666666667, ans=0.2878125
2023-10-09 12:02:54,019 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=4573.333333333333, ans=0.285625
2023-10-09 12:03:07,344 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=4620.0, ans=0.2834375
2023-10-09 12:03:13,289 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=4620.0, ans=0.2834375
2023-10-09 12:03:16,459 INFO [train.py:1031] (0/4) Epoch 1, batch 1000, loss[loss=0.6269, simple_loss=0.5779, pruned_loss=0.3518, over 16959.00 frames. ], tot_loss[loss=0.9504, simple_loss=0.8339, pruned_loss=0.7568, over 12961779.27 frames. ], batch size: 93, lr: 4.48e-02, grad_scale: 8.0
2023-10-09 12:03:18,779 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=4666.666666666667, ans=0.04722222222222222
2023-10-09 12:03:19,024 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.26 vs. limit=11.0
2023-10-09 12:03:20,449 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.520e+02 3.290e+02 5.657e+02 8.681e+02 2.028e+03, threshold=1.131e+03, percent-clipped=23.0
2023-10-09 12:03:41,665 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.78 vs. limit=11.07
2023-10-09 12:03:44,537 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=4760.0, ans=0.2524
2023-10-09 12:03:52,380 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.79 vs. limit=11.105
2023-10-09 12:04:08,462 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=4853.333333333333, ans=0.27249999999999996
2023-10-09 12:04:10,626 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.52 vs. limit=11.175
2023-10-09 12:04:19,708 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=4900.0, ans=0.251
2023-10-09 12:04:28,431 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.72 vs. limit=7.473333333333334
2023-10-09 12:04:34,647 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=4993.333333333333, ans=0.045861111111111116
2023-10-09 12:04:41,934 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=5040.0, ans=0.26375000000000004
2023-10-09 12:04:48,320 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.whiten.whitening_limit, batch_count=5040.0, ans=9.39
2023-10-09 12:04:55,581 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=5086.666666666667, ans=0.26156250000000003
2023-10-09 12:04:55,850 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.71 vs. limit=11.315
2023-10-09 12:05:05,431 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=5133.333333333333, ans=0.7203333333333334
2023-10-09 12:05:09,691 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.781e+02 2.896e+02 4.699e+02 7.921e+02 2.032e+03, threshold=9.398e+02, percent-clipped=10.0
2023-10-09 12:05:11,238 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.73 vs. limit=11.35
2023-10-09 12:05:30,033 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=5226.666666666667, ans=0.24773333333333333
2023-10-09 12:05:34,487 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.80 vs. limit=11.42
2023-10-09 12:05:40,525 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.38 vs. limit=9.46
2023-10-09 12:05:44,876 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=5273.333333333333, ans=0.044694444444444446
2023-10-09 12:05:48,009 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=5273.333333333333, ans=0.2528125
2023-10-09 12:06:01,512 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=5320.0, ans=0.044500000000000005
2023-10-09 12:06:18,776 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=5366.666666666667, ans=0.0
2023-10-09 12:06:28,702 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=5413.333333333333, ans=0.044111111111111115
2023-10-09 12:06:36,879 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=5460.0, ans=0.7089000000000001
2023-10-09 12:06:52,225 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=5506.666666666667, ans=0.043722222222222225
2023-10-09 12:07:16,466 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.864e+02 2.917e+02 4.746e+02 7.001e+02 2.391e+03, threshold=9.492e+02, percent-clipped=17.0
2023-10-09 12:07:31,089 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.02 vs. limit=9.6175
2023-10-09 12:07:32,209 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.47 vs. limit=11.735
2023-10-09 12:07:38,043 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=5693.333333333333, ans=0.23312500000000003
2023-10-09 12:07:40,027 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-09 12:07:41,948 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=5693.333333333333, ans=0.24306666666666665
2023-10-09 12:08:46,413 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=5973.333333333333, ans=0.009571014492753624
2023-10-09 12:08:53,159 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.57 vs. limit=6.493333333333333
2023-10-09 12:08:53,315 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.40 vs. limit=9.74
2023-10-09 12:08:53,949 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=6020.0, ans=0.009560869565217392
2023-10-09 12:09:10,826 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.145e+02 3.626e+02 5.347e+02 8.264e+02 2.974e+03, threshold=1.069e+03, percent-clipped=18.0
2023-10-09 12:09:17,361 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.20 vs. limit=3.917
2023-10-09 12:09:26,409 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=6113.333333333333, ans=0.6860333333333334
2023-10-09 12:09:32,323 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=6160.0, ans=0.21125
2023-10-09 12:09:35,044 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=6160.0, ans=0.0
2023-10-09 12:09:51,959 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=6253.333333333333, ans=0.20687499999999998
2023-10-09 12:09:52,302 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.08 vs. limit=12.190000000000001
2023-10-09 12:09:53,960 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=6253.333333333333, ans=0.20687499999999998
2023-10-09 12:10:09,100 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=6300.0, ans=0.237
2023-10-09 12:10:11,359 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=6346.666666666667, ans=0.0
2023-10-09 12:10:15,410 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=6346.666666666667, ans=0.23653333333333332
2023-10-09 12:10:24,536 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=6393.333333333333, ans=0.6762333333333334
2023-10-09 12:10:42,103 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=6440.0, ans=0.029875000000000002
2023-10-09 12:10:50,738 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=6486.666666666667, ans=0.1959375
2023-10-09 12:10:52,856 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=6486.666666666667, ans=0.1959375
2023-10-09 12:11:03,691 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.46 vs. limit=9.95
2023-10-09 12:11:05,221 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.97 vs. limit=12.4
2023-10-09 12:11:05,607 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.924e+02 2.894e+02 4.335e+02 6.370e+02 1.607e+03, threshold=8.670e+02, percent-clipped=8.0
2023-10-09 12:11:09,033 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=6533.333333333333, ans=0.19374999999999998
2023-10-09 12:11:13,822 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=6580.0, ans=0.19156250000000002
2023-10-09 12:11:15,007 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.36 vs. limit=12.434999999999999
2023-10-09 12:11:21,828 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=6626.666666666667, ans=0.18937500000000002
2023-10-09 12:11:33,686 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=6673.333333333333, ans=0.1871875
2023-10-09 12:11:40,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=6673.333333333333, ans=0.1871875
2023-10-09 12:11:40,740 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=6673.333333333333, ans=0.009418840579710146
2023-10-09 12:12:04,536 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=6766.666666666667, ans=0.6631666666666667
2023-10-09 12:12:27,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=6860.0, ans=0.03808333333333334
2023-10-09 12:12:38,522 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.05 vs. limit=12.68
2023-10-09 12:12:48,831 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=6953.333333333333, ans=0.23046666666666665
2023-10-09 12:12:53,102 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.15 vs. limit=8.476666666666667
2023-10-09 12:12:53,762 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=6953.333333333333, ans=0.1740625
2023-10-09 12:12:55,654 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=7000.0, ans=0.09899494936611666
2023-10-09 12:12:55,832 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.82 vs. limit=6.75
2023-10-09 12:12:56,418 INFO [train.py:1031] (0/4) Epoch 1, batch 1500, loss[loss=0.4746, simple_loss=0.4619, pruned_loss=0.2399, over 16930.00 frames. ], tot_loss[loss=0.7741, simple_loss=0.6954, pruned_loss=0.5566, over 17357613.12 frames. ], batch size: 123, lr: 4.46e-02, grad_scale: 8.0
2023-10-09 12:13:01,876 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.071e+02 3.328e+02 5.079e+02 8.132e+02 1.447e+03, threshold=1.016e+03, percent-clipped=21.0
2023-10-09 12:13:02,447 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.88 vs. limit=4.05
2023-10-09 12:13:05,526 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=7000.0, ans=0.0
2023-10-09 12:13:52,227 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=7233.333333333333, ans=0.1609375
2023-10-09 12:13:58,897 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.89 vs. limit=12.925
2023-10-09 12:14:03,450 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=7233.333333333333, ans=0.1609375
2023-10-09 12:14:09,869 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=7280.0, ans=0.2272
2023-10-09 12:14:15,335 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.56 vs. limit=12.995000000000001
2023-10-09 12:14:35,438 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=7373.333333333333, ans=0.009266666666666666
2023-10-09 12:14:52,734 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=7466.666666666667, ans=0.22533333333333333
2023-10-09 12:14:55,033 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.88 vs. limit=13.1
2023-10-09 12:14:56,270 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.029e+02 3.125e+02 4.733e+02 7.250e+02 1.707e+03, threshold=9.465e+02, percent-clipped=8.0
2023-10-09 12:15:02,511 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=7513.333333333333, ans=0.1478125
2023-10-09 12:15:18,278 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.53 vs. limit=13.17
2023-10-09 12:15:37,668 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=7653.333333333333, ans=0.14125
2023-10-09 12:15:51,402 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=7700.0, ans=0.13906249999999998
2023-10-09 12:15:57,386 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.89 vs. limit=13.275
2023-10-09 12:16:19,249 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=7793.333333333333, ans=0.13468750000000002
2023-10-09 12:16:30,490 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=7840.0, ans=0.009165217391304348
2023-10-09 12:16:42,054 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.78 vs. limit=13.415
2023-10-09 12:16:42,797 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=7886.666666666667, ans=0.6239666666666667
2023-10-09 12:16:44,987 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.25 vs. limit=10.475
2023-10-09 12:16:50,255 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=7933.333333333333, ans=0.128125
2023-10-09 12:16:51,431 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.00 vs. limit=10.475
limit=10.475 2023-10-09 12:16:51,902 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.082e+02 3.612e+02 5.663e+02 8.176e+02 1.459e+03, threshold=1.133e+03, percent-clipped=16.0 2023-10-09 12:16:53,200 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=7933.333333333333, ans=0.22066666666666668 2023-10-09 12:16:59,290 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=7980.0, ans=0.12593749999999998 2023-10-09 12:17:05,291 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=10.4925 2023-10-09 12:17:09,433 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=8026.666666666667, ans=0.21973333333333334 2023-10-09 12:17:12,937 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.23 vs. limit=10.51 2023-10-09 12:17:13,000 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.59 vs. limit=10.51 2023-10-09 12:17:25,349 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=8073.333333333333, ans=0.125 2023-10-09 12:17:33,669 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.24 vs. limit=10.545 2023-10-09 12:17:44,242 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=8166.666666666667, ans=0.21833333333333332 2023-10-09 12:17:53,981 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=8213.333333333334, ans=0.21786666666666665 2023-10-09 12:18:09,963 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=8260.0, ans=0.6109 2023-10-09 12:18:30,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=8353.333333333334, ans=0.125 2023-10-09 12:18:31,312 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.09 vs. limit=13.765 2023-10-09 12:18:32,342 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.22 vs. limit=10.6325 2023-10-09 12:18:34,729 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=8353.333333333334, ans=0.125 2023-10-09 12:18:42,508 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.961e+02 3.379e+02 4.819e+02 7.326e+02 1.735e+03, threshold=9.639e+02, percent-clipped=7.0 2023-10-09 12:18:42,675 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=8400.0, ans=0.125 2023-10-09 12:18:44,058 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.47 vs. 
limit=10.65 2023-10-09 12:18:58,073 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.87 vs. limit=10.6675 2023-10-09 12:19:14,542 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.11 vs. limit=10.7025 2023-10-09 12:19:17,396 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=8540.0, ans=0.125 2023-10-09 12:19:27,976 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.12 vs. limit=10.719999999999999 2023-10-09 12:19:51,125 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=8680.0, ans=0.2132 2023-10-09 12:20:16,270 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.22 vs. limit=14.115 2023-10-09 12:20:19,285 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.92 vs. limit=14.115 2023-10-09 12:20:20,892 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=8820.0, ans=0.125 2023-10-09 12:20:21,246 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.97 vs. limit=10.807500000000001 2023-10-09 12:20:30,716 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=8866.666666666666, ans=0.21133333333333332 2023-10-09 12:20:32,435 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.146e+02 3.144e+02 4.479e+02 6.776e+02 1.167e+03, threshold=8.957e+02, percent-clipped=4.0 2023-10-09 12:20:43,271 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=8913.333333333334, ans=0.125 2023-10-09 12:20:43,683 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.82 vs. limit=10.8425 2023-10-09 12:20:50,189 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.79 vs. limit=7.24 2023-10-09 12:20:57,059 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=8960.0, ans=0.2104 2023-10-09 12:20:59,787 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=9006.666666666666, ans=0.125 2023-10-09 12:21:07,406 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.28 vs. limit=4.351 2023-10-09 12:21:27,678 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=15.12 vs. limit=14.325 2023-10-09 12:21:44,806 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.74 vs. 
limit=9.573333333333334 2023-10-09 12:21:57,097 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=9193.333333333334, ans=0.5782333333333334 2023-10-09 12:21:59,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=9193.333333333334, ans=0.125 2023-10-09 12:22:00,630 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=9193.333333333334, ans=0.5782333333333334 2023-10-09 12:22:18,002 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten.whitening_limit, batch_count=9240.0, ans=14.43 2023-10-09 12:22:22,128 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=9286.666666666666, ans=0.0 2023-10-09 12:22:22,367 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.00 vs. limit=4.393 2023-10-09 12:22:23,930 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=9286.666666666666, ans=0.125 2023-10-09 12:22:23,948 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=9286.666666666666, ans=0.125 2023-10-09 12:22:24,366 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.13 vs. limit=10.9825 2023-10-09 12:22:31,985 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.39 vs. limit=11.0 2023-10-09 12:22:32,308 INFO [train.py:1031] (0/4) Epoch 1, batch 2000, loss[loss=0.5053, simple_loss=0.5046, pruned_loss=0.253, over 16498.00 frames. ], tot_loss[loss=0.6695, simple_loss=0.6186, pruned_loss=0.444, over 20771114.12 frames. ], batch size: 266, lr: 4.42e-02, grad_scale: 32.0 2023-10-09 12:22:35,349 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=9333.333333333334, ans=0.125 2023-10-09 12:22:38,667 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.852e+02 3.210e+02 4.167e+02 6.978e+02 1.336e+03, threshold=8.334e+02, percent-clipped=13.0 2023-10-09 12:22:44,012 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=9380.0, ans=0.125 2023-10-09 12:23:00,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=9426.666666666666, ans=0.125 2023-10-09 12:23:23,090 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=9473.333333333334, ans=0.027194444444444445 2023-10-09 12:23:24,589 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.47 vs. 
limit=7.789333333333333 2023-10-09 12:23:30,659 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=9520.0, ans=0.0 2023-10-09 12:23:37,025 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.73 vs. limit=14.64 2023-10-09 12:23:45,901 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 12:24:07,636 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=9660.0, ans=0.008769565217391305 2023-10-09 12:24:10,748 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.37 vs. limit=14.745000000000001 2023-10-09 12:24:12,721 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=9706.666666666666, ans=0.125 2023-10-09 12:24:49,454 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.161e+02 3.735e+02 5.977e+02 7.499e+02 1.921e+03, threshold=1.195e+03, percent-clipped=15.0 2023-10-09 12:24:57,404 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=9846.666666666666, ans=0.0 2023-10-09 12:25:05,983 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 12:25:58,311 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=10033.333333333334, ans=0.125 2023-10-09 12:26:24,515 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=10126.666666666666, ans=0.19873333333333332 2023-10-09 12:26:29,597 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=10126.666666666666, ans=0.07 2023-10-09 12:26:30,567 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=10126.666666666666, ans=0.125 2023-10-09 12:27:03,164 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=4.37 vs. limit=11.35 2023-10-09 12:27:04,358 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.993e+02 3.243e+02 4.017e+02 4.955e+02 1.108e+03, threshold=8.035e+02, percent-clipped=0.0 2023-10-09 12:27:07,753 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=10266.666666666666, ans=0.5406666666666667 2023-10-09 12:27:19,451 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.70 vs. 
limit=11.3675 2023-10-09 12:27:39,615 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=10406.666666666666, ans=0.125 2023-10-09 12:27:57,776 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=10500.0, ans=0.125 2023-10-09 12:28:13,403 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=10593.333333333334, ans=0.125 2023-10-09 12:28:21,791 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=10593.333333333334, ans=0.5292333333333334 2023-10-09 12:28:27,399 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=10640.0, ans=0.04949747468305833 2023-10-09 12:28:51,381 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.009e+02 2.870e+02 3.696e+02 5.862e+02 1.279e+03, threshold=7.393e+02, percent-clipped=10.0 2023-10-09 12:28:58,721 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=10780.0, ans=0.0 2023-10-09 12:29:27,882 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=10873.333333333334, ans=0.125 2023-10-09 12:29:34,536 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.78 vs. limit=15.69 2023-10-09 12:29:48,299 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.38 vs. limit=11.6125 2023-10-09 12:29:52,753 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=11013.333333333334, ans=0.008475362318840579 2023-10-09 12:30:12,585 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.39 vs. limit=15.795 2023-10-09 12:30:21,152 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.16 vs. limit=15.83 2023-10-09 12:30:28,763 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.30 vs. limit=4.673 2023-10-09 12:30:44,814 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 3.492e+02 4.927e+02 6.350e+02 1.075e+03, threshold=9.853e+02, percent-clipped=18.0 2023-10-09 12:30:54,607 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.10 vs. limit=11.717500000000001 2023-10-09 12:31:06,576 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.97 vs. limit=11.735 2023-10-09 12:31:11,743 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=19.41 vs. 
limit=16.005000000000003 2023-10-09 12:31:14,485 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=11340.0, ans=0.1866 2023-10-09 12:31:29,386 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.66 vs. limit=16.04 2023-10-09 12:31:35,112 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.86 vs. limit=11.77 2023-10-09 12:31:43,471 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=11433.333333333334, ans=0.125 2023-10-09 12:31:45,743 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.04 vs. limit=11.7875 2023-10-09 12:31:47,788 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.51 vs. limit=16.075 2023-10-09 12:31:54,941 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=11480.0, ans=0.125 2023-10-09 12:32:00,155 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=11480.0, ans=0.4982000000000001 2023-10-09 12:32:04,274 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.65 vs. limit=11.8225 2023-10-09 12:32:04,996 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=11526.666666666666, ans=0.00836376811594203 2023-10-09 12:32:11,176 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=11526.666666666666, ans=0.00836376811594203 2023-10-09 12:32:19,098 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=11573.333333333334, ans=0.18426666666666666 2023-10-09 12:32:19,347 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.61 vs. limit=11.84 2023-10-09 12:32:20,243 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.10 vs. limit=11.84 2023-10-09 12:32:33,204 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=11666.666666666666, ans=0.125 2023-10-09 12:32:33,829 INFO [train.py:1031] (0/4) Epoch 1, batch 2500, loss[loss=0.3904, simple_loss=0.4261, pruned_loss=0.1774, over 16589.00 frames. ], tot_loss[loss=0.5998, simple_loss=0.57, pruned_loss=0.3727, over 23419268.45 frames. 
], batch size: 56, lr: 4.38e-02, grad_scale: 32.0 2023-10-09 12:32:40,235 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.945e+02 3.216e+02 4.121e+02 4.962e+02 1.193e+03, threshold=8.242e+02, percent-clipped=2.0 2023-10-09 12:32:56,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=11760.0, ans=0.01766666666666667 2023-10-09 12:33:36,374 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=11900.0, ans=0.125 2023-10-09 12:33:54,709 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=11993.333333333334, ans=0.016694444444444442 2023-10-09 12:34:00,385 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.85 vs. limit=8.797333333333334 2023-10-09 12:34:10,810 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.93 vs. limit=16.53 2023-10-09 12:34:14,395 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=12086.666666666666, ans=0.125 2023-10-09 12:34:21,894 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.71 vs. limit=4.813 2023-10-09 12:34:30,836 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.127e+02 2.833e+02 3.824e+02 4.828e+02 1.178e+03, threshold=7.649e+02, percent-clipped=4.0 2023-10-09 12:34:58,604 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.80 vs. limit=8.056666666666667 2023-10-09 12:35:17,792 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.83 vs. limit=4.848 2023-10-09 12:35:19,448 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=12320.0, ans=0.125 2023-10-09 12:35:29,928 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=12366.666666666666, ans=0.17633333333333334 2023-10-09 12:35:31,000 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.82 vs. limit=8.946666666666665 2023-10-09 12:35:31,689 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=12366.666666666666, ans=0.035 2023-10-09 12:35:32,214 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.34 vs. 
limit=16.775 2023-10-09 12:35:45,198 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=12460.0, ans=0.014750000000000006 2023-10-09 12:36:01,534 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=12506.666666666666, ans=0.125 2023-10-09 12:36:29,534 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=12600.0, ans=0.008130434782608695 2023-10-09 12:36:29,650 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=12600.0, ans=0.125 2023-10-09 12:36:32,116 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.993e+02 2.893e+02 3.711e+02 4.618e+02 8.918e+02, threshold=7.422e+02, percent-clipped=4.0 2023-10-09 12:36:35,281 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=12600.0, ans=0.174 2023-10-09 12:36:59,277 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=12693.333333333334, ans=0.125 2023-10-09 12:37:05,246 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.49 vs. limit=12.2775 2023-10-09 12:37:25,776 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.74 vs. limit=8.208333333333334 2023-10-09 12:37:26,664 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=12833.333333333334, ans=0.8783333333333333 2023-10-09 12:37:39,284 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.11 vs. limit=11.440000000000001 2023-10-09 12:37:39,910 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=12880.0, ans=0.125 2023-10-09 12:37:40,773 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=12880.0, ans=0.125 2023-10-09 12:37:43,033 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.04 vs. limit=12.33 2023-10-09 12:38:02,223 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.10 vs. limit=17.23 2023-10-09 12:38:31,025 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.875e+02 2.848e+02 3.625e+02 4.845e+02 1.076e+03, threshold=7.249e+02, percent-clipped=7.0 2023-10-09 12:38:31,715 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.63 vs. 
limit=17.3 2023-10-09 12:38:46,249 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=13113.333333333334, ans=0.125 2023-10-09 12:38:55,383 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=13160.0, ans=0.05 2023-10-09 12:38:56,373 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=13160.0, ans=0.125 2023-10-09 12:39:34,276 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=13300.0, ans=0.16699999999999998 2023-10-09 12:39:36,702 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=13300.0, ans=0.07 2023-10-09 12:39:42,783 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=13346.666666666666, ans=0.007968115942028986 2023-10-09 12:39:43,670 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=13346.666666666666, ans=0.125 2023-10-09 12:40:12,405 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=13440.0, ans=0.125 2023-10-09 12:40:16,415 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.96 vs. limit=11.719999999999999 2023-10-09 12:40:18,840 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=13440.0, ans=0.09899494936611666 2023-10-09 12:40:41,336 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.080e+02 2.784e+02 3.175e+02 4.080e+02 9.135e+02, threshold=6.351e+02, percent-clipped=2.0 2023-10-09 12:40:58,925 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=13626.666666666666, ans=0.125 2023-10-09 12:41:01,005 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.18 vs. limit=12.61 2023-10-09 12:41:01,571 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=13626.666666666666, ans=0.007907246376811594 2023-10-09 12:41:01,615 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=13626.666666666666, ans=0.125 2023-10-09 12:41:12,360 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=13673.333333333334, ans=0.125 2023-10-09 12:41:13,894 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=13673.333333333334, ans=0.125 2023-10-09 12:41:31,933 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.56 vs. limit=17.79 2023-10-09 12:41:38,165 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.87 vs. 
limit=12.6625 2023-10-09 12:41:44,248 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.84 vs. limit=17.86 2023-10-09 12:41:45,854 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.33 vs. limit=12.68 2023-10-09 12:41:56,679 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=13860.0, ans=0.04949747468305833 2023-10-09 12:41:57,715 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=13860.0, ans=0.0 2023-10-09 12:42:01,259 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=13860.0, ans=0.00891666666666667 2023-10-09 12:42:02,144 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-09 12:42:15,178 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=4.16 vs. limit=9.562666666666665 2023-10-09 12:42:28,701 INFO [train.py:1031] (0/4) Epoch 1, batch 3000, loss[loss=0.3687, simple_loss=0.415, pruned_loss=0.1612, over 16939.00 frames. ], tot_loss[loss=0.5473, simple_loss=0.5336, pruned_loss=0.3218, over 25496100.05 frames. ], batch size: 93, lr: 4.34e-02, grad_scale: 32.0 2023-10-09 12:42:35,037 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.825e+02 2.881e+02 3.551e+02 4.499e+02 9.501e+02, threshold=7.102e+02, percent-clipped=6.0 2023-10-09 12:42:48,504 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=14046.666666666666, ans=0.15953333333333333 2023-10-09 12:42:51,535 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=14093.333333333334, ans=0.15906666666666666 2023-10-09 12:43:14,675 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=14186.666666666666, ans=0.025 2023-10-09 12:43:22,485 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=14186.666666666666, ans=0.15813333333333335 2023-10-09 12:43:23,975 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=14186.666666666666, ans=0.09899494936611666 2023-10-09 12:43:46,241 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=14280.0, ans=0.125 2023-10-09 12:43:57,470 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=14326.666666666666, ans=0.39856666666666674 2023-10-09 12:44:00,986 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=14326.666666666666, ans=0.125 2023-10-09 12:44:06,180 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=14373.333333333334, ans=0.05 2023-10-09 12:44:07,342 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.55 vs. 
limit=12.89 2023-10-09 12:44:35,712 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.108e+02 2.862e+02 3.437e+02 4.321e+02 9.893e+02, threshold=6.875e+02, percent-clipped=7.0 2023-10-09 12:44:36,239 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=14466.666666666666, ans=0.125 2023-10-09 12:44:46,935 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=14513.333333333334, ans=0.15486666666666665 2023-10-09 12:44:50,747 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=14513.333333333334, ans=0.125 2023-10-09 12:45:15,241 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.07 vs. limit=18.455 2023-10-09 12:45:15,919 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=14606.666666666666, ans=0.005805555555555557 2023-10-09 12:45:19,611 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.72 vs. limit=12.995000000000001 2023-10-09 12:45:21,368 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=14653.333333333334, ans=0.005611111111111108 2023-10-09 12:45:22,923 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=14653.333333333334, ans=0.125 2023-10-09 12:45:25,875 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.66 vs. limit=12.995000000000001 2023-10-09 12:45:28,368 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=14653.333333333334, ans=0.15346666666666667 2023-10-09 12:45:40,629 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=14700.0, ans=0.15300000000000002 2023-10-09 12:45:43,670 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=14746.666666666666, ans=0.3838666666666667 2023-10-09 12:45:54,477 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.60 vs. limit=9.917333333333334 2023-10-09 12:46:17,362 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=14886.666666666666, ans=0.125 2023-10-09 12:46:20,061 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.46 vs. 
limit=12.443333333333332 2023-10-09 12:46:30,335 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=14933.333333333334, ans=0.15066666666666667 2023-10-09 12:46:34,097 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=14933.333333333334, ans=0.42400000000000004 2023-10-09 12:46:35,253 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.879e+02 2.754e+02 3.405e+02 4.364e+02 7.679e+02, threshold=6.811e+02, percent-clipped=2.0 2023-10-09 12:46:36,626 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=14933.333333333334, ans=0.125 2023-10-09 12:46:55,226 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=14980.0, ans=0.0 2023-10-09 12:47:13,573 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=15073.333333333334, ans=0.125 2023-10-09 12:47:23,551 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=15120.0, ans=0.14880000000000002 2023-10-09 12:47:25,837 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.07 vs. limit=18.84 2023-10-09 12:47:32,848 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=15120.0, ans=0.3708 2023-10-09 12:47:32,989 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=15120.0, ans=0.0036666666666666722 2023-10-09 12:47:35,111 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=15166.666666666666, ans=0.125 2023-10-09 12:47:39,958 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=15166.666666666666, ans=0.125 2023-10-09 12:48:14,043 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=15306.666666666666, ans=0.0028888888888888853 2023-10-09 12:48:18,762 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=15306.666666666666, ans=0.125 2023-10-09 12:48:41,515 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.108e+02 2.718e+02 3.206e+02 4.092e+02 6.646e+02, threshold=6.413e+02, percent-clipped=0.0 2023-10-09 12:49:07,764 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=15540.0, ans=0.125 2023-10-09 12:49:17,217 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=15540.0, ans=0.35609999999999997 2023-10-09 12:49:24,433 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 12:49:28,347 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=15586.666666666666, ans=0.125 2023-10-09 12:49:33,663 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=15633.333333333334, ans=0.125 
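[Annotation] The recurring [optim.py:471] "Clipping_scale" entries track adaptive gradient clipping. The five "grad-norm quartiles" appear to be the min, 25th percentile, median, 75th percentile, and max of recently observed gradient norms, and in every such entry above the logged threshold equals Clipping_scale times the median (e.g. 2.0 x 3.206e+02 = 641.2, logged as threshold=6.413e+02 at 12:48:41; likewise 2.0 x 4.121e+02 = 824.2 against threshold=8.242e+02 at 12:32:40). The sketch below reproduces that bookkeeping; the function name and the flat window are illustrative assumptions, not icefall's actual optim.py API.

    import torch

    def clip_threshold_from_quartiles(recent_grad_norms, clipping_scale=2.0):
        # Sketch only: quantities are named after the [optim.py:471] log fields.
        # Assumption: the five logged "grad-norm quartiles" are the
        # 0/25/50/75/100 percentiles of recently observed gradient norms.
        q = torch.quantile(recent_grad_norms,
                           torch.tensor([0.0, 0.25, 0.50, 0.75, 1.0]))
        threshold = clipping_scale * q[2]  # Clipping_scale x median
        # "percent-clipped" is then the fraction of norms in the window
        # that exceeded the threshold.
        percent_clipped = 100.0 * (recent_grad_norms > threshold).float().mean()
        return q, threshold.item(), percent_clipped.item()

    # The quartiles logged at 12:48:41 above, fed back through the sketch:
    norms = torch.tensor([210.8, 271.8, 320.6, 409.2, 664.6])
    q, thr, pct = clip_threshold_from_quartiles(norms)
    print(f"threshold={thr:.4e}")  # 6.4120e+02, cf. threshold=6.413e+02 in the log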
2023-10-09 12:49:43,366 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.13 vs. limit=10.272 2023-10-09 12:49:55,912 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=15726.666666666666, ans=0.00745072463768116 2023-10-09 12:50:01,041 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=15726.666666666666, ans=0.125 2023-10-09 12:50:24,682 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.22 vs. limit=19.365000000000002 2023-10-09 12:50:31,492 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=15866.666666666666, ans=0.05 2023-10-09 12:50:38,234 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.838e+02 2.754e+02 3.493e+02 4.424e+02 8.379e+02, threshold=6.986e+02, percent-clipped=8.0 2023-10-09 12:50:44,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=15913.333333333334, ans=0.14086666666666667 2023-10-09 12:50:56,229 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=15960.0, ans=0.0 2023-10-09 12:51:23,876 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.31 vs. limit=13.52 2023-10-09 12:51:59,648 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=16193.333333333334, ans=0.125 2023-10-09 12:52:02,785 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=16193.333333333334, ans=0.0073492753623188405 2023-10-09 12:52:06,036 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=16240.0, ans=0.0073391304347826085 2023-10-09 12:52:20,002 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=16286.666666666666, ans=0.125 2023-10-09 12:52:22,135 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=16286.666666666666, ans=0.125 2023-10-09 12:52:32,063 INFO [train.py:1031] (0/4) Epoch 1, batch 3500, loss[loss=0.3639, simple_loss=0.4134, pruned_loss=0.1572, over 16942.00 frames. ], tot_loss[loss=0.5084, simple_loss=0.5073, pruned_loss=0.2851, over 27098526.41 frames. 
], batch size: 77, lr: 4.28e-02, grad_scale: 64.0 2023-10-09 12:52:35,295 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=16333.333333333334, ans=0.07 2023-10-09 12:52:37,002 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.920e+02 2.762e+02 3.443e+02 4.370e+02 9.215e+02, threshold=6.885e+02, percent-clipped=8.0 2023-10-09 12:52:59,192 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=16426.666666666668, ans=0.125 2023-10-09 12:53:04,586 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=16426.666666666668, ans=0.007298550724637681 2023-10-09 12:53:12,335 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=16473.333333333332, ans=0.125 2023-10-09 12:53:15,342 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=16473.333333333332, ans=0.0 2023-10-09 12:53:24,018 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=16520.0, ans=0.0072782608695652175 2023-10-09 12:53:41,814 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=16613.333333333332, ans=0.007257971014492754 2023-10-09 12:53:53,182 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=16613.333333333332, ans=0.125 2023-10-09 12:54:24,423 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.08 vs. 
limit=20.064999999999998 2023-10-09 12:54:32,263 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=16800.0, ans=0.132 2023-10-09 12:54:41,210 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.710e+02 2.550e+02 3.099e+02 3.752e+02 6.234e+02, threshold=6.198e+02, percent-clipped=0.0 2023-10-09 12:54:49,460 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 12:54:52,615 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=16846.666666666668, ans=0.125 2023-10-09 12:54:55,442 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=16846.666666666668, ans=0.125 2023-10-09 12:55:01,327 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=16893.333333333332, ans=0.125 2023-10-09 12:55:28,595 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=16986.666666666668, ans=0.0 2023-10-09 12:55:33,724 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=16986.666666666668, ans=0.125 2023-10-09 12:55:58,125 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=17080.0, ans=0.1292 2023-10-09 12:56:04,642 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=17126.666666666668, ans=0.30056666666666676 2023-10-09 12:56:11,992 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=17173.333333333332, ans=0.125 2023-10-09 12:56:20,902 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=17173.333333333332, ans=0.2989333333333334 2023-10-09 12:56:31,151 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.29 vs. limit=5.583 2023-10-09 12:56:35,729 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=17266.666666666668, ans=0.007115942028985507 2023-10-09 12:56:40,236 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.814e+02 2.558e+02 3.187e+02 4.054e+02 9.096e+02, threshold=6.373e+02, percent-clipped=3.0 2023-10-09 12:56:53,404 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.61 vs. limit=13.9925 2023-10-09 12:57:18,794 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=17406.666666666668, ans=0.125 2023-10-09 12:57:36,703 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.16 vs. 
limit=14.044999999999998 2023-10-09 12:58:08,036 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=17593.333333333332, ans=0.12406666666666669 2023-10-09 12:58:25,393 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=17640.0, ans=0.12360000000000002 2023-10-09 12:58:32,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=17686.666666666668, ans=0.125 2023-10-09 12:58:33,819 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=17686.666666666668, ans=0.125 2023-10-09 12:58:33,941 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=17686.666666666668, ans=0.12313333333333332 2023-10-09 12:58:50,111 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.88 vs. limit=14.15 2023-10-09 12:58:50,310 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.619e+02 2.714e+02 3.272e+02 4.158e+02 7.795e+02, threshold=6.544e+02, percent-clipped=4.0 2023-10-09 12:58:52,436 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=17733.333333333332, ans=0.12266666666666667 2023-10-09 12:58:53,999 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=17780.0, ans=0.125 2023-10-09 12:58:55,962 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=5.110e-03 2023-10-09 12:58:56,215 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.35 vs. limit=20.835 2023-10-09 12:59:01,535 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=17780.0, ans=0.0 2023-10-09 12:59:15,835 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=17826.666666666668, ans=0.1217333333333333 2023-10-09 12:59:27,113 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=17873.333333333332, ans=0.125 2023-10-09 12:59:50,625 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=17966.666666666668, ans=0.9296666666666666 2023-10-09 13:00:36,795 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=18153.333333333332, ans=0.0320816666666667 2023-10-09 13:00:37,922 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=18153.333333333332, ans=0.0 2023-10-09 13:00:50,262 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.940e+02 2.714e+02 3.196e+02 4.125e+02 7.091e+02, threshold=6.391e+02, percent-clipped=5.0 2023-10-09 13:01:08,551 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.32 vs. 
limit=14.36 2023-10-09 13:01:31,859 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=18386.666666666668, ans=0.11613333333333331 2023-10-09 13:01:37,170 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=18386.666666666668, ans=0.125 2023-10-09 13:01:59,771 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=18480.0, ans=0.11520000000000002 2023-10-09 13:02:08,778 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.02 vs. limit=9.631666666666668 2023-10-09 13:02:09,435 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=18526.666666666668, ans=0.2515666666666667 2023-10-09 13:02:18,224 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-09 13:02:25,752 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=18620.0, ans=0.0638 2023-10-09 13:02:27,467 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.10 vs. limit=21.465 2023-10-09 13:02:36,278 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=18666.666666666668, ans=0.125 2023-10-09 13:02:36,861 INFO [train.py:1031] (0/4) Epoch 1, batch 4000, loss[loss=0.3803, simple_loss=0.4116, pruned_loss=0.1745, over 16429.00 frames. ], tot_loss[loss=0.4781, simple_loss=0.4868, pruned_loss=0.2573, over 28357357.53 frames. ], batch size: 50, lr: 4.23e-02, grad_scale: 32.0 2023-10-09 13:02:42,287 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=18666.666666666668, ans=0.2466666666666667 2023-10-09 13:02:44,613 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.836e+02 2.666e+02 3.323e+02 4.176e+02 7.295e+02, threshold=6.645e+02, percent-clipped=3.0 2023-10-09 13:02:47,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=18666.666666666668, ans=0.11333333333333331 2023-10-09 13:03:02,780 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 13:03:04,572 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=18760.0, ans=0.125 2023-10-09 13:03:18,176 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-09 13:03:34,263 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=18853.333333333332, ans=0.006771014492753623 2023-10-09 13:03:39,012 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.68 vs. 
limit=14.5875 2023-10-09 13:03:39,538 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=18900.0, ans=0.09899494936611666 2023-10-09 13:03:47,572 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=18946.666666666668, ans=0.125 2023-10-09 13:03:47,635 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=18946.666666666668, ans=0.11053333333333332 2023-10-09 13:03:51,555 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.41 vs. limit=14.473333333333334 2023-10-09 13:04:02,329 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.78 vs. limit=14.622499999999999 2023-10-09 13:04:04,389 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=18993.333333333332, ans=0.006740579710144927 2023-10-09 13:04:10,969 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=19040.0, ans=0.0 2023-10-09 13:04:11,069 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-09 13:04:19,920 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=19086.666666666668, ans=0.125 2023-10-09 13:04:26,281 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.70 vs. 
limit=14.657499999999999 2023-10-09 13:04:30,206 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=19133.333333333332, ans=0.2303333333333334 2023-10-09 13:04:32,151 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=19133.333333333332, ans=10.0 2023-10-09 13:04:37,123 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.994e+02 2.715e+02 3.196e+02 3.648e+02 6.112e+02, threshold=6.393e+02, percent-clipped=0.0 2023-10-09 13:04:40,178 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=19180.0, ans=0.022584999999999994 2023-10-09 13:04:46,303 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=19180.0, ans=0.07 2023-10-09 13:04:56,118 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=19226.666666666668, ans=0.10773333333333335 2023-10-09 13:04:58,008 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=19226.666666666668, ans=0.22706666666666675 2023-10-09 13:05:02,708 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=19226.666666666668, ans=0.07 2023-10-09 13:05:14,686 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=19273.333333333332, ans=0.125 2023-10-09 13:05:19,614 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=19320.0, ans=0.2238 2023-10-09 13:05:21,513 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=19320.0, ans=0.006669565217391304 2023-10-09 13:05:27,561 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=19366.666666666668, ans=0.125 2023-10-09 13:05:31,735 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=19366.666666666668, ans=0.4905 2023-10-09 13:05:39,945 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.20 vs. 
limit=9.853333333333332 2023-10-09 13:06:16,810 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=19506.666666666668, ans=0.21726666666666672 2023-10-09 13:06:31,293 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=19553.333333333332, ans=0.21563333333333345 2023-10-09 13:06:33,815 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=19553.333333333332, ans=0.0 2023-10-09 13:06:46,819 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.009e+02 2.515e+02 2.991e+02 3.638e+02 5.967e+02, threshold=5.982e+02, percent-clipped=0.0 2023-10-09 13:06:52,849 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=19646.666666666668, ans=0.0 2023-10-09 13:06:55,666 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=19646.666666666668, ans=0.10353333333333334 2023-10-09 13:07:01,599 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=19693.333333333332, ans=0.125 2023-10-09 13:07:20,240 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=5.74 vs. limit=11.896 2023-10-09 13:07:20,809 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=19740.0, ans=0.125 2023-10-09 13:07:25,394 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=19786.666666666668, ans=0.125 2023-10-09 13:07:28,631 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=19786.666666666668, ans=0.125 2023-10-09 13:07:30,265 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=19786.666666666668, ans=0.0 2023-10-09 13:07:38,722 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=19833.333333333332, ans=0.125 2023-10-09 13:07:42,455 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=19833.333333333332, ans=0.006557971014492754 2023-10-09 13:08:19,517 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=20020.0, ans=0.125 2023-10-09 13:08:30,860 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=20066.666666666668, ans=0.125 2023-10-09 13:08:35,326 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.034e+02 2.584e+02 3.031e+02 3.766e+02 5.460e+02, threshold=6.061e+02, percent-clipped=0.0 2023-10-09 13:08:59,501 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=20160.0, ans=0.125 2023-10-09 13:09:05,392 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=20206.666666666668, ans=0.125 2023-10-09 13:09:06,465 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=20206.666666666668, ans=0.95 
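[Editor's note: the learning rates printed in the "train.py ... Epoch 1, batch N" summary lines that follow (4.17e-02 at batch 4500 down to 3.85e-02 at batch 7000) decay with batch index in a way consistent with an Eden-style schedule, lr(b) = C * ((b^2 + B^2) / B^2)^(-1/4). The sketch below is a reconstruction fitted to those printed values, with C = 0.045 and B = 7500 inferred from the log; the function name is illustrative and this is not the actual train.py/optim.py code.]

    # Assumed reconstruction, fitted to the lr values printed in this log.
    # Ignores any epoch-dependent factor, which would be ~1.0 this early in epoch 1.
    def sketch_lr(batch: int, base_lr: float = 0.045, lr_batches: int = 7500) -> float:
        """Eden-style batch scaling: base_lr * ((batch^2 + B^2) / B^2) ** -0.25."""
        return base_lr * ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25

    # Agreement with the batch summaries in this log (to the 3 digits printed):
    #   sketch_lr(4500) = 0.0417  <-> "lr: 4.17e-02"
    #   sketch_lr(5000) = 0.0410  <-> "lr: 4.10e-02"
    #   sketch_lr(5500) = 0.0404  <-> "lr: 4.04e-02"
    #   sketch_lr(6000) = 0.0398  <-> "lr: 3.98e-02"
    #   sketch_lr(6500) = 0.0391  <-> "lr: 3.91e-02"
    #   sketch_lr(7000) = 0.0385  <-> "lr: 3.85e-02"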
2023-10-09 13:09:08,870 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=12.19 vs. limit=15.0 2023-10-09 13:09:19,437 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.77 vs. limit=15.0 2023-10-09 13:09:21,364 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=20253.333333333332, ans=0.1 2023-10-09 13:09:25,776 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=20300.0, ans=0.125 2023-10-09 13:09:38,274 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=20346.666666666668, ans=0.1 2023-10-09 13:10:09,792 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=20440.0, ans=0.2 2023-10-09 13:10:15,507 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=20486.666666666668, ans=0.2 2023-10-09 13:10:29,090 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.951e+02 2.759e+02 3.156e+02 3.836e+02 6.218e+02, threshold=6.312e+02, percent-clipped=1.0 2023-10-09 13:10:30,554 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=20533.333333333332, ans=0.0 2023-10-09 13:10:37,135 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=20580.0, ans=0.1 2023-10-09 13:10:39,701 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=20580.0, ans=0.125 2023-10-09 13:10:41,695 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=20580.0, ans=0.2 2023-10-09 13:10:46,360 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=20626.666666666668, ans=0.1 2023-10-09 13:10:58,115 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.18 vs. limit=15.0 2023-10-09 13:11:44,720 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=20813.333333333332, ans=0.125 2023-10-09 13:11:51,497 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=20860.0, ans=0.125 2023-10-09 13:12:01,894 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=20860.0, ans=0.1 2023-10-09 13:12:28,361 INFO [train.py:1031] (0/4) Epoch 1, batch 4500, loss[loss=0.3344, simple_loss=0.3935, pruned_loss=0.1376, over 16879.00 frames. ], tot_loss[loss=0.4549, simple_loss=0.4717, pruned_loss=0.2361, over 29344575.74 frames. 
], batch size: 138, lr: 4.17e-02, grad_scale: 32.0 2023-10-09 13:12:36,666 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.780e+02 2.417e+02 3.022e+02 3.679e+02 7.488e+02, threshold=6.044e+02, percent-clipped=5.0 2023-10-09 13:12:42,269 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=21046.666666666668, ans=0.1 2023-10-09 13:13:16,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=21186.666666666668, ans=0.0 2023-10-09 13:13:34,133 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=21233.333333333332, ans=0.2 2023-10-09 13:13:49,062 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=21280.0, ans=0.125 2023-10-09 13:13:49,520 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.95 vs. limit=15.0 2023-10-09 13:13:55,007 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=21326.666666666668, ans=0.006233333333333334 2023-10-09 13:14:00,690 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=21373.333333333332, ans=0.0 2023-10-09 13:14:32,430 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.071e+02 2.468e+02 3.069e+02 3.788e+02 6.070e+02, threshold=6.138e+02, percent-clipped=1.0 2023-10-09 13:14:37,474 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=21513.333333333332, ans=0.006192753623188406 2023-10-09 13:14:37,748 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.89 vs. 
limit=15.0 2023-10-09 13:14:39,251 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=21513.333333333332, ans=0.0 2023-10-09 13:14:46,604 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=21560.0, ans=10.0 2023-10-09 13:14:51,019 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=21560.0, ans=0.006182608695652174 2023-10-09 13:15:17,146 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=21653.333333333332, ans=0.1 2023-10-09 13:15:20,138 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=21653.333333333332, ans=0.125 2023-10-09 13:15:48,919 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=21793.333333333332, ans=0.125 2023-10-09 13:15:49,048 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=21793.333333333332, ans=0.04949747468305833 2023-10-09 13:15:53,056 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=21793.333333333332, ans=0.125 2023-10-09 13:16:01,418 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=21840.0, ans=0.125 2023-10-09 13:16:04,134 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.98 vs. limit=15.0 2023-10-09 13:16:11,878 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, batch_count=21886.666666666668, ans=0.05 2023-10-09 13:16:12,013 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.65 vs. limit=15.0 2023-10-09 13:16:20,023 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=21933.333333333332, ans=0.125 2023-10-09 13:16:20,250 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.44 vs. 
limit=15.0 2023-10-09 13:16:24,635 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.812e+02 2.532e+02 2.894e+02 3.340e+02 5.628e+02, threshold=5.787e+02, percent-clipped=0.0 2023-10-09 13:16:34,154 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=21980.0, ans=0.0 2023-10-09 13:16:36,089 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=21980.0, ans=0.125 2023-10-09 13:17:00,000 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=22120.0, ans=0.2 2023-10-09 13:17:07,020 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=22120.0, ans=0.125 2023-10-09 13:17:10,638 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=22166.666666666668, ans=0.125 2023-10-09 13:17:19,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=22166.666666666668, ans=0.04949747468305833 2023-10-09 13:17:21,648 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=22213.333333333332, ans=0.006040579710144928 2023-10-09 13:17:25,363 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=22213.333333333332, ans=0.1 2023-10-09 13:17:31,027 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=22213.333333333332, ans=0.125 2023-10-09 13:17:53,504 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=22353.333333333332, ans=0.125 2023-10-09 13:18:12,221 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.07 vs. limit=15.0 2023-10-09 13:18:13,775 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.627e+02 2.459e+02 2.800e+02 3.330e+02 5.526e+02, threshold=5.599e+02, percent-clipped=0.0 2023-10-09 13:18:22,037 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=22446.666666666668, ans=0.2 2023-10-09 13:19:02,007 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.76 vs. limit=12.0 2023-10-09 13:19:22,331 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=22680.0, ans=0.0 2023-10-09 13:19:28,288 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=22726.666666666668, ans=0.125 2023-10-09 13:19:29,617 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.09 vs. 
limit=15.0 2023-10-09 13:19:34,746 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=22726.666666666668, ans=0.0 2023-10-09 13:19:41,448 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=22773.333333333332, ans=0.2 2023-10-09 13:19:46,628 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=22820.0, ans=0.125 2023-10-09 13:20:03,243 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=22866.666666666668, ans=0.005898550724637681 2023-10-09 13:20:07,495 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.759e+02 2.530e+02 2.781e+02 3.399e+02 7.715e+02, threshold=5.563e+02, percent-clipped=3.0 2023-10-09 13:20:39,146 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=23006.666666666668, ans=0.125 2023-10-09 13:20:52,147 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=23053.333333333332, ans=0.0 2023-10-09 13:20:57,086 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=23053.333333333332, ans=0.07 2023-10-09 13:21:14,144 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=23146.666666666668, ans=0.125 2023-10-09 13:21:34,024 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff3.min_abs, batch_count=23193.333333333332, ans=0.2 2023-10-09 13:21:35,167 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=23193.333333333332, ans=0.0 2023-10-09 13:21:37,665 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=23240.0, ans=0.125 2023-10-09 13:22:01,750 INFO [train.py:1031] (0/4) Epoch 1, batch 5000, loss[loss=0.4121, simple_loss=0.4427, pruned_loss=0.1907, over 16634.00 frames. ], tot_loss[loss=0.4373, simple_loss=0.4598, pruned_loss=0.2203, over 30096463.79 frames. ], batch size: 241, lr: 4.10e-02, grad_scale: 32.0 2023-10-09 13:22:06,686 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=23333.333333333332, ans=0.005797101449275363 2023-10-09 13:22:09,803 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.817e+02 2.559e+02 3.026e+02 3.665e+02 5.958e+02, threshold=6.051e+02, percent-clipped=2.0 2023-10-09 13:22:25,190 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=23426.666666666668, ans=0.125 2023-10-09 13:22:35,872 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=23473.333333333332, ans=0.125 2023-10-09 13:22:54,793 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=4.52 vs. 
limit=15.0 2023-10-09 13:23:02,148 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=23566.666666666668, ans=0.1 2023-10-09 13:23:08,539 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=23566.666666666668, ans=0.125 2023-10-09 13:23:16,606 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=23613.333333333332, ans=0.125 2023-10-09 13:23:22,440 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=23660.0, ans=0.025 2023-10-09 13:23:26,632 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=23660.0, ans=0.125 2023-10-09 13:23:33,104 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.65 vs. limit=15.0 2023-10-09 13:24:06,866 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.923e+02 2.522e+02 3.065e+02 3.596e+02 6.594e+02, threshold=6.131e+02, percent-clipped=1.0 2023-10-09 13:24:17,939 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=23846.666666666668, ans=0.125 2023-10-09 13:24:22,154 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=23893.333333333332, ans=0.125 2023-10-09 13:24:45,828 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=23986.666666666668, ans=0.125 2023-10-09 13:24:59,240 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=24033.333333333332, ans=0.125 2023-10-09 13:25:07,424 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=24033.333333333332, ans=0.125 2023-10-09 13:25:10,988 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.34 vs. limit=15.0 2023-10-09 13:25:17,899 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=24080.0, ans=0.125 2023-10-09 13:25:18,195 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.09 vs. 
limit=22.5 2023-10-09 13:25:22,839 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 13:25:28,100 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 13:25:37,292 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=24173.333333333332, ans=0.0 2023-10-09 13:26:03,506 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.940e+02 2.584e+02 2.927e+02 3.538e+02 5.245e+02, threshold=5.854e+02, percent-clipped=0.0 2023-10-09 13:26:10,626 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=24313.333333333332, ans=0.125 2023-10-09 13:26:37,268 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=24406.666666666668, ans=0.125 2023-10-09 13:26:51,469 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=24500.0, ans=0.005543478260869566 2023-10-09 13:27:54,462 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=24733.333333333332, ans=0.125 2023-10-09 13:27:59,748 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.920e+02 2.391e+02 2.671e+02 3.164e+02 6.020e+02, threshold=5.342e+02, percent-clipped=1.0 2023-10-09 13:28:23,455 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=24826.666666666668, ans=0.5 2023-10-09 13:28:29,691 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=24826.666666666668, ans=0.09899494936611666 2023-10-09 13:28:33,204 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 13:28:39,710 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=24873.333333333332, ans=0.125 2023-10-09 13:28:42,294 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=24873.333333333332, ans=0.0 2023-10-09 13:28:46,730 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.75 vs. limit=6.0 2023-10-09 13:28:47,598 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.74 vs. limit=10.0 2023-10-09 13:28:55,563 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.89 vs. 
limit=15.0 2023-10-09 13:29:06,960 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=24966.666666666668, ans=0.125 2023-10-09 13:29:06,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=24966.666666666668, ans=0.0 2023-10-09 13:29:16,376 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=25013.333333333332, ans=0.2 2023-10-09 13:29:20,988 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.66 vs. limit=10.0 2023-10-09 13:29:29,935 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.08 vs. limit=15.0 2023-10-09 13:29:47,439 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=25153.333333333332, ans=0.125 2023-10-09 13:30:06,102 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.741e+02 2.489e+02 2.857e+02 3.222e+02 5.031e+02, threshold=5.715e+02, percent-clipped=0.0 2023-10-09 13:30:26,712 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.28 vs. limit=10.0 2023-10-09 13:30:27,326 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=25293.333333333332, ans=0.125 2023-10-09 13:30:29,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=25293.333333333332, ans=0.1 2023-10-09 13:30:51,227 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=25386.666666666668, ans=0.125 2023-10-09 13:30:51,270 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=25386.666666666668, ans=0.0 2023-10-09 13:31:03,289 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=25433.333333333332, ans=0.125 2023-10-09 13:31:28,122 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=25526.666666666668, ans=0.2 2023-10-09 13:31:28,137 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=25526.666666666668, ans=0.07 2023-10-09 13:31:35,475 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=25573.333333333332, ans=0.05 2023-10-09 13:31:52,914 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=25620.0, ans=0.125 2023-10-09 13:31:54,463 INFO [train.py:1031] (0/4) Epoch 1, batch 5500, loss[loss=0.3567, simple_loss=0.407, pruned_loss=0.1532, over 16875.00 frames. ], tot_loss[loss=0.4224, simple_loss=0.45, pruned_loss=0.2073, over 30684344.59 frames. 
], batch size: 82, lr: 4.04e-02, grad_scale: 32.0 2023-10-09 13:31:54,675 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=25666.666666666668, ans=0.2 2023-10-09 13:32:01,554 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.846e+02 2.632e+02 2.940e+02 3.489e+02 6.640e+02, threshold=5.880e+02, percent-clipped=1.0 2023-10-09 13:32:01,905 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=25666.666666666668, ans=0.0 2023-10-09 13:32:11,630 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=25713.333333333332, ans=0.035 2023-10-09 13:32:33,941 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.45 vs. limit=15.0 2023-10-09 13:32:44,604 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=25853.333333333332, ans=0.1 2023-10-09 13:32:49,115 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=25900.0, ans=10.0 2023-10-09 13:32:49,173 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=25900.0, ans=0.125 2023-10-09 13:32:51,500 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=25900.0, ans=0.125 2023-10-09 13:33:10,238 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=25946.666666666668, ans=0.2 2023-10-09 13:33:18,224 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=25993.333333333332, ans=0.125 2023-10-09 13:33:28,754 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.12 vs. limit=15.0 2023-10-09 13:33:32,241 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=26040.0, ans=0.125 2023-10-09 13:33:40,751 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.48 vs. limit=15.0 2023-10-09 13:33:44,324 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=26133.333333333332, ans=0.125 2023-10-09 13:33:51,312 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.889e+02 2.435e+02 2.773e+02 3.403e+02 4.938e+02, threshold=5.546e+02, percent-clipped=0.0 2023-10-09 13:33:56,821 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=26180.0, ans=0.125 2023-10-09 13:33:59,926 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.13 vs. limit=10.0 2023-10-09 13:34:06,139 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.99 vs. 
limit=15.0 2023-10-09 13:34:23,395 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.84 vs. limit=15.0 2023-10-09 13:34:49,102 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=26366.666666666668, ans=0.125 2023-10-09 13:34:52,249 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=26413.333333333332, ans=0.005127536231884058 2023-10-09 13:35:14,090 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=26506.666666666668, ans=0.125 2023-10-09 13:35:15,286 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.13 vs. limit=15.0 2023-10-09 13:35:16,633 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=26506.666666666668, ans=0.1 2023-10-09 13:35:24,628 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=26553.333333333332, ans=0.125 2023-10-09 13:35:34,169 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=26553.333333333332, ans=0.125 2023-10-09 13:35:39,658 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=26600.0, ans=0.1 2023-10-09 13:35:41,622 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=17.20 vs. limit=15.0 2023-10-09 13:35:44,647 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.910e+02 2.429e+02 2.842e+02 3.278e+02 4.829e+02, threshold=5.683e+02, percent-clipped=0.0 2023-10-09 13:35:56,453 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.56 vs. limit=22.5 2023-10-09 13:35:58,681 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.83 vs. 
limit=22.5 2023-10-09 13:36:36,917 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=26833.333333333332, ans=0.125 2023-10-09 13:37:01,433 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=26926.666666666668, ans=0.0 2023-10-09 13:37:04,454 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=26926.666666666668, ans=0.1 2023-10-09 13:37:29,900 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=27020.0, ans=0.0 2023-10-09 13:37:34,343 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=27066.666666666668, ans=0.125 2023-10-09 13:37:39,175 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=27066.666666666668, ans=0.125 2023-10-09 13:37:41,781 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 2.506e+02 2.905e+02 3.814e+02 6.231e+02, threshold=5.809e+02, percent-clipped=4.0 2023-10-09 13:37:53,181 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=27113.333333333332, ans=0.0 2023-10-09 13:38:11,472 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.62 vs. limit=22.5 2023-10-09 13:38:47,908 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.78 vs. limit=6.0 2023-10-09 13:38:49,470 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.34 vs. limit=15.0 2023-10-09 13:38:56,941 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.48 vs. 
limit=15.0 2023-10-09 13:39:00,349 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=27393.333333333332, ans=0.125 2023-10-09 13:39:07,628 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=27440.0, ans=0.0 2023-10-09 13:39:10,097 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=27440.0, ans=0.125 2023-10-09 13:39:33,571 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.967e+02 2.674e+02 3.191e+02 3.498e+02 4.882e+02, threshold=6.383e+02, percent-clipped=0.0 2023-10-09 13:39:37,743 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=27580.0, ans=0.1 2023-10-09 13:39:42,331 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=27580.0, ans=0.125 2023-10-09 13:39:55,572 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=27626.666666666668, ans=0.2 2023-10-09 13:39:57,045 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=27626.666666666668, ans=0.125 2023-10-09 13:39:57,250 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.76 vs. limit=15.0 2023-10-09 13:40:02,382 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=27673.333333333332, ans=0.2 2023-10-09 13:40:10,662 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=27673.333333333332, ans=0.09899494936611666 2023-10-09 13:40:11,824 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.05 vs. limit=22.5 2023-10-09 13:40:16,375 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.20 vs. limit=15.0 2023-10-09 13:40:17,061 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=27720.0, ans=0.125 2023-10-09 13:40:22,283 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.25 vs. limit=22.5 2023-10-09 13:40:42,672 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=27813.333333333332, ans=0.2 2023-10-09 13:41:00,076 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.59 vs. limit=22.5 2023-10-09 13:41:21,077 INFO [train.py:1031] (0/4) Epoch 1, batch 6000, loss[loss=0.3985, simple_loss=0.4364, pruned_loss=0.1802, over 16634.00 frames. ], tot_loss[loss=0.4114, simple_loss=0.4429, pruned_loss=0.1975, over 31150972.78 frames. 
], batch size: 61, lr: 3.98e-02, grad_scale: 32.0 2023-10-09 13:41:24,565 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=28000.0, ans=0.1 2023-10-09 13:41:28,922 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.863e+02 2.439e+02 2.949e+02 3.440e+02 5.726e+02, threshold=5.899e+02, percent-clipped=0.0 2023-10-09 13:41:30,220 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=28000.0, ans=0.125 2023-10-09 13:41:34,042 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=28046.666666666668, ans=0.0 2023-10-09 13:41:55,820 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=28140.0, ans=0.0 2023-10-09 13:42:00,287 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=28140.0, ans=0.0 2023-10-09 13:42:29,541 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=30.68 vs. limit=22.5 2023-10-09 13:42:31,022 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=28280.0, ans=0.125 2023-10-09 13:42:43,315 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten.whitening_limit, batch_count=28326.666666666668, ans=22.5 2023-10-09 13:43:20,046 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.786e+02 2.429e+02 2.787e+02 3.328e+02 4.921e+02, threshold=5.573e+02, percent-clipped=0.0 2023-10-09 13:43:23,067 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.52 vs. limit=22.5 2023-10-09 13:43:30,289 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=28513.333333333332, ans=0.125 2023-10-09 13:43:38,104 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=28560.0, ans=0.1 2023-10-09 13:43:40,046 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=28560.0, ans=0.004660869565217392 2023-10-09 13:43:47,205 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=28606.666666666668, ans=0.0 2023-10-09 13:44:17,188 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=28700.0, ans=0.2 2023-10-09 13:44:43,782 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=28793.333333333332, ans=0.1 2023-10-09 13:44:51,440 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.58 vs. limit=15.0 2023-10-09 13:45:04,792 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=20.07 vs. 
limit=15.0 2023-10-09 13:45:09,642 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=28886.666666666668, ans=0.125 2023-10-09 13:45:17,064 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=28933.333333333332, ans=0.125 2023-10-09 13:45:21,937 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.727e+02 2.517e+02 2.954e+02 3.598e+02 5.663e+02, threshold=5.908e+02, percent-clipped=1.0 2023-10-09 13:45:31,429 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.36 vs. limit=22.5 2023-10-09 13:45:32,083 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=28980.0, ans=0.125 2023-10-09 13:46:14,748 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=29166.666666666668, ans=0.1 2023-10-09 13:46:20,614 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=29166.666666666668, ans=10.0 2023-10-09 13:46:24,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=29213.333333333332, ans=0.125 2023-10-09 13:47:14,730 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=29400.0, ans=0.2 2023-10-09 13:47:15,950 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.872e+02 2.458e+02 2.665e+02 3.145e+02 5.633e+02, threshold=5.329e+02, percent-clipped=0.0 2023-10-09 13:47:17,239 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=29400.0, ans=0.1 2023-10-09 13:47:47,163 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=29540.0, ans=0.2 2023-10-09 13:47:58,958 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=29586.666666666668, ans=0.125 2023-10-09 13:48:11,968 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=29633.333333333332, ans=0.004427536231884059 2023-10-09 13:48:19,447 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=29680.0, ans=0.004417391304347826 2023-10-09 13:48:25,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=29680.0, ans=0.0 2023-10-09 13:48:37,719 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=29726.666666666668, ans=0.125 2023-10-09 13:48:43,728 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=29726.666666666668, ans=0.1 2023-10-09 13:48:47,999 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=29726.666666666668, ans=0.1 2023-10-09 13:48:51,376 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=29773.333333333332, ans=0.125 2023-10-09 13:49:04,447 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=29820.0, ans=0.1 2023-10-09 13:49:08,775 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.95 vs. limit=12.0 2023-10-09 13:49:15,694 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=29866.666666666668, ans=0.1 2023-10-09 13:49:20,562 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=29866.666666666668, ans=0.125 2023-10-09 13:49:22,196 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.677e+02 2.316e+02 2.629e+02 2.979e+02 6.107e+02, threshold=5.258e+02, percent-clipped=1.0 2023-10-09 13:49:37,245 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=29913.333333333332, ans=0.125 2023-10-09 13:50:02,930 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=30006.666666666668, ans=0.125 2023-10-09 13:50:28,247 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.52 vs. limit=15.0 2023-10-09 13:50:37,320 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=30193.333333333332, ans=0.0 2023-10-09 13:51:04,522 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=30286.666666666668, ans=0.125 2023-10-09 13:51:13,436 INFO [train.py:1031] (0/4) Epoch 1, batch 6500, loss[loss=0.4134, simple_loss=0.4513, pruned_loss=0.1877, over 16699.00 frames. ], tot_loss[loss=0.4025, simple_loss=0.437, pruned_loss=0.1898, over 31520612.54 frames. ], batch size: 202, lr: 3.91e-02, grad_scale: 32.0 2023-10-09 13:51:15,855 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=30333.333333333332, ans=0.0 2023-10-09 13:51:18,702 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.55 vs. 
limit=15.0 2023-10-09 13:51:21,837 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.745e+02 2.425e+02 2.754e+02 3.269e+02 4.851e+02, threshold=5.508e+02, percent-clipped=0.0 2023-10-09 13:51:59,185 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=30473.333333333332, ans=0.125 2023-10-09 13:52:06,249 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=30473.333333333332, ans=0.04949747468305833 2023-10-09 13:52:10,922 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 13:52:39,187 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=30613.333333333332, ans=0.015 2023-10-09 13:52:45,857 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=30660.0, ans=0.125 2023-10-09 13:52:48,699 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=30660.0, ans=0.125 2023-10-09 13:52:49,892 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.59 vs. limit=22.5 2023-10-09 13:53:06,062 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.94 vs. limit=6.0 2023-10-09 13:53:28,105 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.580e+02 2.544e+02 2.969e+02 3.595e+02 5.241e+02, threshold=5.937e+02, percent-clipped=0.0 2023-10-09 13:53:43,384 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=30893.333333333332, ans=0.1 2023-10-09 13:53:47,691 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=30893.333333333332, ans=0.125 2023-10-09 13:53:53,017 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.65 vs. limit=15.0 2023-10-09 13:53:58,612 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=30940.0, ans=0.5 2023-10-09 13:54:03,515 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=24.95 vs. limit=22.5 2023-10-09 13:54:20,754 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.80 vs. limit=15.0 2023-10-09 13:54:53,059 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.16 vs. 
limit=15.0 2023-10-09 13:55:04,034 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=31173.333333333332, ans=0.125 2023-10-09 13:55:17,138 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=31266.666666666668, ans=0.125 2023-10-09 13:55:23,970 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.903e+02 2.591e+02 2.941e+02 3.296e+02 5.536e+02, threshold=5.882e+02, percent-clipped=0.0 2023-10-09 13:55:30,660 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=31313.333333333332, ans=0.125 2023-10-09 13:55:32,777 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=31313.333333333332, ans=0.125 2023-10-09 13:55:33,921 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=31313.333333333332, ans=0.0 2023-10-09 13:55:51,722 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.84 vs. limit=10.0 2023-10-09 13:55:59,173 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.33 vs. limit=15.0 2023-10-09 13:55:59,229 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.64 vs. limit=10.0 2023-10-09 13:56:10,687 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.07 vs. limit=6.0 2023-10-09 13:56:19,324 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=31500.0, ans=0.125 2023-10-09 13:56:20,716 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.06 vs. limit=6.0 2023-10-09 13:56:24,206 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-09 13:57:09,197 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=31686.666666666668, ans=0.0 2023-10-09 13:57:29,734 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.718e+02 2.466e+02 2.912e+02 3.541e+02 6.118e+02, threshold=5.825e+02, percent-clipped=1.0 2023-10-09 13:57:48,291 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=31780.0, ans=0.0 2023-10-09 13:57:54,995 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.40 vs. limit=6.0 2023-10-09 13:57:58,904 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.12 vs. limit=12.0 2023-10-09 13:58:07,301 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.67 vs. 
limit=22.5 2023-10-09 13:58:08,396 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=31873.333333333332, ans=0.1 2023-10-09 13:58:29,559 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.24 vs. limit=15.0 2023-10-09 13:58:33,570 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=31966.666666666668, ans=0.125 2023-10-09 13:58:58,615 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=32060.0, ans=0.05 2023-10-09 13:59:00,163 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=32060.0, ans=0.05 2023-10-09 13:59:11,094 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=32106.666666666668, ans=0.125 2023-10-09 13:59:23,614 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=32153.333333333332, ans=0.125 2023-10-09 13:59:30,006 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=32153.333333333332, ans=0.125 2023-10-09 13:59:43,997 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.736e+02 2.377e+02 2.677e+02 3.003e+02 4.420e+02, threshold=5.354e+02, percent-clipped=0.0 2023-10-09 13:59:47,599 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=32246.666666666668, ans=0.2 2023-10-09 14:00:41,651 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=32433.333333333332, ans=0.0038188405797101458 2023-10-09 14:00:52,381 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=32480.0, ans=0.2 2023-10-09 14:01:04,415 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.01 vs. limit=22.5 2023-10-09 14:01:26,916 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.81 vs. limit=15.0 2023-10-09 14:01:35,300 INFO [train.py:1031] (0/4) Epoch 1, batch 7000, loss[loss=0.405, simple_loss=0.4475, pruned_loss=0.1813, over 16961.00 frames. ], tot_loss[loss=0.3939, simple_loss=0.4316, pruned_loss=0.1826, over 31807013.99 frames. ], batch size: 123, lr: 3.85e-02, grad_scale: 32.0 2023-10-09 14:01:35,600 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=32666.666666666668, ans=0.003768115942028985 2023-10-09 14:01:44,847 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.962e+02 2.723e+02 2.964e+02 3.517e+02 5.853e+02, threshold=5.928e+02, percent-clipped=1.0 2023-10-09 14:01:51,553 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.84 vs. limit=12.0 2023-10-09 14:02:06,154 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.85 vs. 
limit=15.0 2023-10-09 14:02:15,576 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-10-09 14:03:13,286 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-09 14:03:39,081 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=33133.333333333336, ans=0.125 2023-10-09 14:03:39,818 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=33133.333333333336, ans=0.125 2023-10-09 14:03:42,399 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.886e+02 2.378e+02 2.771e+02 3.167e+02 5.382e+02, threshold=5.541e+02, percent-clipped=0.0 2023-10-09 14:04:16,801 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 14:04:22,669 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=33320.0, ans=0.1 2023-10-09 14:04:22,725 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=33320.0, ans=0.125 2023-10-09 14:04:34,218 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 14:04:41,427 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=33366.666666666664, ans=0.125 2023-10-09 14:05:07,825 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=33460.0, ans=0.0 2023-10-09 14:05:08,918 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=33506.666666666664, ans=0.125 2023-10-09 14:05:14,741 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=33506.666666666664, ans=0.0 2023-10-09 14:05:23,670 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.28 vs. limit=15.0 2023-10-09 14:05:39,009 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.56 vs. limit=15.0 2023-10-09 14:05:52,070 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.826e+02 2.465e+02 2.887e+02 3.467e+02 4.858e+02, threshold=5.773e+02, percent-clipped=0.0 2023-10-09 14:06:04,633 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.33 vs. limit=15.0 2023-10-09 14:06:10,283 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=33646.666666666664, ans=0.125 2023-10-09 14:06:10,343 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=33646.666666666664, ans=0.125 2023-10-09 14:06:37,444 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.95 vs. 
limit=22.5 2023-10-09 14:06:38,847 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.55 vs. limit=22.5 2023-10-09 14:06:45,059 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=33786.666666666664, ans=0.125 2023-10-09 14:06:49,482 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=33786.666666666664, ans=0.2 2023-10-09 14:06:54,161 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=33786.666666666664, ans=0.0 2023-10-09 14:07:17,798 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=33880.0, ans=0.0035043478260869563 2023-10-09 14:07:27,116 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=33926.666666666664, ans=0.1 2023-10-09 14:07:37,143 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=33973.333333333336, ans=0.125 2023-10-09 14:07:46,546 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=33973.333333333336, ans=0.125 2023-10-09 14:08:06,787 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.964e+02 2.420e+02 2.746e+02 3.159e+02 5.758e+02, threshold=5.493e+02, percent-clipped=0.0 2023-10-09 14:08:39,407 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=34206.666666666664, ans=0.2 2023-10-09 14:08:45,875 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=34206.666666666664, ans=0.125 2023-10-09 14:08:53,027 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=34253.333333333336, ans=0.1 2023-10-09 14:09:01,543 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=34300.0, ans=0.125 2023-10-09 14:09:01,849 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.39 vs. 
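The Whitening lines report a per-module statistic compared against a limit (e.g. metric=23.55 vs. limit=22.5 just above): the metric is 1.0 when the module's feature covariance is isotropic (already "white") and grows as the variance concentrates in few directions, with a corrective penalty applying only when the limit is exceeded. The sketch below shows one metric with exactly that behavior; the specific formula is an assumption, and the definition in scaling.py may differ in detail.

```python
# Sketch of a whitening metric with the behavior seen in the log records:
# ~1.0 for white (isotropic) features, larger when energy concentrates.
# The formula itself is an assumption, not lifted from scaling.py.
import torch

def whitening_metric(x: torch.Tensor, num_groups: int) -> torch.Tensor:
    """x: (num_frames, num_channels); channels are split into num_groups."""
    n, c = x.shape
    assert c % num_groups == 0
    x = x.reshape(n, num_groups, c // num_groups).transpose(0, 1)  # (g, n, c/g)
    cov = torch.matmul(x.transpose(1, 2), x) / n  # per-group feature covariance
    mean_diag = cov.diagonal(dim1=1, dim2=2).mean()
    # 1.0 when cov is a multiple of I; up to channels-per-group when rank-1
    return (cov ** 2).mean() / (mean_diag ** 2 / cov.shape[-1])

x = torch.randn(8000, 256)                                       # roughly white
print(whitening_metric(x, num_groups=1))                         # ~1.0
print(whitening_metric(x @ torch.ones(256, 256), num_groups=1))  # ~256, rank-1
```

Different modules use different limits in these records (6.0 for attention keys, 12.0 or 15.0 for feed-forward outputs, 22.5 for self-attention outputs), and a limit can itself be scheduled, as the whiten.whitening_limit record further below shows.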
limit=15.0 2023-10-09 14:09:06,018 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=34300.0, ans=0.1 2023-10-09 14:09:11,487 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=34346.666666666664, ans=0.125 2023-10-09 14:09:15,063 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=34346.666666666664, ans=0.125 2023-10-09 14:10:08,189 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.670e+02 2.266e+02 2.571e+02 2.962e+02 5.200e+02, threshold=5.142e+02, percent-clipped=0.0 2023-10-09 14:10:08,479 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=34533.333333333336, ans=0.1 2023-10-09 14:10:17,947 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=34580.0, ans=0.125 2023-10-09 14:10:20,525 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=34580.0, ans=0.0 2023-10-09 14:10:22,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=34580.0, ans=0.125 2023-10-09 14:10:22,794 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=34580.0, ans=0.2 2023-10-09 14:10:34,189 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.61 vs. limit=5.0 2023-10-09 14:10:36,231 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=34673.333333333336, ans=0.125 2023-10-09 14:10:50,024 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=34720.0, ans=0.0 2023-10-09 14:11:07,941 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=34813.333333333336, ans=0.125 2023-10-09 14:11:11,165 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=34813.333333333336, ans=0.125 2023-10-09 14:11:12,057 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=34813.333333333336, ans=0.1 2023-10-09 14:11:21,819 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=34860.0, ans=0.025 2023-10-09 14:11:52,987 INFO [train.py:1031] (0/4) Epoch 1, batch 7500, loss[loss=0.3804, simple_loss=0.421, pruned_loss=0.1699, over 16740.00 frames. ], tot_loss[loss=0.3875, simple_loss=0.4273, pruned_loss=0.1773, over 32035184.33 frames. 
], batch size: 202, lr: 3.78e-02, grad_scale: 32.0 2023-10-09 14:11:59,174 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=35000.0, ans=0.0 2023-10-09 14:12:01,566 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.900e+02 2.617e+02 2.954e+02 3.488e+02 4.638e+02, threshold=5.909e+02, percent-clipped=0.0 2023-10-09 14:12:14,385 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=35046.666666666664, ans=0.2 2023-10-09 14:12:15,214 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=35093.333333333336, ans=0.0 2023-10-09 14:12:15,362 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=35093.333333333336, ans=0.1 2023-10-09 14:12:18,448 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=35093.333333333336, ans=0.0032405797101449276 2023-10-09 14:12:24,089 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=35093.333333333336, ans=0.1 2023-10-09 14:12:40,449 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=35186.666666666664, ans=0.0 2023-10-09 14:13:14,192 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=35326.666666666664, ans=0.2 2023-10-09 14:13:16,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=35326.666666666664, ans=0.95 2023-10-09 14:13:26,062 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=35373.333333333336, ans=0.0031797101449275358 2023-10-09 14:13:46,484 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=35466.666666666664, ans=0.2 2023-10-09 14:13:50,209 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=35466.666666666664, ans=0.0 2023-10-09 14:13:51,837 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.793e+02 2.430e+02 2.756e+02 3.175e+02 5.846e+02, threshold=5.511e+02, percent-clipped=0.0 2023-10-09 14:13:53,322 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=35466.666666666664, ans=0.125 2023-10-09 14:14:07,466 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.74 vs. 
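Each "Clipping_scale=2.0, grad-norm quartiles ..." record from optim.py reports the min/25%/median/75%/max of recently observed gradient norms together with the clipping threshold in force, and in every record here the threshold is the logged median times Clipping_scale (2.0 × 2.954e+02 ≈ 5.909e+02 and 2.0 × 2.756e+02 ≈ 5.511e+02 in the two records above). A sketch of that rule follows; the class and method names are assumptions, not icefall's actual optimizer API.

```python
# Sketch of median-based gradient clipping consistent with the logged records:
# threshold = clipping_scale * median(recent grad norms), and percent-clipped
# is the fraction of batches whose norm exceeded the threshold.
from collections import deque
import statistics

class MedianClipper:
    def __init__(self, clipping_scale: float = 2.0, window: int = 200):
        self.scale = clipping_scale
        self.norms = deque(maxlen=window)  # recent gradient norms
        self.clipped = 0
        self.seen = 0

    def step(self, grad_norm: float) -> float:
        """Returns the factor to multiply gradients by (<= 1.0)."""
        self.norms.append(grad_norm)
        self.seen += 1
        threshold = self.scale * statistics.median(self.norms)
        if grad_norm > threshold:
            self.clipped += 1
            return threshold / grad_norm
        return 1.0

    def percent_clipped(self) -> float:
        return 100.0 * self.clipped / max(1, self.seen)
```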
limit=15.0 2023-10-09 14:14:13,045 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=35560.0, ans=0.1 2023-10-09 14:14:34,112 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=35606.666666666664, ans=0.2 2023-10-09 14:14:39,506 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=35653.333333333336, ans=0.125 2023-10-09 14:14:57,115 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=35700.0, ans=0.09899494936611666 2023-10-09 14:15:04,881 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=35746.666666666664, ans=0.5 2023-10-09 14:15:15,573 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=35746.666666666664, ans=0.2 2023-10-09 14:15:18,055 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.94 vs. limit=15.0 2023-10-09 14:15:19,035 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=35793.333333333336, ans=0.125 2023-10-09 14:15:19,125 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 14:15:19,951 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=35793.333333333336, ans=0.0030884057971014497 2023-10-09 14:15:31,440 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.19 vs. 
limit=15.0 2023-10-09 14:15:46,343 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=35886.666666666664, ans=0.0 2023-10-09 14:16:00,334 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-09 14:16:04,233 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.696e+02 2.407e+02 2.784e+02 3.305e+02 6.434e+02, threshold=5.568e+02, percent-clipped=1.0 2023-10-09 14:16:28,912 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=36073.333333333336, ans=0.125 2023-10-09 14:16:48,690 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=36120.0, ans=0.125 2023-10-09 14:16:50,632 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=36166.666666666664, ans=0.125 2023-10-09 14:16:58,288 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=36166.666666666664, ans=0.1 2023-10-09 14:17:08,276 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=36213.333333333336, ans=0.125 2023-10-09 14:17:20,661 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=36260.0, ans=0.125 2023-10-09 14:17:33,106 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 14:17:38,136 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=36306.666666666664, ans=0.125 2023-10-09 14:17:38,309 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=36306.666666666664, ans=0.125 2023-10-09 14:17:48,226 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=36353.333333333336, ans=0.0 2023-10-09 14:17:52,433 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=36400.0, ans=0.125 2023-10-09 14:17:54,787 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=36400.0, ans=0.125 2023-10-09 14:17:59,938 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.898e+02 2.500e+02 2.917e+02 3.377e+02 6.020e+02, threshold=5.835e+02, percent-clipped=2.0 2023-10-09 14:18:01,052 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=36400.0, ans=0.07 2023-10-09 14:18:06,607 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=36446.666666666664, ans=0.04949747468305833 2023-10-09 14:18:21,614 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=36493.333333333336, ans=0.002936231884057971 2023-10-09 14:18:31,855 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=36540.0, ans=0.002926086956521738 2023-10-09 14:18:32,869 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=36540.0, ans=0.125 2023-10-09 14:18:52,970 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=36633.333333333336, ans=0.1 2023-10-09 14:19:18,126 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.60 vs. limit=6.0 2023-10-09 14:19:42,017 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=36820.0, ans=0.0 2023-10-09 14:20:03,099 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.939e+02 2.447e+02 2.750e+02 3.178e+02 4.670e+02, threshold=5.500e+02, percent-clipped=0.0 2023-10-09 14:20:44,495 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.88 vs. limit=15.0 2023-10-09 14:20:56,676 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=37100.0, ans=0.125 2023-10-09 14:21:10,945 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=37146.666666666664, ans=0.125 2023-10-09 14:21:11,291 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=17.23 vs. limit=15.0 2023-10-09 14:21:17,631 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=37146.666666666664, ans=0.5 2023-10-09 14:21:27,752 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=37193.333333333336, ans=0.0 2023-10-09 14:21:49,254 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=37286.666666666664, ans=0.07 2023-10-09 14:21:55,541 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=27.06 vs. limit=22.5 2023-10-09 14:21:57,992 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-8000.pt 2023-10-09 14:22:01,636 INFO [train.py:1031] (0/4) Epoch 1, batch 8000, loss[loss=0.3013, simple_loss=0.3694, pruned_loss=0.1166, over 16908.00 frames. ], tot_loss[loss=0.3805, simple_loss=0.4226, pruned_loss=0.1719, over 32201113.52 frames. 
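The checkpoint.py record above writes zipformer/exp_XL_bpe/checkpoint-8000.pt just as the global batch counter reaches 8000, i.e. checkpoints are named by batch index and saved at a fixed batch interval (8000 is inferred from the filename; the configured interval is not shown in these lines). A sketch of that save logic, with assumed function and key names:

```python
# Sketch of batch-indexed periodic checkpointing matching the record above,
# where checkpoint-8000.pt appears as batch_idx_train reaches 8000.
# Interval and names are illustrative assumptions.
from pathlib import Path
import torch

def maybe_save_checkpoint(model, optimizer, batch_idx_train: int,
                          exp_dir: Path, save_every_n: int = 8000) -> None:
    if batch_idx_train > 0 and batch_idx_train % save_every_n == 0:
        path = exp_dir / f"checkpoint-{batch_idx_train}.pt"
        torch.save(
            {
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "batch_idx_train": batch_idx_train,
            },
            path,
        )
```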
], batch size: 123, lr: 3.72e-02, grad_scale: 32.0 2023-10-09 14:22:09,332 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.686e+02 2.324e+02 2.653e+02 3.381e+02 4.972e+02, threshold=5.305e+02, percent-clipped=0.0 2023-10-09 14:22:39,431 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=37473.333333333336, ans=0.125 2023-10-09 14:23:15,127 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=37613.333333333336, ans=0.2 2023-10-09 14:23:15,941 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=37613.333333333336, ans=0.0026927536231884045 2023-10-09 14:23:19,795 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=37613.333333333336, ans=0.125 2023-10-09 14:23:23,156 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 14:23:24,130 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=37660.0, ans=0.0 2023-10-09 14:23:34,579 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.76 vs. limit=8.0 2023-10-09 14:23:57,333 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=37800.0, ans=0.125 2023-10-09 14:24:00,130 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=37800.0, ans=0.0026521739130434784 2023-10-09 14:24:01,987 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.964e+02 2.589e+02 2.920e+02 3.234e+02 4.423e+02, threshold=5.840e+02, percent-clipped=0.0 2023-10-09 14:24:24,534 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=37893.333333333336, ans=0.0 2023-10-09 14:24:25,680 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.49 vs. 
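The lr printed in each 500-batch summary decays smoothly with the batch index: 3.72e-02 at batch 8000 above, then 3.66e-02, 3.60e-02, 3.54e-02, 3.49e-02 at batches 8500 through 10000 below. All five values are reproduced to the printed precision by an Eden-style schedule lr = base_lr · ((step² + n²)/n²)^(-1/4), taking base_lr = 0.045 and n = 7500 as assumed constants that happen to fit (any epoch-dependent factor is ≈ 1 this early in epoch 1). A sketch:

```python
# Sketch of the smooth lr decay visible in the 500-batch summaries.
# Assumption: an Eden-style schedule with base_lr = 0.045 and n = 7500;
# these constants are chosen because they reproduce the printed values,
# not read from this log excerpt.
def eden_lr(step: int, base_lr: float = 0.045, n: float = 7500.0) -> float:
    return base_lr * ((step ** 2 + n ** 2) / n ** 2) ** -0.25

for step in (8000, 8500, 9000, 9500, 10000):
    print(step, f"{eden_lr(step):.2e}")
# 8000 3.72e-02, 8500 3.66e-02, 9000 3.60e-02, 9500 3.54e-02, 10000 3.49e-02
```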
limit=15.0 2023-10-09 14:24:31,311 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=37940.0, ans=0.125 2023-10-09 14:24:54,054 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=38033.333333333336, ans=0.09899494936611666 2023-10-09 14:25:06,788 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=38033.333333333336, ans=0.125 2023-10-09 14:25:11,867 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=38080.0, ans=0.125 2023-10-09 14:25:13,928 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=38080.0, ans=0.025 2023-10-09 14:25:37,672 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=38126.666666666664, ans=0.0025811594202898554 2023-10-09 14:25:44,963 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=38173.333333333336, ans=0.125 2023-10-09 14:25:46,811 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=38173.333333333336, ans=0.0025710144927536234 2023-10-09 14:25:52,200 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=38220.0, ans=0.125 2023-10-09 14:25:59,561 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=38220.0, ans=10.0 2023-10-09 14:26:06,595 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=38266.666666666664, ans=0.0 2023-10-09 14:26:18,338 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.883e+02 2.304e+02 2.730e+02 3.039e+02 3.935e+02, threshold=5.461e+02, percent-clipped=0.0 2023-10-09 14:26:26,755 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=38313.333333333336, ans=0.2 2023-10-09 14:26:30,274 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=38313.333333333336, ans=0.125 2023-10-09 14:26:35,692 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=38360.0, ans=0.0 2023-10-09 14:26:50,468 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=38406.666666666664, ans=0.125 2023-10-09 14:27:09,945 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=38500.0, ans=0.125 2023-10-09 14:27:14,847 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=38500.0, ans=0.125 2023-10-09 14:27:17,646 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.77 vs. 
limit=12.0 2023-10-09 14:27:19,487 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=38546.666666666664, ans=0.125 2023-10-09 14:27:32,909 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=38593.333333333336, ans=0.0024797101449275365 2023-10-09 14:27:33,211 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.69 vs. limit=22.5 2023-10-09 14:28:14,749 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.977e+02 2.320e+02 2.691e+02 3.231e+02 4.759e+02, threshold=5.382e+02, percent-clipped=0.0 2023-10-09 14:28:53,487 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=38920.0, ans=0.2 2023-10-09 14:28:54,507 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 14:29:13,036 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.85 vs. limit=15.0 2023-10-09 14:29:19,457 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=39013.333333333336, ans=0.125 2023-10-09 14:29:20,365 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=39013.333333333336, ans=0.2 2023-10-09 14:29:25,652 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=39060.0, ans=0.1 2023-10-09 14:29:34,992 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=39060.0, ans=0.2 2023-10-09 14:29:35,035 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=39060.0, ans=0.002378260869565217 2023-10-09 14:29:36,450 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=39060.0, ans=0.125 2023-10-09 14:29:39,806 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.38 vs. limit=10.0 2023-10-09 14:29:44,413 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=39106.666666666664, ans=0.125 2023-10-09 14:30:08,390 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.60 vs. 
limit=15.0 2023-10-09 14:30:14,509 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.881e+02 2.488e+02 2.819e+02 3.413e+02 5.214e+02, threshold=5.637e+02, percent-clipped=0.0 2023-10-09 14:30:23,631 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=39246.666666666664, ans=0.125 2023-10-09 14:30:31,795 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=39293.333333333336, ans=0.09899494936611666 2023-10-09 14:30:32,822 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=39293.333333333336, ans=0.125 2023-10-09 14:30:51,954 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.57 vs. limit=15.0 2023-10-09 14:30:54,717 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.80 vs. limit=22.5 2023-10-09 14:30:59,686 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=39386.666666666664, ans=0.2 2023-10-09 14:31:00,954 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.25 vs. limit=12.0 2023-10-09 14:31:08,991 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=39433.333333333336, ans=0.125 2023-10-09 14:31:13,001 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=39433.333333333336, ans=0.2 2023-10-09 14:31:36,879 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=39526.666666666664, ans=0.125 2023-10-09 14:31:38,678 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=39526.666666666664, ans=0.2 2023-10-09 14:31:41,738 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=39573.333333333336, ans=0.2 2023-10-09 14:31:50,233 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=39573.333333333336, ans=0.1 2023-10-09 14:32:08,194 INFO [train.py:1031] (0/4) Epoch 1, batch 8500, loss[loss=0.3918, simple_loss=0.4319, pruned_loss=0.1758, over 16325.00 frames. ], tot_loss[loss=0.3753, simple_loss=0.4194, pruned_loss=0.1677, over 32376797.69 frames. 
], batch size: 50, lr: 3.66e-02, grad_scale: 32.0 2023-10-09 14:32:17,783 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.984e+02 2.643e+02 2.917e+02 3.410e+02 6.077e+02, threshold=5.834e+02, percent-clipped=2.0 2023-10-09 14:32:21,397 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=39713.333333333336, ans=0.125 2023-10-09 14:32:28,831 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=39713.333333333336, ans=0.125 2023-10-09 14:32:51,176 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 14:33:02,125 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=39853.333333333336, ans=0.002205797101449275 2023-10-09 14:33:15,759 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.15 vs. limit=10.0 2023-10-09 14:33:40,375 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=39993.333333333336, ans=0.125 2023-10-09 14:33:45,866 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=40040.0, ans=0.1 2023-10-09 14:34:31,039 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=40133.333333333336, ans=0.1 2023-10-09 14:34:33,200 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.883e+02 2.506e+02 2.837e+02 3.411e+02 4.932e+02, threshold=5.674e+02, percent-clipped=0.0 2023-10-09 14:34:49,550 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 14:34:49,724 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=40226.666666666664, ans=0.125 2023-10-09 14:35:14,433 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.23 vs. 
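Each summary reports the current batch's loss over its own frames (16325.00 frames in the batch-8500 record above) alongside tot_loss over roughly 32.4 million frames. That frame count grows only slightly between summaries and levels off near 2000 times the typical per-batch frame count, which is consistent with a frame-weighted accumulator that decays by (1 − 1/window) per batch with window ≈ 2000, rather than a plain cumulative sum. The exact mechanism is not shown in these lines, so the following is a sketch of that inferred behavior:

```python
# Sketch of a decayed, frame-weighted aggregate consistent with the tot_loss
# frame counts above (~32.4M steady-state frames / ~16k frames per batch
# suggests an effective window of ~2000 batches). Names are illustrative.
class RunningLoss:
    def __init__(self, window: int = 2000):
        self.decay = 1.0 - 1.0 / window
        self.loss_sum = 0.0   # decayed sum of (loss * frames)
        self.frames = 0.0     # decayed frame count

    def update(self, batch_loss: float, batch_frames: float) -> None:
        self.loss_sum = self.loss_sum * self.decay + batch_loss * batch_frames
        self.frames = self.frames * self.decay + batch_frames

    @property
    def tot_loss(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)
```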
limit=15.0 2023-10-09 14:35:19,995 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=40320.0, ans=0.125 2023-10-09 14:35:28,890 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=40366.666666666664, ans=0.125 2023-10-09 14:35:32,412 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=40366.666666666664, ans=0.125 2023-10-09 14:35:49,335 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=40413.333333333336, ans=0.0 2023-10-09 14:35:49,354 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=40413.333333333336, ans=0.0 2023-10-09 14:35:49,402 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=40413.333333333336, ans=0.1 2023-10-09 14:36:38,705 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.88 vs. limit=15.0 2023-10-09 14:36:45,481 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.628e+02 2.105e+02 2.424e+02 2.837e+02 5.498e+02, threshold=4.848e+02, percent-clipped=0.0 2023-10-09 14:37:25,147 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=40786.666666666664, ans=0.05 2023-10-09 14:37:31,230 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=40786.666666666664, ans=0.2 2023-10-09 14:37:58,221 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.92 vs. limit=15.0 2023-10-09 14:38:20,143 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=40973.333333333336, ans=0.125 2023-10-09 14:38:57,994 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.717e+02 2.334e+02 2.630e+02 3.066e+02 4.793e+02, threshold=5.261e+02, percent-clipped=0.0 2023-10-09 14:39:01,072 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.13 vs. limit=15.0 2023-10-09 14:39:05,491 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.19 vs. limit=12.0 2023-10-09 14:39:06,120 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=41113.333333333336, ans=0.2 2023-10-09 14:39:36,621 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=41253.333333333336, ans=0.2 2023-10-09 14:39:52,673 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=41300.0, ans=0.0 2023-10-09 14:39:56,051 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=41346.666666666664, ans=0.1 2023-10-09 14:39:57,476 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.09 vs. 
limit=22.5 2023-10-09 14:40:13,581 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.20 vs. limit=6.0 2023-10-09 14:40:32,573 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=41440.0, ans=0.125 2023-10-09 14:40:49,310 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=41533.333333333336, ans=0.05 2023-10-09 14:40:51,240 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.35 vs. limit=12.0 2023-10-09 14:40:56,103 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.885e+02 2.621e+02 2.989e+02 3.508e+02 5.740e+02, threshold=5.978e+02, percent-clipped=1.0 2023-10-09 14:40:56,472 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=41533.333333333336, ans=0.0018405797101449274 2023-10-09 14:40:58,824 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=41580.0, ans=0.125 2023-10-09 14:41:05,821 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=41580.0, ans=0.125 2023-10-09 14:41:29,925 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.30 vs. limit=15.0 2023-10-09 14:41:32,077 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=41673.333333333336, ans=0.0 2023-10-09 14:41:49,192 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=41766.666666666664, ans=0.125 2023-10-09 14:42:35,100 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=41953.333333333336, ans=0.0017492753623188396 2023-10-09 14:42:44,012 INFO [train.py:1031] (0/4) Epoch 1, batch 9000, loss[loss=0.3491, simple_loss=0.4017, pruned_loss=0.1483, over 16278.00 frames. ], tot_loss[loss=0.3697, simple_loss=0.4153, pruned_loss=0.1637, over 32427525.74 frames. ], batch size: 50, lr: 3.60e-02, grad_scale: 32.0 2023-10-09 14:42:48,461 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=42000.0, ans=0.0017391304347826094 2023-10-09 14:42:52,337 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=42000.0, ans=0.1 2023-10-09 14:42:52,847 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.741e+02 2.488e+02 2.853e+02 3.353e+02 4.545e+02, threshold=5.705e+02, percent-clipped=0.0 2023-10-09 14:43:33,479 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=42186.666666666664, ans=0.1 2023-10-09 14:43:38,559 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.48 vs. 
limit=15.0 2023-10-09 14:43:44,449 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=42233.333333333336, ans=0.125 2023-10-09 14:43:48,154 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=42233.333333333336, ans=0.125 2023-10-09 14:44:40,480 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.766e+02 2.328e+02 2.634e+02 3.216e+02 4.598e+02, threshold=5.268e+02, percent-clipped=0.0 2023-10-09 14:44:58,653 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=42560.0, ans=0.125 2023-10-09 14:45:07,377 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.88 vs. limit=15.0 2023-10-09 14:45:23,981 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=42653.333333333336, ans=0.2 2023-10-09 14:45:24,790 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=42653.333333333336, ans=0.1 2023-10-09 14:45:29,501 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=42700.0, ans=0.125 2023-10-09 14:45:33,012 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=9.36 vs. limit=15.0 2023-10-09 14:45:42,193 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 14:46:04,254 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.50 vs. limit=22.5 2023-10-09 14:46:19,164 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=42886.666666666664, ans=0.0015463768115942036 2023-10-09 14:46:32,239 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.754e+02 2.256e+02 2.626e+02 3.016e+02 5.567e+02, threshold=5.252e+02, percent-clipped=1.0 2023-10-09 14:46:34,086 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.11 vs. limit=22.5 2023-10-09 14:46:49,247 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=43026.666666666664, ans=0.1 2023-10-09 14:46:49,378 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=43026.666666666664, ans=0.2 2023-10-09 14:47:06,613 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.53 vs. 
limit=15.0 2023-10-09 14:47:15,704 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=43120.0, ans=0.125 2023-10-09 14:47:21,215 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=43166.666666666664, ans=0.0014855072463768118 2023-10-09 14:47:22,218 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=43166.666666666664, ans=0.0014855072463768118 2023-10-09 14:47:40,041 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=43213.333333333336, ans=0.025 2023-10-09 14:47:59,428 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=43306.666666666664, ans=0.125 2023-10-09 14:48:08,886 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=43353.333333333336, ans=0.0 2023-10-09 14:48:27,563 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=43400.0, ans=0.0 2023-10-09 14:48:28,228 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.875e+02 2.326e+02 2.606e+02 2.984e+02 6.740e+02, threshold=5.211e+02, percent-clipped=2.0 2023-10-09 14:49:01,733 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.31 vs. limit=22.5 2023-10-09 14:49:10,246 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=43586.666666666664, ans=0.125 2023-10-09 14:49:23,863 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=43633.333333333336, ans=0.1 2023-10-09 14:49:32,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=43680.0, ans=0.0 2023-10-09 14:49:37,107 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=43680.0, ans=0.2 2023-10-09 14:49:47,654 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=43726.666666666664, ans=0.95 2023-10-09 14:49:56,165 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=43726.666666666664, ans=0.0013637681159420299 2023-10-09 14:49:58,831 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=43726.666666666664, ans=0.0 2023-10-09 14:50:05,327 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.38 vs. 
limit=12.0 2023-10-09 14:50:11,708 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=43820.0, ans=0.2 2023-10-09 14:50:19,026 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=43820.0, ans=0.125 2023-10-09 14:50:22,223 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=43820.0, ans=0.1 2023-10-09 14:50:25,879 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.67 vs. limit=22.5 2023-10-09 14:50:33,050 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.664e+02 2.309e+02 2.707e+02 3.237e+02 5.607e+02, threshold=5.413e+02, percent-clipped=2.0 2023-10-09 14:50:39,829 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=43913.333333333336, ans=0.125 2023-10-09 14:50:52,962 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.95 vs. limit=22.5 2023-10-09 14:50:59,613 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=44006.666666666664, ans=0.125 2023-10-09 14:51:05,096 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.13 vs. limit=15.0 2023-10-09 14:51:12,319 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=44053.333333333336, ans=0.1 2023-10-09 14:51:35,241 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=44146.666666666664, ans=0.125 2023-10-09 14:52:01,138 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=44240.0, ans=0.125 2023-10-09 14:52:21,662 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.82 vs. limit=15.0 2023-10-09 14:52:22,484 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=44286.666666666664, ans=0.2 2023-10-09 14:52:23,947 INFO [train.py:1031] (0/4) Epoch 1, batch 9500, loss[loss=0.3722, simple_loss=0.4229, pruned_loss=0.1607, over 16834.00 frames. ], tot_loss[loss=0.3673, simple_loss=0.414, pruned_loss=0.1616, over 32527873.05 frames. 
], batch size: 188, lr: 3.54e-02, grad_scale: 32.0 2023-10-09 14:52:26,868 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=44333.333333333336, ans=0.125 2023-10-09 14:52:32,784 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.684e+02 2.425e+02 2.907e+02 3.525e+02 5.224e+02, threshold=5.814e+02, percent-clipped=0.0 2023-10-09 14:52:38,601 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=44380.0, ans=0.125 2023-10-09 14:52:43,126 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=44380.0, ans=0.2 2023-10-09 14:52:54,867 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.08 vs. limit=15.0 2023-10-09 14:52:54,968 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.63 vs. limit=22.5 2023-10-09 14:53:00,136 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=44473.333333333336, ans=0.125 2023-10-09 14:53:09,625 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=44473.333333333336, ans=0.0012014492753623183 2023-10-09 14:53:19,523 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=44520.0, ans=0.05 2023-10-09 14:53:26,760 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=44566.666666666664, ans=0.2 2023-10-09 14:53:40,734 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.48 vs. limit=6.0 2023-10-09 14:54:31,153 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=44800.0, ans=0.2 2023-10-09 14:54:54,913 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.821e+02 2.395e+02 2.754e+02 3.361e+02 6.650e+02, threshold=5.508e+02, percent-clipped=2.0 2023-10-09 14:55:26,632 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.49 vs. limit=15.0 2023-10-09 14:55:38,628 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=44986.666666666664, ans=0.0 2023-10-09 14:56:05,575 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 14:56:26,235 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.64 vs. 
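In every per-batch loss[...] record in this stretch, loss equals 0.5 · simple_loss + pruned_loss to the printed precision: 0.5 · 0.4229 + 0.1607 = 0.37215 ≈ 0.3722 in the batch-9500 record above, and 0.5 · 0.4477 + 0.2317 = 0.45555 ≈ 0.4555 at batch 10000 below. The tot_loss aggregate does not satisfy the identity exactly, consistent with the weights having ramped during earlier warmup batches still inside the running average. The 0.5 weight here is read off the logged numbers, not from any configuration shown in these lines; in pruned-transducer training such scales typically ramp over warmup, so treat this as the post-warmup form:

```python
# Sketch of the loss combination implied by the printed numbers: each batch
# record here satisfies loss = 0.5 * simple_loss + pruned_loss. The 0.5 and
# 1.0 scales are inferred from the log, not taken from config.
def combine_losses(simple_loss: float, pruned_loss: float,
                   simple_scale: float = 0.5, pruned_scale: float = 1.0) -> float:
    return simple_scale * simple_loss + pruned_scale * pruned_loss

print(combine_losses(0.4229, 0.1607))  # 0.37215 -> logged as 0.3722
```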
limit=15.0 2023-10-09 14:56:29,399 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=45173.333333333336, ans=0.125 2023-10-09 14:56:40,010 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=45220.0, ans=0.1 2023-10-09 14:56:42,578 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=45266.666666666664, ans=0.125 2023-10-09 14:56:54,195 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.698e+02 2.264e+02 2.576e+02 2.952e+02 4.704e+02, threshold=5.152e+02, percent-clipped=0.0 2023-10-09 14:56:57,289 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=45313.333333333336, ans=0.125 2023-10-09 14:56:59,619 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=45313.333333333336, ans=0.2 2023-10-09 14:57:01,328 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=45313.333333333336, ans=0.125 2023-10-09 14:57:01,413 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=45313.333333333336, ans=0.0 2023-10-09 14:57:04,057 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=45313.333333333336, ans=0.125 2023-10-09 14:57:07,828 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.86 vs. limit=22.5 2023-10-09 14:57:13,771 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=45360.0, ans=0.1 2023-10-09 14:57:16,189 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=45360.0, ans=0.125 2023-10-09 14:57:18,700 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.97 vs. limit=12.0 2023-10-09 14:57:27,142 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=45406.666666666664, ans=0.0009985507246376823 2023-10-09 14:57:34,415 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=45453.333333333336, ans=0.5 2023-10-09 14:57:34,825 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.12 vs. limit=15.0 2023-10-09 14:57:48,201 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=45500.0, ans=0.125 2023-10-09 14:57:55,356 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.31 vs. limit=15.0 2023-10-09 14:57:59,281 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=45546.666666666664, ans=0.125 2023-10-09 14:58:15,049 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.35 vs. 
limit=15.0 2023-10-09 14:58:18,803 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=45640.0, ans=0.0 2023-10-09 14:58:23,083 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=45640.0, ans=0.035 2023-10-09 14:58:24,048 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=45640.0, ans=0.0009478260869565207 2023-10-09 14:58:30,797 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=45686.666666666664, ans=0.125 2023-10-09 14:58:34,737 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.83 vs. limit=15.0 2023-10-09 14:58:49,326 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.718e+02 2.431e+02 2.945e+02 3.236e+02 4.641e+02, threshold=5.891e+02, percent-clipped=0.0 2023-10-09 14:58:50,892 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.31 vs. limit=15.0 2023-10-09 14:58:59,242 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=45780.0, ans=0.1 2023-10-09 14:59:16,757 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=45873.333333333336, ans=0.125 2023-10-09 14:59:33,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=45920.0, ans=0.0 2023-10-09 14:59:35,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=45920.0, ans=0.125 2023-10-09 14:59:38,416 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.56 vs. limit=10.0 2023-10-09 14:59:42,478 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=45966.666666666664, ans=0.09899494936611666 2023-10-09 14:59:43,912 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.31 vs. limit=15.0 2023-10-09 14:59:50,591 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=46013.333333333336, ans=0.05 2023-10-09 15:00:23,568 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.83 vs. limit=15.0 2023-10-09 15:00:30,489 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=46153.333333333336, ans=0.0008362318840579707 2023-10-09 15:00:37,579 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.50 vs. 
limit=15.0 2023-10-09 15:00:47,055 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.876e+02 2.469e+02 2.886e+02 3.459e+02 7.037e+02, threshold=5.772e+02, percent-clipped=1.0 2023-10-09 15:00:51,754 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=46246.666666666664, ans=0.1 2023-10-09 15:01:00,263 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=46293.333333333336, ans=0.1 2023-10-09 15:01:17,252 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.81 vs. limit=6.0 2023-10-09 15:01:24,635 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten.whitening_limit, batch_count=46386.666666666664, ans=22.5 2023-10-09 15:01:41,937 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=46480.0, ans=0.1 2023-10-09 15:01:42,095 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=46480.0, ans=0.125 2023-10-09 15:01:53,476 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=46526.666666666664, ans=0.0 2023-10-09 15:01:58,666 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=46526.666666666664, ans=0.125 2023-10-09 15:02:15,657 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.23 vs. limit=15.0 2023-10-09 15:02:17,179 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=46620.0, ans=0.125 2023-10-09 15:02:21,415 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.85 vs. limit=6.0 2023-10-09 15:02:23,115 INFO [train.py:1031] (0/4) Epoch 1, batch 10000, loss[loss=0.4555, simple_loss=0.4477, pruned_loss=0.2317, over 15590.00 frames. ], tot_loss[loss=0.3627, simple_loss=0.4104, pruned_loss=0.1584, over 32566769.93 frames. ], batch size: 350, lr: 3.49e-02, grad_scale: 32.0 2023-10-09 15:02:23,567 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.05 vs. limit=15.0 2023-10-09 15:02:24,715 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=46666.666666666664, ans=0.125 2023-10-09 15:02:32,781 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=46666.666666666664, ans=0.0007246376811594207 2023-10-09 15:02:34,156 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.815e+02 2.382e+02 2.689e+02 3.468e+02 5.379e+02, threshold=5.378e+02, percent-clipped=0.0 2023-10-09 15:02:52,104 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.49 vs. 
limit=22.5 2023-10-09 15:02:54,427 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=46760.0, ans=0.1 2023-10-09 15:03:13,536 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.62 vs. limit=10.0 2023-10-09 15:03:27,259 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=46900.0, ans=0.0006739130434782609 2023-10-09 15:03:34,349 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=46946.666666666664, ans=0.125 2023-10-09 15:03:34,441 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=46946.666666666664, ans=0.2 2023-10-09 15:03:48,720 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=46993.333333333336, ans=10.0 2023-10-09 15:03:49,019 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.06 vs. limit=6.0 2023-10-09 15:03:55,980 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=46993.333333333336, ans=0.0 2023-10-09 15:04:15,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=47086.666666666664, ans=0.0 2023-10-09 15:04:35,100 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=47133.333333333336, ans=0.125 2023-10-09 15:04:38,599 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=47133.333333333336, ans=0.125 2023-10-09 15:04:42,104 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.606e+02 2.373e+02 2.713e+02 3.157e+02 6.043e+02, threshold=5.425e+02, percent-clipped=1.0 2023-10-09 15:04:49,472 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=47180.0, ans=0.125 2023-10-09 15:04:53,824 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=47226.666666666664, ans=0.1 2023-10-09 15:05:06,706 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.88 vs. limit=12.0 2023-10-09 15:05:14,831 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.19 vs. limit=22.5 2023-10-09 15:05:29,633 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=47320.0, ans=0.1 2023-10-09 15:05:32,388 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=47366.666666666664, ans=0.2 2023-10-09 15:05:40,099 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.47 vs. 
limit=15.0 2023-10-09 15:06:24,937 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=47553.333333333336, ans=0.125 2023-10-09 15:06:36,744 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=47600.0, ans=0.125 2023-10-09 15:06:44,215 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.98 vs. limit=15.0 2023-10-09 15:06:44,342 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.732e+02 2.304e+02 2.627e+02 3.209e+02 6.157e+02, threshold=5.254e+02, percent-clipped=2.0 2023-10-09 15:07:00,901 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.98 vs. limit=22.5 2023-10-09 15:07:07,188 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=47693.333333333336, ans=0.125 2023-10-09 15:07:23,358 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=47740.0, ans=0.125 2023-10-09 15:07:30,497 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=4.89 vs. limit=15.0 2023-10-09 15:07:43,784 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.66 vs. limit=10.0 2023-10-09 15:08:01,394 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=47833.333333333336, ans=0.00047101449275362313 2023-10-09 15:08:27,314 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=47926.666666666664, ans=0.09899494936611666 2023-10-09 15:08:35,815 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_na.min_abs, batch_count=47973.333333333336, ans=0.02 2023-10-09 15:08:39,089 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=47973.333333333336, ans=0.1 2023-10-09 15:08:47,020 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=48020.0, ans=0.125 2023-10-09 15:08:47,196 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.36 vs. 
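[Editor's note] Each "Whitening" line compares a measured statistic of a layer's activations against a scheduled limit. One definition consistent with these logs is the ratio mean(λ²)/mean(λ)² over the eigenvalues of the feature covariance: it equals 1.0 for perfectly "white" (isotropic) features and grows as variance concentrates in few directions. Treat the code as a sketch of the statistic only; the actual scaling.py module also handles `num_groups` and applies a gradient penalty when the metric exceeds the limit:

```python
import torch

def whitening_metric(x: torch.Tensor) -> torch.Tensor:
    """mean(eig^2) / mean(eig)^2 of the feature covariance, computed via
    traces: tr(C^2)/n over (tr(C)/n)^2, which is 1.0 iff C is a multiple
    of the identity."""
    x = x.reshape(-1, x.shape[-1])                 # (frames, channels)
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.T @ x) / x.shape[0]                   # (channels, channels)
    mean_eig = torch.diagonal(cov).mean()          # tr(C)/n
    mean_eig_sq = torch.diagonal(cov @ cov).mean() # tr(C^2)/n
    return mean_eig_sq / (mean_eig ** 2 + 1e-20)

x = torch.randn(1000, 192) @ torch.randn(192, 192)  # correlated channels
print(float(whitening_metric(x)))  # >> 1, i.e. far from "white"
```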
limit=15.0 2023-10-09 15:08:49,395 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=48020.0, ans=0.125 2023-10-09 15:09:09,578 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.752e+02 2.269e+02 2.617e+02 2.997e+02 4.172e+02, threshold=5.235e+02, percent-clipped=0.0 2023-10-09 15:09:38,481 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=48206.666666666664, ans=0.1 2023-10-09 15:09:53,784 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=48253.333333333336, ans=0.125 2023-10-09 15:09:59,686 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=48253.333333333336, ans=0.125 2023-10-09 15:10:13,247 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=48300.0, ans=0.125 2023-10-09 15:10:17,173 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=48346.666666666664, ans=0.1 2023-10-09 15:10:17,185 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=48346.666666666664, ans=0.125 2023-10-09 15:10:34,577 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=48393.333333333336, ans=0.0 2023-10-09 15:11:06,407 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=48486.666666666664, ans=0.125 2023-10-09 15:11:24,758 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.781e+02 2.304e+02 2.627e+02 3.071e+02 4.931e+02, threshold=5.255e+02, percent-clipped=0.0 2023-10-09 15:11:33,282 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=48580.0, ans=0.125 2023-10-09 15:11:52,706 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=48673.333333333336, ans=0.00028840579710144934 2023-10-09 15:12:00,242 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=48673.333333333336, ans=0.0 2023-10-09 15:12:06,264 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.37 vs. limit=10.0 2023-10-09 15:12:21,280 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 15:12:21,434 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=48766.666666666664, ans=0.2 2023-10-09 15:12:29,412 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=48813.333333333336, ans=0.2 2023-10-09 15:12:31,361 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.56 vs. 
limit=15.0 2023-10-09 15:12:34,602 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=48813.333333333336, ans=0.2 2023-10-09 15:12:36,124 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.57 vs. limit=5.0 2023-10-09 15:12:43,925 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=48860.0, ans=0.0 2023-10-09 15:13:07,831 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 15:13:15,234 INFO [train.py:1031] (0/4) Epoch 1, batch 10500, loss[loss=0.3392, simple_loss=0.3889, pruned_loss=0.1448, over 16887.00 frames. ], tot_loss[loss=0.3594, simple_loss=0.4086, pruned_loss=0.1558, over 32646626.85 frames. ], batch size: 110, lr: 3.43e-02, grad_scale: 32.0 2023-10-09 15:13:21,247 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=49000.0, ans=0.00021739130434782553 2023-10-09 15:13:24,472 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.654e+02 2.384e+02 2.770e+02 3.489e+02 5.638e+02, threshold=5.540e+02, percent-clipped=1.0 2023-10-09 15:13:24,713 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=49046.666666666664, ans=0.0 2023-10-09 15:13:31,063 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=49046.666666666664, ans=0.125 2023-10-09 15:14:14,437 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=49140.0, ans=0.04949747468305833 2023-10-09 15:14:17,025 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=49186.666666666664, ans=0.0 2023-10-09 15:14:31,512 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=49233.333333333336, ans=0.125 2023-10-09 15:14:35,620 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=49233.333333333336, ans=0.125 2023-10-09 15:14:37,358 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=49280.0, ans=0.2 2023-10-09 15:14:47,912 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=49280.0, ans=0.125 2023-10-09 15:14:54,820 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=49326.666666666664, ans=0.125 2023-10-09 15:15:11,892 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.47 vs. limit=22.5 2023-10-09 15:15:13,940 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.05 vs. 
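[Editor's note] The per-batch `loss[...]` numbers in the batch-10500 summary above fit `loss = simple_loss_scale * simple_loss + pruned_loss`, with `simple_loss_scale = 0.5` from this run's configuration; the pruned term's warmup weighting is assumed to have reached 1.0 long before batch 10000 (warm_step is 2000). A quick consistency check against three logged batches (the latter two appear further down):

```python
# Per-batch values logged at batches 10500, 11000 and 12000.
simple_loss_scale = 0.5
for loss, simple, pruned in [(0.3392, 0.3889, 0.1448),   # batch 10500
                             (0.3543, 0.4154, 0.1467),   # batch 11000
                             (0.3336, 0.3877, 0.1397)]:  # batch 12000
    assert abs(loss - (simple_loss_scale * simple + pruned)) < 5e-4
```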
limit=22.5 2023-10-09 15:15:33,244 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=49420.0, ans=0.125 2023-10-09 15:15:47,697 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.676e+02 2.388e+02 2.777e+02 3.279e+02 5.433e+02, threshold=5.553e+02, percent-clipped=0.0 2023-10-09 15:15:52,642 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=49513.333333333336, ans=0.125 2023-10-09 15:15:53,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=49513.333333333336, ans=0.125 2023-10-09 15:16:02,194 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.80 vs. limit=6.0 2023-10-09 15:16:31,438 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=49653.333333333336, ans=0.0 2023-10-09 15:16:33,565 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=49653.333333333336, ans=7.536231884057963e-05 2023-10-09 15:16:35,902 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=49653.333333333336, ans=0.125 2023-10-09 15:16:37,040 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=26.44 vs. limit=22.5 2023-10-09 15:16:51,127 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=49700.0, ans=0.125 2023-10-09 15:17:01,462 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=49746.666666666664, ans=5.507246376811742e-05 2023-10-09 15:17:03,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=49746.666666666664, ans=0.1 2023-10-09 15:17:17,156 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=49793.333333333336, ans=0.1 2023-10-09 15:17:49,740 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=49933.333333333336, ans=0.125 2023-10-09 15:18:01,852 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.743e+02 2.307e+02 2.723e+02 3.167e+02 4.341e+02, threshold=5.446e+02, percent-clipped=0.0 2023-10-09 15:18:02,393 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.93 vs. 
limit=15.0 2023-10-09 15:18:17,986 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=50026.666666666664, ans=0.5 2023-10-09 15:18:19,578 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=50026.666666666664, ans=0.125 2023-10-09 15:18:27,136 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=50073.333333333336, ans=0.125 2023-10-09 15:18:50,232 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=50166.666666666664, ans=0.1 2023-10-09 15:19:12,693 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=50213.333333333336, ans=0.0 2023-10-09 15:19:14,646 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=50260.0, ans=0.0 2023-10-09 15:19:35,352 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=50306.666666666664, ans=0.125 2023-10-09 15:19:35,640 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.57 vs. limit=15.0 2023-10-09 15:19:36,231 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=50306.666666666664, ans=0.2 2023-10-09 15:19:59,427 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.040e+02 2.704e+02 2.998e+02 3.579e+02 5.582e+02, threshold=5.996e+02, percent-clipped=1.0 2023-10-09 15:20:06,826 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.17 vs. limit=15.0 2023-10-09 15:20:24,424 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.71 vs. limit=10.0 2023-10-09 15:20:44,088 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=50586.666666666664, ans=0.1 2023-10-09 15:21:35,098 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=50726.666666666664, ans=0.0 2023-10-09 15:21:41,259 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=50726.666666666664, ans=0.0 2023-10-09 15:22:04,259 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.66 vs. limit=22.5 2023-10-09 15:22:15,061 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.737e+02 2.190e+02 2.491e+02 3.011e+02 5.477e+02, threshold=4.981e+02, percent-clipped=0.0 2023-10-09 15:22:27,381 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.68 vs. 
limit=22.5 2023-10-09 15:22:30,754 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=50913.333333333336, ans=0.125 2023-10-09 15:22:33,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=50960.0, ans=0.0 2023-10-09 15:22:36,487 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 15:22:37,708 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=5.61 vs. limit=12.0 2023-10-09 15:22:47,782 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=51006.666666666664, ans=0.125 2023-10-09 15:22:52,723 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.23 vs. limit=15.0 2023-10-09 15:23:07,869 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.38 vs. limit=8.0 2023-10-09 15:23:39,546 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=51146.666666666664, ans=0.1 2023-10-09 15:23:47,850 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=51193.333333333336, ans=0.125 2023-10-09 15:24:15,643 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=51286.666666666664, ans=0.125 2023-10-09 15:24:19,535 INFO [train.py:1031] (0/4) Epoch 1, batch 11000, loss[loss=0.3543, simple_loss=0.4154, pruned_loss=0.1467, over 16831.00 frames. ], tot_loss[loss=0.3563, simple_loss=0.4063, pruned_loss=0.1538, over 32628786.62 frames. ], batch size: 146, lr: 3.38e-02, grad_scale: 16.0 2023-10-09 15:24:30,485 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.830e+02 2.405e+02 2.793e+02 3.441e+02 6.201e+02, threshold=5.586e+02, percent-clipped=7.0 2023-10-09 15:24:33,126 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=51380.0, ans=0.125 2023-10-09 15:24:35,827 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=9.78 vs. limit=15.0 2023-10-09 15:24:43,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=51426.666666666664, ans=0.125 2023-10-09 15:24:47,980 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.49 vs. 
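[Editor's note] `grad_scale` in the batch-11000 summary above has dropped from 32.0 to 16.0 (and is back at 32.0 by batch 11500): with `use_fp16: True`, the loss scale is adjusted dynamically, halved when an overflow is detected and grown back after a run of clean steps. A toy loop showing the mechanism with PyTorch's stock `GradScaler`; the model, optimizer, and data here are stand-ins (the run itself uses the Zipformer with icefall's own optimizer), and the printed scale is presumably what the log reports as `grad_scale`:

```python
import torch
from torch.cuda.amp import GradScaler, autocast

if torch.cuda.is_available():                # fp16 autocast needs a GPU
    model = torch.nn.Linear(80, 500).cuda()  # toy stand-in
    opt = torch.optim.SGD(model.parameters(), lr=0.045)
    scaler = GradScaler()
    for _ in range(3):
        x = torch.randn(8, 80, device="cuda")
        opt.zero_grad()
        with autocast():                     # fp16 forward
            loss = model(x).square().mean()
        scaler.scale(loss).backward()        # scale up to avoid fp16 underflow
        scaler.step(opt)                     # unscales; skips step on inf/nan
        scaler.update()                      # halve on overflow, else grow
    print(scaler.get_scale())
```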
limit=15.0 2023-10-09 15:24:55,135 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=51473.333333333336, ans=0.0 2023-10-09 15:25:01,017 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=51473.333333333336, ans=0.125 2023-10-09 15:25:17,358 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=51566.666666666664, ans=0.0 2023-10-09 15:25:24,194 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=51566.666666666664, ans=0.125 2023-10-09 15:25:24,212 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=51566.666666666664, ans=0.125 2023-10-09 15:25:40,245 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=51613.333333333336, ans=0.0 2023-10-09 15:25:40,305 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=51613.333333333336, ans=0.125 2023-10-09 15:25:48,490 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=51660.0, ans=0.2 2023-10-09 15:25:57,209 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.21 vs. limit=10.0 2023-10-09 15:26:09,723 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=51753.333333333336, ans=0.07 2023-10-09 15:26:13,578 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=51753.333333333336, ans=0.125 2023-10-09 15:26:29,159 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=16.89 vs. limit=15.0 2023-10-09 15:26:33,919 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.661e+02 2.204e+02 2.697e+02 3.012e+02 5.561e+02, threshold=5.394e+02, percent-clipped=0.0 2023-10-09 15:26:45,352 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=51893.333333333336, ans=0.035 2023-10-09 15:26:45,491 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=51893.333333333336, ans=0.125 2023-10-09 15:27:11,221 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=51940.0, ans=10.0 2023-10-09 15:28:15,547 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=52126.666666666664, ans=0.0 2023-10-09 15:29:08,048 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=52173.333333333336, ans=0.125 2023-10-09 15:29:14,635 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.49 vs. 
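[Editor's note] The recurring `bypass.skip_rate`, `bypass.scale_min`, and `bypass_mid.scale_min` entries (e.g. ans=0.07 and ans=0.2 above) parameterize the residual bypass around each encoder layer. A concept sketch only, assuming a learned input/output mix with a floored scale and an occasional full skip during training; the Zipformer's actual BypassModule (per-channel scales, where the skip is applied) lives in the model code:

```python
import torch

class Bypass(torch.nn.Module):
    """Mix a submodule's output with its input: y = x + s * (f(x) - x),
    with s clamped to [scale_min, 1], and (in training) the submodule
    skipped entirely with probability skip_rate."""
    def __init__(self, module, channels, scale_min=0.2, skip_rate=0.035):
        super().__init__()
        self.module = module
        self.scale = torch.nn.Parameter(torch.ones(channels))
        self.scale_min, self.skip_rate = scale_min, skip_rate

    def forward(self, x):
        if self.training and torch.rand(()) < self.skip_rate:
            return x                          # skip the whole submodule
        s = self.scale.clamp(min=self.scale_min, max=1.0)
        return x + s * (self.module(x) - x)   # s=1 -> full module output

layer = Bypass(torch.nn.Linear(192, 192), channels=192)
y = layer(torch.randn(10, 192))
```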
limit=15.0 2023-10-09 15:29:16,611 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=52220.0, ans=0.0 2023-10-09 15:29:19,996 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=52220.0, ans=0.1 2023-10-09 15:29:36,899 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=52266.666666666664, ans=0.125 2023-10-09 15:29:39,186 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.543e+02 2.312e+02 2.746e+02 3.341e+02 5.183e+02, threshold=5.492e+02, percent-clipped=0.0 2023-10-09 15:29:40,523 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=52313.333333333336, ans=0.0 2023-10-09 15:29:50,382 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=52360.0, ans=0.125 2023-10-09 15:29:50,435 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=52360.0, ans=0.02 2023-10-09 15:29:53,771 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=52360.0, ans=0.1 2023-10-09 15:30:00,378 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=52406.666666666664, ans=0.1 2023-10-09 15:30:00,431 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=52406.666666666664, ans=0.0 2023-10-09 15:30:22,618 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=52500.0, ans=0.2 2023-10-09 15:30:24,368 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=52500.0, ans=0.1 2023-10-09 15:31:09,920 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.51 vs. limit=10.0 2023-10-09 15:31:18,034 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=52686.666666666664, ans=0.125 2023-10-09 15:31:45,312 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.40 vs. limit=10.0 2023-10-09 15:32:06,039 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.799e+02 2.306e+02 2.674e+02 3.067e+02 4.251e+02, threshold=5.347e+02, percent-clipped=0.0 2023-10-09 15:32:18,430 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=52826.666666666664, ans=0.125 2023-10-09 15:32:28,378 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.52 vs. 
limit=15.0 2023-10-09 15:32:30,034 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=52873.333333333336, ans=0.125 2023-10-09 15:32:32,526 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten.whitening_limit, batch_count=52873.333333333336, ans=22.5 2023-10-09 15:32:58,183 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=52966.666666666664, ans=10.0 2023-10-09 15:33:01,571 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=52966.666666666664, ans=0.125 2023-10-09 15:33:02,868 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.60 vs. limit=15.0 2023-10-09 15:33:11,386 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.16 vs. limit=22.5 2023-10-09 15:33:20,102 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=53060.0, ans=0.125 2023-10-09 15:33:24,582 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=53060.0, ans=0.0 2023-10-09 15:33:32,433 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=10.98 vs. limit=15.0 2023-10-09 15:33:39,172 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=53106.666666666664, ans=0.0 2023-10-09 15:33:51,330 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.96 vs. limit=15.0 2023-10-09 15:33:53,953 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.16 vs. limit=15.0 2023-10-09 15:34:00,175 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=53200.0, ans=0.0 2023-10-09 15:34:02,937 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=53200.0, ans=0.125 2023-10-09 15:34:03,034 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=53200.0, ans=0.0 2023-10-09 15:34:06,380 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.756e+02 2.296e+02 2.667e+02 3.253e+02 5.370e+02, threshold=5.335e+02, percent-clipped=1.0 2023-10-09 15:34:06,785 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=53246.666666666664, ans=0.0 2023-10-09 15:34:31,089 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=53340.0, ans=0.1 2023-10-09 15:34:52,868 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.14 vs. 
limit=15.0 2023-10-09 15:35:11,896 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=53480.0, ans=0.2 2023-10-09 15:35:43,543 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=53620.0, ans=0.0 2023-10-09 15:35:50,141 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.40 vs. limit=12.0 2023-10-09 15:35:51,460 INFO [train.py:1031] (0/4) Epoch 1, batch 11500, loss[loss=0.3258, simple_loss=0.3831, pruned_loss=0.1343, over 16628.00 frames. ], tot_loss[loss=0.3527, simple_loss=0.4037, pruned_loss=0.1513, over 32675927.87 frames. ], batch size: 51, lr: 3.33e-02, grad_scale: 32.0 2023-10-09 15:35:51,709 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=53666.666666666664, ans=0.0 2023-10-09 15:35:55,434 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=53666.666666666664, ans=0.0 2023-10-09 15:36:04,076 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.760e+02 2.420e+02 2.806e+02 3.329e+02 6.853e+02, threshold=5.613e+02, percent-clipped=1.0 2023-10-09 15:36:05,262 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=53713.333333333336, ans=0.0 2023-10-09 15:36:23,831 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=53760.0, ans=0.125 2023-10-09 15:36:44,990 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.10 vs. limit=22.5 2023-10-09 15:36:49,473 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=53853.333333333336, ans=0.0 2023-10-09 15:37:05,849 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=53946.666666666664, ans=0.125 2023-10-09 15:37:14,031 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=53946.666666666664, ans=0.2 2023-10-09 15:37:26,016 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=53993.333333333336, ans=0.125 2023-10-09 15:37:30,745 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=54040.0, ans=0.025 2023-10-09 15:38:00,443 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.40 vs. 
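[Editor's note] The many `balancer*.prob`, `min_positive`, `max_abs`, and `min_abs` knobs above belong to modules that keep per-channel activation statistics in range (fraction of positive values, mean absolute value), intervening with probability `prob` (the ubiquitous ans=0.125). The sketch below only measures the violations those knobs describe; the real module enforces the constraints by modifying gradients in the backward pass:

```python
import torch

def balancer_violation(x: torch.Tensor,
                       min_positive: float = 0.05,
                       max_abs: float = 10.0) -> torch.Tensor:
    """Sum of per-channel constraint violations: channels whose fraction
    of positive activations falls below min_positive, or whose mean |x|
    exceeds max_abs."""
    frac_pos = (x > 0).float().mean(dim=0)   # per-channel P(x > 0)
    mean_abs = x.abs().mean(dim=0)           # per-channel E|x|
    p1 = (min_positive - frac_pos).clamp(min=0).sum()
    p2 = (mean_abs - max_abs).clamp(min=0).sum()
    return p1 + p2

print(float(balancer_violation(torch.randn(100, 192))))  # ~0 for randn
```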
limit=15.0 2023-10-09 15:38:06,745 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.641e+02 2.295e+02 2.608e+02 3.083e+02 5.065e+02, threshold=5.215e+02, percent-clipped=0.0 2023-10-09 15:38:26,844 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=54226.666666666664, ans=0.1 2023-10-09 15:38:32,847 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=54273.333333333336, ans=0.0 2023-10-09 15:38:40,095 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.85 vs. limit=15.0 2023-10-09 15:38:45,843 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=54320.0, ans=0.125 2023-10-09 15:38:45,889 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=54320.0, ans=0.2 2023-10-09 15:38:59,541 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=54366.666666666664, ans=0.125 2023-10-09 15:39:24,362 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=54460.0, ans=0.2 2023-10-09 15:39:38,077 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=54553.333333333336, ans=0.125 2023-10-09 15:39:51,414 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.78 vs. limit=15.0 2023-10-09 15:39:58,808 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.741e+02 2.292e+02 2.658e+02 3.262e+02 6.196e+02, threshold=5.317e+02, percent-clipped=1.0 2023-10-09 15:40:37,401 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=54786.666666666664, ans=0.1 2023-10-09 15:40:44,421 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=54786.666666666664, ans=0.0 2023-10-09 15:41:08,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=54880.0, ans=0.125 2023-10-09 15:41:14,129 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=54880.0, ans=0.0 2023-10-09 15:41:17,955 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=54926.666666666664, ans=0.125 2023-10-09 15:41:25,887 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=54926.666666666664, ans=0.125 2023-10-09 15:41:28,127 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=54926.666666666664, ans=0.0 2023-10-09 15:41:29,986 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=54973.333333333336, ans=0.0 2023-10-09 15:42:06,940 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=55066.666666666664, ans=0.0 2023-10-09 15:42:08,926 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.785e+02 2.279e+02 
2.660e+02 3.339e+02 5.764e+02, threshold=5.319e+02, percent-clipped=2.0 2023-10-09 15:42:16,647 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.71 vs. limit=15.0 2023-10-09 15:42:31,371 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.96 vs. limit=15.0 2023-10-09 15:42:46,168 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 15:42:46,176 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=55253.333333333336, ans=0.125 2023-10-09 15:43:04,012 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=55300.0, ans=0.0 2023-10-09 15:43:09,810 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=55346.666666666664, ans=0.125 2023-10-09 15:43:15,179 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=55346.666666666664, ans=0.1 2023-10-09 15:43:31,592 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=55393.333333333336, ans=0.0 2023-10-09 15:43:55,616 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=55486.666666666664, ans=0.125 2023-10-09 15:43:56,485 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=55486.666666666664, ans=0.0 2023-10-09 15:44:01,409 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=55533.333333333336, ans=0.125 2023-10-09 15:44:11,926 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.912e+02 2.365e+02 2.721e+02 3.221e+02 4.932e+02, threshold=5.442e+02, percent-clipped=0.0 2023-10-09 15:44:12,348 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=55580.0, ans=0.0 2023-10-09 15:44:30,899 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.95 vs. 
limit=6.0 2023-10-09 15:44:35,521 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=55626.666666666664, ans=0.0 2023-10-09 15:44:59,851 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=55720.0, ans=0.2 2023-10-09 15:45:36,455 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=55813.333333333336, ans=0.125 2023-10-09 15:45:43,509 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=55860.0, ans=0.0 2023-10-09 15:45:43,528 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=55860.0, ans=0.125 2023-10-09 15:45:50,702 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-09 15:45:52,045 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.26 vs. limit=6.0 2023-10-09 15:45:58,278 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=55906.666666666664, ans=0.125 2023-10-09 15:45:59,315 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=55906.666666666664, ans=0.0 2023-10-09 15:46:07,861 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=55953.333333333336, ans=0.0 2023-10-09 15:46:12,767 INFO [train.py:1031] (0/4) Epoch 1, batch 12000, loss[loss=0.3336, simple_loss=0.3877, pruned_loss=0.1397, over 16338.00 frames. ], tot_loss[loss=0.3499, simple_loss=0.402, pruned_loss=0.1492, over 32721429.37 frames. ], batch size: 50, lr: 3.28e-02, grad_scale: 32.0 2023-10-09 15:46:30,329 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.848e+02 2.448e+02 2.864e+02 3.433e+02 4.684e+02, threshold=5.727e+02, percent-clipped=0.0 2023-10-09 15:46:42,403 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=56093.333333333336, ans=0.125 2023-10-09 15:46:52,696 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.33 vs. limit=15.0 2023-10-09 15:46:56,050 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.35 vs. limit=15.0 2023-10-09 15:47:04,068 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=56186.666666666664, ans=0.2 2023-10-09 15:47:09,946 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.04 vs. 
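[Editor's note] The learning rate in the batch summaries decays smoothly (3.49e-02 at batch 10000 down to 3.28e-02 at batch 12000 just above). These values reproduce, to within the log's rounding, the batch-dependent factor of icefall's Eden schedule with this run's base_lr = 0.045 and lr_batches = 7500; the epoch-dependent factor is ≈ 1.0 this early in epoch 1 and is omitted from the check:

```python
def eden_lr(batch: float, base_lr: float = 0.045,
            lr_batches: float = 7500.0) -> float:
    # Batch term of the Eden schedule: base_lr * ((b^2 + B^2) / B^2)^(-1/4).
    return base_lr * ((batch**2 + lr_batches**2) / lr_batches**2) ** -0.25

for batch, logged in [(10000, 3.49e-2), (11500, 3.33e-2), (12000, 3.28e-2)]:
    assert abs(eden_lr(batch) - logged) < 5e-5, (batch, eden_lr(batch))
```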
limit=15.0 2023-10-09 15:47:26,234 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=56233.333333333336, ans=0.125 2023-10-09 15:47:57,602 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=56373.333333333336, ans=0.125 2023-10-09 15:48:01,518 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=56373.333333333336, ans=0.0 2023-10-09 15:48:02,564 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 15:48:15,664 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=56420.0, ans=0.1 2023-10-09 15:48:30,134 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.742e+02 2.328e+02 2.638e+02 3.008e+02 5.387e+02, threshold=5.275e+02, percent-clipped=0.0 2023-10-09 15:48:35,305 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=56513.333333333336, ans=0.125 2023-10-09 15:48:46,039 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=56560.0, ans=0.125 2023-10-09 15:48:58,090 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=56606.666666666664, ans=0.0 2023-10-09 15:48:58,894 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=56606.666666666664, ans=0.1 2023-10-09 15:49:14,488 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=56700.0, ans=0.125 2023-10-09 15:49:14,687 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=56700.0, ans=0.125 2023-10-09 15:50:12,888 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=56793.333333333336, ans=0.125 2023-10-09 15:50:28,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=56793.333333333336, ans=0.1 2023-10-09 15:50:42,032 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 15:51:05,129 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.784e+02 2.337e+02 2.722e+02 3.256e+02 4.867e+02, threshold=5.443e+02, percent-clipped=0.0 2023-10-09 15:51:09,854 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=56980.0, ans=0.0 2023-10-09 15:51:34,213 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=57073.333333333336, ans=0.2 2023-10-09 15:51:35,934 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=57073.333333333336, ans=0.1 2023-10-09 15:52:02,354 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.27 vs. 
limit=6.0 2023-10-09 15:52:02,961 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=57213.333333333336, ans=0.125 2023-10-09 15:52:10,787 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=57213.333333333336, ans=0.0 2023-10-09 15:52:20,824 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=57260.0, ans=0.0 2023-10-09 15:52:24,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=57306.666666666664, ans=0.5 2023-10-09 15:52:35,929 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.74 vs. limit=15.0 2023-10-09 15:52:39,015 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=57353.333333333336, ans=0.1 2023-10-09 15:52:39,083 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=57353.333333333336, ans=0.0 2023-10-09 15:52:39,243 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.18 vs. limit=15.0 2023-10-09 15:52:43,511 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.87 vs. limit=10.0 2023-10-09 15:53:02,149 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.798e+02 2.420e+02 2.824e+02 3.045e+02 4.744e+02, threshold=5.648e+02, percent-clipped=0.0 2023-10-09 15:53:03,649 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=57446.666666666664, ans=0.125 2023-10-09 15:53:05,126 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=57446.666666666664, ans=0.2 2023-10-09 15:53:28,876 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=57540.0, ans=0.125 2023-10-09 15:53:59,420 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 15:54:11,062 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=57680.0, ans=0.035 2023-10-09 15:54:21,074 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=57726.666666666664, ans=0.0 2023-10-09 15:55:01,433 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.777e+02 2.539e+02 2.866e+02 3.399e+02 5.145e+02, threshold=5.732e+02, percent-clipped=0.0 2023-10-09 15:55:08,779 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=57913.333333333336, ans=0.2 2023-10-09 15:55:08,950 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=57913.333333333336, ans=0.125 2023-10-09 15:55:09,124 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.43 vs. 
limit=15.0 2023-10-09 15:55:20,046 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=57960.0, ans=10.0 2023-10-09 15:55:20,049 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=57960.0, ans=0.125 2023-10-09 15:55:21,782 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=58006.666666666664, ans=0.125 2023-10-09 15:55:38,721 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=58053.333333333336, ans=0.125 2023-10-09 15:56:02,418 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=7.44 vs. limit=15.0 2023-10-09 15:56:18,073 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=58193.333333333336, ans=0.5 2023-10-09 15:56:20,372 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=58193.333333333336, ans=0.0 2023-10-09 15:56:46,410 INFO [train.py:1031] (0/4) Epoch 1, batch 12500, loss[loss=0.3292, simple_loss=0.3949, pruned_loss=0.1318, over 16544.00 frames. ], tot_loss[loss=0.3471, simple_loss=0.3998, pruned_loss=0.1474, over 32712131.92 frames. ], batch size: 61, lr: 3.23e-02, grad_scale: 32.0 2023-10-09 15:56:51,819 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=58333.333333333336, ans=0.125 2023-10-09 15:56:53,977 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=58333.333333333336, ans=0.125 2023-10-09 15:56:57,802 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.770e+02 2.346e+02 2.659e+02 3.058e+02 4.471e+02, threshold=5.317e+02, percent-clipped=0.0 2023-10-09 15:57:14,520 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=58426.666666666664, ans=0.0 2023-10-09 15:57:37,087 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=58520.0, ans=0.2 2023-10-09 15:57:41,816 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=58566.666666666664, ans=0.0 2023-10-09 15:58:00,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=58613.333333333336, ans=0.0 2023-10-09 15:58:03,985 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=58660.0, ans=0.125 2023-10-09 15:58:12,011 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.33 vs. 
limit=12.0 2023-10-09 15:58:18,531 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=58706.666666666664, ans=0.125 2023-10-09 15:58:20,054 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=58706.666666666664, ans=0.2 2023-10-09 15:58:31,154 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=58753.333333333336, ans=0.125 2023-10-09 15:58:43,032 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=58800.0, ans=0.2 2023-10-09 15:58:51,198 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.736e+02 2.294e+02 2.596e+02 2.990e+02 4.793e+02, threshold=5.193e+02, percent-clipped=0.0 2023-10-09 15:59:27,869 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=58986.666666666664, ans=0.125 2023-10-09 15:59:27,927 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=58986.666666666664, ans=0.125 2023-10-09 15:59:28,900 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=58986.666666666664, ans=0.2 2023-10-09 15:59:29,968 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.97 vs. limit=15.0 2023-10-09 15:59:49,932 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=59080.0, ans=0.1 2023-10-09 15:59:54,291 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=59080.0, ans=0.1 2023-10-09 16:00:00,152 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=59080.0, ans=0.015 2023-10-09 16:00:38,522 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=59173.333333333336, ans=0.2 2023-10-09 16:01:07,198 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=59220.0, ans=0.1 2023-10-09 16:01:16,600 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=59266.666666666664, ans=0.0 2023-10-09 16:01:25,555 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.722e+02 2.297e+02 2.794e+02 3.101e+02 4.689e+02, threshold=5.588e+02, percent-clipped=0.0 2023-10-09 16:01:39,457 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.00 vs. limit=22.5 2023-10-09 16:02:10,420 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=59500.0, ans=0.1 2023-10-09 16:02:14,209 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.48 vs. 
limit=22.5 2023-10-09 16:02:15,068 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=59500.0, ans=0.1 2023-10-09 16:02:32,621 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=59593.333333333336, ans=0.2 2023-10-09 16:02:46,262 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=59640.0, ans=0.09899494936611666 2023-10-09 16:03:19,582 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.819e+02 2.339e+02 2.722e+02 3.043e+02 4.542e+02, threshold=5.444e+02, percent-clipped=0.0 2023-10-09 16:03:36,105 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.66 vs. limit=12.0 2023-10-09 16:03:50,403 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=59873.333333333336, ans=0.2 2023-10-09 16:04:08,673 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=2.544e-03 2023-10-09 16:04:17,567 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=60013.333333333336, ans=0.125 2023-10-09 16:04:31,850 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=60060.0, ans=0.0 2023-10-09 16:04:47,849 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=60106.666666666664, ans=0.125 2023-10-09 16:04:59,482 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=60153.333333333336, ans=0.0 2023-10-09 16:05:01,662 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=60153.333333333336, ans=0.0 2023-10-09 16:05:16,835 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.711e+02 2.361e+02 2.827e+02 3.291e+02 4.917e+02, threshold=5.654e+02, percent-clipped=0.0 2023-10-09 16:05:33,045 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=60293.333333333336, ans=0.125 2023-10-09 16:05:35,831 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=60340.0, ans=0.125 2023-10-09 16:05:41,833 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=60340.0, ans=0.1 2023-10-09 16:05:46,474 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=60340.0, ans=0.1 2023-10-09 16:05:48,435 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=60386.666666666664, ans=0.1 2023-10-09 16:05:52,507 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=60386.666666666664, ans=0.07 2023-10-09 16:05:57,144 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 16:06:10,565 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, 
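[Editor's note] The `WithLoss` lines report an auxiliary penalty attached to the attention-weight tensors; it is almost always 0.000e+00 here, with a single nonzero reading (2.544e-03) just above, meaning the constraint only occasionally binds. Very roughly, the mechanism looks like the wrapper below; the floor-based criterion is a placeholder, not the one actually used in scaling.py:

```python
import torch

class WithAuxLoss(torch.nn.Module):
    """Wrap a module and expose a penalty on its output that the training
    loop can add to the main loss and log as 'loss-sum'."""
    def __init__(self, module: torch.nn.Module, floor: float = 1e-4):
        super().__init__()
        self.module = module
        self.floor = floor

    def forward(self, *args):
        out = self.module(*args)
        # Penalize mass below `floor`; zero (as in most log lines here)
        # when the output stays clear of the constraint.
        penalty = (self.floor - out).clamp(min=0.0).sum()
        return out, penalty

wrapped = WithAuxLoss(torch.nn.Softmax(dim=-1))
weights, aux = wrapped(torch.randn(2, 4, 8))
print(float(aux))  # typically 0.0, like loss-sum=0.000e+00
```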
batch_count=60480.0, ans=0.0 2023-10-09 16:06:13,767 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 16:06:19,863 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=60526.666666666664, ans=0.1 2023-10-09 16:06:21,711 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=60526.666666666664, ans=0.0 2023-10-09 16:06:28,076 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=60526.666666666664, ans=0.1 2023-10-09 16:06:52,651 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=60620.0, ans=0.125 2023-10-09 16:06:55,217 INFO [train.py:1031] (0/4) Epoch 1, batch 13000, loss[loss=0.3253, simple_loss=0.38, pruned_loss=0.1354, over 16157.00 frames. ], tot_loss[loss=0.3458, simple_loss=0.3993, pruned_loss=0.1464, over 32757740.46 frames. ], batch size: 43, lr: 3.18e-02, grad_scale: 32.0 2023-10-09 16:07:02,057 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=60666.666666666664, ans=0.0 2023-10-09 16:07:23,485 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.805e+02 2.507e+02 3.122e+02 3.868e+02 6.282e+02, threshold=6.244e+02, percent-clipped=3.0 2023-10-09 16:07:37,559 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.17 vs. limit=15.0 2023-10-09 16:07:51,189 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=60760.0, ans=0.0 2023-10-09 16:08:49,764 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 16:08:54,694 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=61040.0, ans=0.0 2023-10-09 16:08:57,861 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=61040.0, ans=0.0 2023-10-09 16:09:09,451 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=61086.666666666664, ans=0.1 2023-10-09 16:09:13,311 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=61086.666666666664, ans=0.125 2023-10-09 16:09:33,492 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.757e+02 2.413e+02 2.867e+02 3.318e+02 4.884e+02, threshold=5.734e+02, percent-clipped=0.0 2023-10-09 16:10:16,045 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=61320.0, ans=0.0 2023-10-09 16:10:26,818 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=61366.666666666664, ans=0.5 2023-10-09 16:10:27,661 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=61413.333333333336, ans=0.09899494936611666 2023-10-09 16:10:32,357 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=61413.333333333336, ans=0.0 2023-10-09 16:10:37,440 INFO 
[scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.66 vs. limit=15.0 2023-10-09 16:10:52,835 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.75 vs. limit=15.0 2023-10-09 16:11:06,981 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=61553.333333333336, ans=0.2 2023-10-09 16:11:07,045 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=61553.333333333336, ans=0.1 2023-10-09 16:11:24,991 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.61 vs. limit=15.0 2023-10-09 16:11:31,936 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.720e+02 2.500e+02 2.860e+02 3.236e+02 4.527e+02, threshold=5.721e+02, percent-clipped=0.0 2023-10-09 16:11:52,136 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=61740.0, ans=0.125 2023-10-09 16:11:56,401 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=61740.0, ans=0.125 2023-10-09 16:12:03,784 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.18 vs. limit=15.0 2023-10-09 16:12:07,480 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.90 vs. limit=15.0 2023-10-09 16:12:12,184 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=61786.666666666664, ans=0.0 2023-10-09 16:12:34,473 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.17 vs. limit=22.5 2023-10-09 16:12:35,361 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=61880.0, ans=0.0 2023-10-09 16:12:53,104 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=61973.333333333336, ans=0.125 2023-10-09 16:13:03,747 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=62020.0, ans=0.0 2023-10-09 16:13:15,548 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=62066.666666666664, ans=0.125 2023-10-09 16:13:20,392 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=62066.666666666664, ans=0.2 2023-10-09 16:13:25,203 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.608e+02 2.308e+02 2.743e+02 3.164e+02 4.193e+02, threshold=5.485e+02, percent-clipped=0.0 2023-10-09 16:13:34,379 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=62160.0, ans=0.04949747468305833 2023-10-09 16:13:36,628 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=10.39 vs. 
limit=15.0 2023-10-09 16:13:37,666 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=62160.0, ans=0.125 2023-10-09 16:13:37,714 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=62160.0, ans=0.0 2023-10-09 16:13:39,677 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=62160.0, ans=0.0 2023-10-09 16:13:44,373 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=62160.0, ans=0.0 2023-10-09 16:13:57,335 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=62253.333333333336, ans=10.0 2023-10-09 16:14:08,413 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-09 16:14:44,243 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=62393.333333333336, ans=0.2 2023-10-09 16:14:53,148 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=62440.0, ans=0.125 2023-10-09 16:15:05,007 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=62440.0, ans=0.125 2023-10-09 16:15:05,274 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.83 vs. limit=10.0 2023-10-09 16:15:14,339 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=62486.666666666664, ans=0.2 2023-10-09 16:15:30,351 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 16:15:30,911 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.862e+02 2.302e+02 2.597e+02 2.889e+02 4.663e+02, threshold=5.193e+02, percent-clipped=0.0 2023-10-09 16:15:37,989 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.44 vs. limit=15.0 2023-10-09 16:16:26,169 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=62813.333333333336, ans=0.1 2023-10-09 16:16:39,385 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=62860.0, ans=0.125 2023-10-09 16:16:45,302 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.46 vs. 
limit=15.0 2023-10-09 16:16:52,326 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=62906.666666666664, ans=0.125 2023-10-09 16:16:53,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=62906.666666666664, ans=0.125 2023-10-09 16:17:01,166 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=62953.333333333336, ans=0.125 2023-10-09 16:17:09,927 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=62953.333333333336, ans=0.0 2023-10-09 16:17:11,420 INFO [train.py:1031] (0/4) Epoch 1, batch 13500, loss[loss=0.3394, simple_loss=0.3975, pruned_loss=0.1406, over 16730.00 frames. ], tot_loss[loss=0.3427, simple_loss=0.3968, pruned_loss=0.1444, over 32737886.65 frames. ], batch size: 202, lr: 3.14e-02, grad_scale: 32.0 2023-10-09 16:17:18,548 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=63000.0, ans=0.0 2023-10-09 16:17:23,889 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.708e+02 2.381e+02 2.740e+02 3.450e+02 5.521e+02, threshold=5.480e+02, percent-clipped=2.0 2023-10-09 16:17:32,818 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=63046.666666666664, ans=0.2 2023-10-09 16:17:53,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=63140.0, ans=0.0 2023-10-09 16:18:49,152 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=63373.333333333336, ans=0.05 2023-10-09 16:18:59,163 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=63420.0, ans=0.1 2023-10-09 16:19:01,942 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=63420.0, ans=0.1 2023-10-09 16:19:10,579 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=63466.666666666664, ans=0.0 2023-10-09 16:19:21,161 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.751e+02 2.355e+02 2.753e+02 3.090e+02 4.398e+02, threshold=5.506e+02, percent-clipped=0.0 2023-10-09 16:19:31,520 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.92 vs. limit=10.0 2023-10-09 16:19:37,516 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.54 vs. 
limit=15.0 2023-10-09 16:19:43,615 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=63606.666666666664, ans=0.0 2023-10-09 16:19:46,633 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=63606.666666666664, ans=0.1 2023-10-09 16:19:48,366 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=63606.666666666664, ans=0.125 2023-10-09 16:19:57,888 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=63653.333333333336, ans=0.1 2023-10-09 16:20:06,749 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/epoch-1.pt 2023-10-09 16:20:40,166 INFO [train.py:1031] (0/4) Epoch 2, batch 0, loss[loss=0.3227, simple_loss=0.3777, pruned_loss=0.1339, over 16645.00 frames. ], tot_loss[loss=0.3227, simple_loss=0.3777, pruned_loss=0.1339, over 16645.00 frames. ], batch size: 241, lr: 2.63e-02, grad_scale: 32.0 2023-10-09 16:20:40,167 INFO [train.py:1054] (0/4) Computing validation loss 2023-10-09 16:20:48,136 INFO [train.py:1063] (0/4) Epoch 2, validation: loss=0.3074, simple_loss=0.3842, pruned_loss=0.1153, over 1020973.00 frames. 2023-10-09 16:20:48,136 INFO [train.py:1064] (0/4) Maximum memory allocated so far is 17165MB 2023-10-09 16:21:07,194 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=63770.0, ans=0.125 2023-10-09 16:21:16,427 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=63816.666666666664, ans=0.5 2023-10-09 16:21:20,807 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=63816.666666666664, ans=0.125 2023-10-09 16:21:25,546 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=63863.333333333336, ans=0.125 2023-10-09 16:21:38,094 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.81 vs. 
limit=15.0 2023-10-09 16:21:55,732 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=63956.666666666664, ans=10.0 2023-10-09 16:21:57,066 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.661e+02 2.350e+02 2.672e+02 3.244e+02 4.748e+02, threshold=5.344e+02, percent-clipped=0.0 2023-10-09 16:22:01,087 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=63956.666666666664, ans=0.0 2023-10-09 16:22:16,773 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=64050.0, ans=0.125 2023-10-09 16:22:25,536 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=64050.0, ans=0.125 2023-10-09 16:22:52,773 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=64190.0, ans=0.0 2023-10-09 16:23:01,407 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=64236.666666666664, ans=0.125 2023-10-09 16:23:30,063 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=64330.0, ans=0.125 2023-10-09 16:23:31,686 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.25 vs. limit=6.0 2023-10-09 16:23:34,654 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=64330.0, ans=0.0 2023-10-09 16:23:36,009 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.33 vs. limit=6.0 2023-10-09 16:23:54,102 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.741e+02 2.175e+02 2.474e+02 2.891e+02 4.689e+02, threshold=4.948e+02, percent-clipped=0.0 2023-10-09 16:23:56,432 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=64423.333333333336, ans=0.0 2023-10-09 16:24:01,343 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=64470.0, ans=0.125 2023-10-09 16:24:18,872 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=64516.666666666664, ans=0.2 2023-10-09 16:24:20,334 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.04 vs. 
limit=6.0 2023-10-09 16:24:24,032 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=64563.333333333336, ans=0.125 2023-10-09 16:24:28,418 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=64563.333333333336, ans=0.1 2023-10-09 16:24:31,364 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=64610.0, ans=0.125 2023-10-09 16:24:32,738 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=64610.0, ans=10.0 2023-10-09 16:24:46,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=64656.666666666664, ans=10.0 2023-10-09 16:25:15,284 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=64796.666666666664, ans=0.0 2023-10-09 16:25:48,413 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.813e+02 2.049e+02 2.283e+02 2.687e+02 4.170e+02, threshold=4.566e+02, percent-clipped=0.0 2023-10-09 16:25:53,125 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=64936.666666666664, ans=0.125 2023-10-09 16:26:16,105 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.96 vs. limit=22.5 2023-10-09 16:26:18,091 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=65030.0, ans=0.5 2023-10-09 16:26:18,229 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=65030.0, ans=0.125 2023-10-09 16:26:21,120 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=65030.0, ans=0.125 2023-10-09 16:26:29,392 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.14 vs. limit=15.0 2023-10-09 16:26:31,101 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=65076.666666666664, ans=0.125 2023-10-09 16:26:36,690 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=65123.333333333336, ans=0.035 2023-10-09 16:26:44,386 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=65123.333333333336, ans=0.125 2023-10-09 16:26:51,877 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=65170.0, ans=15.0 2023-10-09 16:27:16,155 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=65263.333333333336, ans=0.2 2023-10-09 16:27:24,589 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=5.18 vs. 
limit=12.0 2023-10-09 16:27:34,924 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.621e+02 2.170e+02 2.445e+02 2.887e+02 4.355e+02, threshold=4.890e+02, percent-clipped=0.0 2023-10-09 16:27:35,268 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=65356.666666666664, ans=0.125 2023-10-09 16:27:40,878 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=65403.333333333336, ans=0.02 2023-10-09 16:27:50,308 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff2.min_abs, batch_count=65450.0, ans=0.1 2023-10-09 16:27:52,289 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=65450.0, ans=0.125 2023-10-09 16:27:59,951 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=65496.666666666664, ans=0.125 2023-10-09 16:28:19,479 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=65590.0, ans=0.125 2023-10-09 16:28:36,229 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=65636.66666666667, ans=0.2 2023-10-09 16:28:42,344 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.51 vs. limit=22.5 2023-10-09 16:28:58,182 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=65730.0, ans=0.125 2023-10-09 16:29:00,175 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=65730.0, ans=0.0 2023-10-09 16:29:06,917 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=65776.66666666667, ans=0.2 2023-10-09 16:29:23,478 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.796e+02 2.212e+02 2.551e+02 2.907e+02 5.719e+02, threshold=5.101e+02, percent-clipped=2.0 2023-10-09 16:29:39,947 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=65916.66666666667, ans=0.125 2023-10-09 16:29:50,272 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=65963.33333333333, ans=0.0 2023-10-09 16:30:08,382 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.14 vs. limit=15.0 2023-10-09 16:30:09,097 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=66010.0, ans=0.125 2023-10-09 16:30:13,139 INFO [train.py:1031] (0/4) Epoch 2, batch 500, loss[loss=0.2975, simple_loss=0.3587, pruned_loss=0.1181, over 16910.00 frames. ], tot_loss[loss=0.3231, simple_loss=0.3834, pruned_loss=0.1314, over 7306270.72 frames. 
], batch size: 165, lr: 2.59e-02, grad_scale: 32.0 2023-10-09 16:30:17,784 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=66056.66666666667, ans=0.125 2023-10-09 16:30:32,002 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.31 vs. limit=15.0 2023-10-09 16:30:32,696 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=66103.33333333333, ans=0.1 2023-10-09 16:30:43,428 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=66150.0, ans=10.0 2023-10-09 16:30:49,097 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.94 vs. limit=22.5 2023-10-09 16:31:03,595 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.44 vs. limit=15.0 2023-10-09 16:31:11,424 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=66290.0, ans=0.1 2023-10-09 16:31:15,568 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.635e+02 2.168e+02 2.544e+02 2.989e+02 5.347e+02, threshold=5.087e+02, percent-clipped=1.0 2023-10-09 16:31:16,641 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=66290.0, ans=0.125 2023-10-09 16:31:18,222 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=66290.0, ans=0.125 2023-10-09 16:31:48,963 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=66383.33333333333, ans=0.125 2023-10-09 16:32:04,454 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.09 vs. limit=8.0 2023-10-09 16:32:14,489 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=66476.66666666667, ans=0.125 2023-10-09 16:32:28,040 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.48 vs. limit=22.5 2023-10-09 16:32:30,232 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=66523.33333333333, ans=0.125 2023-10-09 16:32:31,026 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=66523.33333333333, ans=0.1 2023-10-09 16:32:31,049 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=66523.33333333333, ans=0.1 2023-10-09 16:32:41,675 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=66616.66666666667, ans=0.95 2023-10-09 16:32:45,032 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.37 vs. 
limit=22.5 2023-10-09 16:32:45,834 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=66616.66666666667, ans=0.0 2023-10-09 16:32:55,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=66663.33333333333, ans=0.07 2023-10-09 16:33:08,266 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=66710.0, ans=0.0 2023-10-09 16:33:14,709 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=66710.0, ans=0.0 2023-10-09 16:33:20,339 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=66756.66666666667, ans=0.1 2023-10-09 16:33:22,382 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=66756.66666666667, ans=10.0 2023-10-09 16:33:23,981 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.787e+02 2.204e+02 2.493e+02 2.814e+02 4.565e+02, threshold=4.986e+02, percent-clipped=0.0 2023-10-09 16:33:36,185 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=66803.33333333333, ans=0.1 2023-10-09 16:34:01,118 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=66896.66666666667, ans=0.125 2023-10-09 16:34:01,128 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=66896.66666666667, ans=0.1 2023-10-09 16:34:48,082 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=67083.33333333333, ans=0.1 2023-10-09 16:34:54,891 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.57 vs. limit=15.0 2023-10-09 16:35:10,869 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=67176.66666666667, ans=0.0 2023-10-09 16:35:16,051 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=67176.66666666667, ans=0.2 2023-10-09 16:35:27,464 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.767e+02 2.208e+02 2.513e+02 2.866e+02 4.304e+02, threshold=5.027e+02, percent-clipped=0.0 2023-10-09 16:35:31,534 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=67270.0, ans=0.125 2023-10-09 16:35:41,845 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=67316.66666666667, ans=0.125 2023-10-09 16:35:52,119 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=67363.33333333333, ans=0.05 2023-10-09 16:36:00,841 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.91 vs. 
limit=15.0 2023-10-09 16:36:01,775 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=67410.0, ans=0.2 2023-10-09 16:36:32,101 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.83 vs. limit=22.5 2023-10-09 16:37:06,530 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=67596.66666666667, ans=0.125 2023-10-09 16:37:08,396 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=67643.33333333333, ans=0.125 2023-10-09 16:37:12,472 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=67643.33333333333, ans=0.0 2023-10-09 16:37:27,624 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.648e+02 2.224e+02 2.569e+02 2.872e+02 4.220e+02, threshold=5.137e+02, percent-clipped=0.0 2023-10-09 16:37:33,254 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=67736.66666666667, ans=0.125 2023-10-09 16:37:56,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=67830.0, ans=0.125 2023-10-09 16:37:57,207 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=67830.0, ans=0.1 2023-10-09 16:37:59,161 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=67830.0, ans=0.2 2023-10-09 16:38:01,108 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.00 vs. limit=22.5 2023-10-09 16:38:02,583 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=67830.0, ans=0.1 2023-10-09 16:38:10,274 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=67876.66666666667, ans=0.125 2023-10-09 16:38:12,839 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=67876.66666666667, ans=0.125 2023-10-09 16:38:13,790 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.91 vs. 
limit=15.0 2023-10-09 16:38:19,622 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=67923.33333333333, ans=0.2 2023-10-09 16:38:25,289 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=67923.33333333333, ans=0.07 2023-10-09 16:38:33,216 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=67970.0, ans=0.0 2023-10-09 16:38:42,732 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=68016.66666666667, ans=0.2 2023-10-09 16:38:54,154 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 16:39:02,916 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=68063.33333333333, ans=0.1 2023-10-09 16:39:04,925 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=68110.0, ans=0.0 2023-10-09 16:39:08,645 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.90 vs. limit=15.0 2023-10-09 16:39:27,102 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.662e+02 2.089e+02 2.451e+02 2.809e+02 3.972e+02, threshold=4.902e+02, percent-clipped=0.0 2023-10-09 16:39:42,624 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=68250.0, ans=0.125 2023-10-09 16:39:48,147 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=68250.0, ans=0.0 2023-10-09 16:39:53,427 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=68296.66666666667, ans=0.125 2023-10-09 16:40:16,110 INFO [train.py:1031] (0/4) Epoch 2, batch 1000, loss[loss=0.2989, simple_loss=0.3618, pruned_loss=0.118, over 15365.00 frames. ], tot_loss[loss=0.3216, simple_loss=0.3821, pruned_loss=0.1306, over 12920344.15 frames. 
], batch size: 35, lr: 2.55e-02, grad_scale: 32.0 2023-10-09 16:40:59,896 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=68576.66666666667, ans=0.2 2023-10-09 16:41:18,148 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.621e+02 2.111e+02 2.370e+02 2.649e+02 5.291e+02, threshold=4.741e+02, percent-clipped=2.0 2023-10-09 16:41:24,102 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=68670.0, ans=0.125 2023-10-09 16:41:37,361 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=68716.66666666667, ans=0.125 2023-10-09 16:41:42,774 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=68763.33333333333, ans=0.1 2023-10-09 16:42:18,168 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=68903.33333333333, ans=0.125 2023-10-09 16:42:21,576 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=68903.33333333333, ans=0.125 2023-10-09 16:42:27,920 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=68950.0, ans=0.125 2023-10-09 16:42:35,570 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=68950.0, ans=0.125 2023-10-09 16:42:36,924 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=68950.0, ans=0.125 2023-10-09 16:42:40,907 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=68996.66666666667, ans=0.125 2023-10-09 16:43:00,753 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=69043.33333333333, ans=0.5 2023-10-09 16:43:13,726 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.677e+02 2.259e+02 2.528e+02 3.055e+02 4.519e+02, threshold=5.057e+02, percent-clipped=0.0 2023-10-09 16:43:21,060 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.39 vs. 
limit=15.0 2023-10-09 16:43:39,540 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=69183.33333333333, ans=0.125 2023-10-09 16:43:57,640 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=69230.0, ans=0.2 2023-10-09 16:44:09,263 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=69276.66666666667, ans=0.125 2023-10-09 16:44:16,375 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=69323.33333333333, ans=0.2 2023-10-09 16:44:45,520 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=69416.66666666667, ans=0.05 2023-10-09 16:44:54,134 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=69463.33333333333, ans=0.125 2023-10-09 16:45:22,274 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.593e+02 1.974e+02 2.158e+02 2.514e+02 3.561e+02, threshold=4.315e+02, percent-clipped=0.0 2023-10-09 16:45:32,982 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=69603.33333333333, ans=0.125 2023-10-09 16:46:07,947 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=69743.33333333333, ans=0.0 2023-10-09 16:46:09,392 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=69743.33333333333, ans=0.125 2023-10-09 16:46:22,352 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=69790.0, ans=0.125 2023-10-09 16:46:28,075 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=69836.66666666667, ans=0.2 2023-10-09 16:47:02,163 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=69930.0, ans=0.125 2023-10-09 16:47:15,930 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=70023.33333333333, ans=0.125 2023-10-09 16:47:19,349 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=70023.33333333333, ans=0.0 2023-10-09 16:47:22,026 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.581e+02 2.290e+02 2.600e+02 2.934e+02 4.496e+02, threshold=5.200e+02, percent-clipped=2.0 2023-10-09 16:47:38,206 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=70116.66666666667, ans=0.0 2023-10-09 16:47:47,281 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.09 vs. 
limit=15.0 2023-10-09 16:48:27,391 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=70303.33333333333, ans=0.035 2023-10-09 16:48:36,716 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=70350.0, ans=0.07 2023-10-09 16:48:41,694 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=70350.0, ans=0.1 2023-10-09 16:48:50,779 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=70396.66666666667, ans=0.015 2023-10-09 16:49:07,957 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=70443.33333333333, ans=0.0 2023-10-09 16:49:19,224 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.717e+02 2.147e+02 2.406e+02 2.659e+02 4.241e+02, threshold=4.813e+02, percent-clipped=0.0 2023-10-09 16:49:48,071 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=70583.33333333333, ans=0.125 2023-10-09 16:50:19,006 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=70630.0, ans=0.2 2023-10-09 16:50:22,481 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=70676.66666666667, ans=0.125 2023-10-09 16:50:33,820 INFO [train.py:1031] (0/4) Epoch 2, batch 1500, loss[loss=0.2986, simple_loss=0.3649, pruned_loss=0.1161, over 16854.00 frames. ], tot_loss[loss=0.3176, simple_loss=0.3788, pruned_loss=0.1282, over 17334886.36 frames. ], batch size: 146, lr: 2.52e-02, grad_scale: 32.0 2023-10-09 16:50:50,855 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.68 vs. limit=6.0 2023-10-09 16:50:51,456 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=70770.0, ans=0.2 2023-10-09 16:50:52,307 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=70770.0, ans=0.125 2023-10-09 16:50:52,377 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=70770.0, ans=0.0 2023-10-09 16:51:00,475 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=70816.66666666667, ans=0.0 2023-10-09 16:51:10,855 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=70816.66666666667, ans=0.2 2023-10-09 16:51:26,057 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=70863.33333333333, ans=0.1 2023-10-09 16:51:31,605 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=70910.0, ans=0.1 2023-10-09 16:51:32,625 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.04 vs. 
limit=15.0 2023-10-09 16:51:35,703 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=70910.0, ans=0.2 2023-10-09 16:51:53,620 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.690e+02 2.134e+02 2.459e+02 2.793e+02 3.782e+02, threshold=4.918e+02, percent-clipped=0.0 2023-10-09 16:52:01,124 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=71003.33333333333, ans=0.2 2023-10-09 16:52:01,210 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=71003.33333333333, ans=0.125 2023-10-09 16:52:02,593 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=71003.33333333333, ans=0.125 2023-10-09 16:52:18,633 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=71050.0, ans=0.125 2023-10-09 16:52:21,433 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=71096.66666666667, ans=0.125 2023-10-09 16:52:22,646 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=71096.66666666667, ans=10.0 2023-10-09 16:52:24,069 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=71096.66666666667, ans=0.125 2023-10-09 16:52:29,577 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=71096.66666666667, ans=0.1 2023-10-09 16:52:31,204 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=71096.66666666667, ans=0.0 2023-10-09 16:52:36,725 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=71143.33333333333, ans=0.125 2023-10-09 16:52:54,462 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=71236.66666666667, ans=0.05 2023-10-09 16:52:55,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=71236.66666666667, ans=0.2 2023-10-09 16:52:58,782 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=71236.66666666667, ans=0.0 2023-10-09 16:53:04,823 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=71236.66666666667, ans=0.0 2023-10-09 16:53:21,947 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=71330.0, ans=0.125 2023-10-09 16:53:25,143 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=71330.0, ans=0.0 2023-10-09 16:53:30,869 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 16:53:58,067 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.706e+02 2.153e+02 2.464e+02 2.808e+02 3.679e+02, threshold=4.927e+02, percent-clipped=0.0 2023-10-09 16:54:04,781 INFO [scaling.py:199] (0/4) 
ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=71470.0, ans=0.0 2023-10-09 16:54:27,288 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=71563.33333333333, ans=0.0 2023-10-09 16:55:01,580 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.00 vs. limit=15.0 2023-10-09 16:55:19,241 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=71750.0, ans=0.07 2023-10-09 16:55:31,923 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=71843.33333333333, ans=0.1 2023-10-09 16:55:44,915 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=71890.0, ans=0.1 2023-10-09 16:55:50,977 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.675e+02 2.360e+02 2.595e+02 3.075e+02 4.930e+02, threshold=5.190e+02, percent-clipped=1.0 2023-10-09 16:55:51,296 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=71890.0, ans=0.0 2023-10-09 16:56:10,839 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=71983.33333333333, ans=0.125 2023-10-09 16:56:16,372 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=71983.33333333333, ans=0.125 2023-10-09 16:56:34,884 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=72076.66666666667, ans=0.2 2023-10-09 16:56:52,267 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 16:57:31,507 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=72310.0, ans=0.2 2023-10-09 16:57:50,143 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.750e+02 2.312e+02 2.668e+02 3.373e+02 4.905e+02, threshold=5.336e+02, percent-clipped=0.0 2023-10-09 16:57:50,563 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff3.min_abs, batch_count=72356.66666666667, ans=0.2 2023-10-09 16:58:04,364 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.09 vs. 
limit=15.0 2023-10-09 16:58:05,748 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 16:58:21,232 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=72496.66666666667, ans=0.125 2023-10-09 16:58:29,490 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=72543.33333333333, ans=0.125 2023-10-09 16:58:44,938 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=72590.0, ans=0.125 2023-10-09 16:58:46,742 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=72590.0, ans=0.125 2023-10-09 16:58:53,939 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.81 vs. limit=22.5 2023-10-09 16:59:13,635 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 16:59:58,992 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.587e+02 2.198e+02 2.537e+02 2.863e+02 4.831e+02, threshold=5.075e+02, percent-clipped=0.0 2023-10-09 17:00:08,730 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.46 vs. limit=6.0 2023-10-09 17:00:11,114 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=72870.0, ans=0.1 2023-10-09 17:00:31,443 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=72963.33333333333, ans=0.125 2023-10-09 17:00:45,428 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=73010.0, ans=0.125 2023-10-09 17:00:56,160 INFO [train.py:1031] (0/4) Epoch 2, batch 2000, loss[loss=0.3046, simple_loss=0.377, pruned_loss=0.1161, over 16824.00 frames. ], tot_loss[loss=0.3162, simple_loss=0.3781, pruned_loss=0.1271, over 20770075.24 frames. ], batch size: 130, lr: 2.49e-02, grad_scale: 32.0 2023-10-09 17:02:17,889 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.621e+02 2.298e+02 2.585e+02 2.929e+02 3.982e+02, threshold=5.170e+02, percent-clipped=0.0 2023-10-09 17:02:33,309 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_na.min_abs, batch_count=73383.33333333333, ans=0.02 2023-10-09 17:02:35,194 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=73383.33333333333, ans=0.125 2023-10-09 17:02:38,874 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.29 vs. limit=15.0 2023-10-09 17:02:53,211 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=73430.0, ans=0.0 2023-10-09 17:02:54,616 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.86 vs. 
limit=6.0 2023-10-09 17:02:54,743 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.73 vs. limit=15.0 2023-10-09 17:03:01,050 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=73476.66666666667, ans=0.035 2023-10-09 17:03:07,870 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=73476.66666666667, ans=0.125 2023-10-09 17:03:16,484 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=73523.33333333333, ans=0.0 2023-10-09 17:03:54,394 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=73616.66666666667, ans=0.0 2023-10-09 17:03:58,501 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.23 vs. limit=15.0 2023-10-09 17:04:41,104 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.806e+02 2.164e+02 2.490e+02 2.805e+02 6.104e+02, threshold=4.980e+02, percent-clipped=1.0 2023-10-09 17:05:14,793 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=73896.66666666667, ans=0.125 2023-10-09 17:05:38,419 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=73990.0, ans=0.125 2023-10-09 17:05:45,332 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.11 vs. limit=10.0 2023-10-09 17:06:05,084 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=74083.33333333333, ans=0.125 2023-10-09 17:06:12,719 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=74130.0, ans=0.1 2023-10-09 17:06:13,779 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=74130.0, ans=0.0 2023-10-09 17:06:18,678 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.53 vs. limit=15.0 2023-10-09 17:06:31,606 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=74223.33333333333, ans=0.125 2023-10-09 17:06:38,905 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.555e+02 2.202e+02 2.618e+02 3.068e+02 5.008e+02, threshold=5.236e+02, percent-clipped=1.0 2023-10-09 17:06:40,037 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=74223.33333333333, ans=0.1 2023-10-09 17:06:48,877 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=74270.0, ans=0.5 2023-10-09 17:06:58,605 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.28 vs. 
limit=6.0 2023-10-09 17:07:00,916 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=74316.66666666667, ans=0.025 2023-10-09 17:07:13,471 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.40 vs. limit=10.0 2023-10-09 17:07:22,629 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=74410.0, ans=0.025 2023-10-09 17:07:23,472 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=74410.0, ans=0.125 2023-10-09 17:07:34,056 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=74456.66666666667, ans=0.125 2023-10-09 17:07:55,902 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=74550.0, ans=0.0 2023-10-09 17:08:12,541 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=74596.66666666667, ans=0.125 2023-10-09 17:08:20,530 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-16000.pt 2023-10-09 17:08:35,761 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.727e+02 2.366e+02 2.689e+02 3.010e+02 5.051e+02, threshold=5.377e+02, percent-clipped=0.0 2023-10-09 17:08:44,392 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=74736.66666666667, ans=0.0 2023-10-09 17:08:55,431 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=74783.33333333333, ans=0.0 2023-10-09 17:08:56,298 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=74783.33333333333, ans=0.125 2023-10-09 17:08:58,017 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=74783.33333333333, ans=0.0 2023-10-09 17:09:14,465 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=74876.66666666667, ans=0.0 2023-10-09 17:09:57,481 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.98 vs. limit=6.0 2023-10-09 17:09:59,594 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.85 vs. limit=6.0 2023-10-09 17:10:01,342 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=75063.33333333333, ans=0.0 2023-10-09 17:10:01,371 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-09 17:10:05,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=75110.0, ans=0.125 2023-10-09 17:10:17,046 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.32 vs. 
limit=22.5 2023-10-09 17:10:17,702 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=75156.66666666667, ans=0.035 2023-10-09 17:10:21,959 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.619e+02 2.224e+02 2.662e+02 3.133e+02 4.356e+02, threshold=5.325e+02, percent-clipped=0.0 2023-10-09 17:10:23,263 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=75156.66666666667, ans=0.0 2023-10-09 17:10:32,815 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=75203.33333333333, ans=0.1 2023-10-09 17:10:41,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=75250.0, ans=0.0 2023-10-09 17:10:52,813 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=75296.66666666667, ans=0.0 2023-10-09 17:11:03,206 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=75343.33333333333, ans=0.125 2023-10-09 17:11:07,056 INFO [train.py:1031] (0/4) Epoch 2, batch 2500, loss[loss=0.371, simple_loss=0.3914, pruned_loss=0.1753, over 15630.00 frames. ], tot_loss[loss=0.3152, simple_loss=0.3771, pruned_loss=0.1266, over 23405937.51 frames. ], batch size: 350, lr: 2.46e-02, grad_scale: 32.0 2023-10-09 17:11:11,530 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.61 vs. limit=15.0 2023-10-09 17:11:15,539 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.46 vs. limit=15.0 2023-10-09 17:11:17,472 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=75390.0, ans=0.1 2023-10-09 17:11:58,153 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=75576.66666666667, ans=0.0 2023-10-09 17:12:09,388 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.03 vs. limit=15.0 2023-10-09 17:12:11,883 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.546e+02 2.126e+02 2.552e+02 3.063e+02 4.345e+02, threshold=5.104e+02, percent-clipped=0.0 2023-10-09 17:12:18,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=75670.0, ans=0.125 2023-10-09 17:12:36,736 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=75763.33333333333, ans=0.05 2023-10-09 17:12:37,532 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=75763.33333333333, ans=0.125 2023-10-09 17:12:38,315 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=75763.33333333333, ans=0.0 2023-10-09 17:12:46,127 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.18 vs. 
limit=6.0 2023-10-09 17:13:25,691 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=75950.0, ans=0.1 2023-10-09 17:13:40,129 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=75996.66666666667, ans=0.1 2023-10-09 17:13:42,129 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=76043.33333333333, ans=0.1 2023-10-09 17:13:44,072 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=76043.33333333333, ans=0.0 2023-10-09 17:13:46,270 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 17:13:54,460 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=76090.0, ans=0.0 2023-10-09 17:14:01,000 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.778e+02 2.136e+02 2.326e+02 2.724e+02 4.312e+02, threshold=4.651e+02, percent-clipped=0.0 2023-10-09 17:14:02,137 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=76090.0, ans=0.125 2023-10-09 17:14:05,773 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=76136.66666666667, ans=0.125 2023-10-09 17:14:19,054 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=76183.33333333333, ans=0.1 2023-10-09 17:14:21,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=76183.33333333333, ans=0.5 2023-10-09 17:14:22,455 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=76183.33333333333, ans=0.125 2023-10-09 17:14:26,047 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=76230.0, ans=0.1 2023-10-09 17:14:26,793 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=76230.0, ans=0.1 2023-10-09 17:15:04,848 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=76370.0, ans=0.04949747468305833 2023-10-09 17:15:17,335 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=76416.66666666667, ans=0.2 2023-10-09 17:15:21,688 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.05 vs. limit=22.5 2023-10-09 17:15:34,847 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.86 vs. 
limit=22.5 2023-10-09 17:15:44,134 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=76510.0, ans=0.0 2023-10-09 17:16:16,453 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.631e+02 2.021e+02 2.381e+02 2.875e+02 3.730e+02, threshold=4.762e+02, percent-clipped=0.0 2023-10-09 17:16:51,716 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=76696.66666666667, ans=0.125 2023-10-09 17:16:54,808 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=76743.33333333333, ans=0.1 2023-10-09 17:17:43,942 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=76930.0, ans=0.125 2023-10-09 17:17:44,123 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.17 vs. limit=22.5 2023-10-09 17:17:44,776 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=76930.0, ans=0.0 2023-10-09 17:17:51,620 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.98 vs. limit=15.0 2023-10-09 17:17:59,600 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.11 vs. limit=15.0 2023-10-09 17:18:09,614 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.21 vs. limit=15.0 2023-10-09 17:18:17,637 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=77023.33333333333, ans=0.125 2023-10-09 17:18:19,357 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.711e+02 2.317e+02 2.669e+02 3.224e+02 6.283e+02, threshold=5.337e+02, percent-clipped=2.0 2023-10-09 17:18:19,678 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=77023.33333333333, ans=0.125 2023-10-09 17:18:24,330 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.71 vs. limit=15.0 2023-10-09 17:18:38,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=77116.66666666667, ans=0.0 2023-10-09 17:18:44,258 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=77116.66666666667, ans=0.1 2023-10-09 17:18:45,979 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=77163.33333333333, ans=0.1 2023-10-09 17:18:55,858 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=77163.33333333333, ans=0.0 2023-10-09 17:19:14,756 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.95 vs. 
limit=15.0 2023-10-09 17:19:34,356 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=77303.33333333333, ans=0.0 2023-10-09 17:20:09,600 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=77443.33333333333, ans=0.125 2023-10-09 17:20:16,134 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=77490.0, ans=0.0 2023-10-09 17:20:19,429 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.81 vs. limit=15.0 2023-10-09 17:20:19,679 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.629e+02 2.193e+02 2.477e+02 2.782e+02 5.717e+02, threshold=4.954e+02, percent-clipped=1.0 2023-10-09 17:20:54,260 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=77630.0, ans=0.125 2023-10-09 17:21:05,862 INFO [train.py:1031] (0/4) Epoch 2, batch 3000, loss[loss=0.2632, simple_loss=0.3443, pruned_loss=0.09109, over 16854.00 frames. ], tot_loss[loss=0.3128, simple_loss=0.375, pruned_loss=0.1253, over 25480121.04 frames. ], batch size: 87, lr: 2.42e-02, grad_scale: 32.0 2023-10-09 17:21:10,355 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=77723.33333333333, ans=0.125 2023-10-09 17:21:33,998 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=77816.66666666667, ans=0.125 2023-10-09 17:21:45,834 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=77863.33333333333, ans=0.125 2023-10-09 17:22:10,777 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 2.080e+02 2.421e+02 2.879e+02 4.685e+02, threshold=4.842e+02, percent-clipped=0.0 2023-10-09 17:22:11,121 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=77956.66666666667, ans=0.125 2023-10-09 17:22:16,649 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=78003.33333333333, ans=0.125 2023-10-09 17:22:24,839 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=78050.0, ans=0.125 2023-10-09 17:22:26,221 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.07 vs. limit=15.0 2023-10-09 17:22:45,529 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=78143.33333333333, ans=0.125 2023-10-09 17:22:58,637 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=78190.0, ans=0.125 2023-10-09 17:23:01,311 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=78190.0, ans=0.125 2023-10-09 17:23:23,493 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.46 vs. 
limit=6.0 2023-10-09 17:23:27,199 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=78283.33333333333, ans=0.0 2023-10-09 17:24:03,730 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.748e+02 2.169e+02 2.375e+02 2.769e+02 3.890e+02, threshold=4.750e+02, percent-clipped=0.0 2023-10-09 17:24:11,626 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=78470.0, ans=0.125 2023-10-09 17:24:30,838 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.58 vs. limit=10.0 2023-10-09 17:24:52,189 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=78656.66666666667, ans=10.0 2023-10-09 17:25:23,147 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=78750.0, ans=0.125 2023-10-09 17:25:27,477 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=78796.66666666667, ans=0.0 2023-10-09 17:25:55,125 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=78890.0, ans=0.0 2023-10-09 17:26:02,581 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=78890.0, ans=0.05 2023-10-09 17:26:05,049 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.454e+02 2.124e+02 2.496e+02 2.948e+02 4.840e+02, threshold=4.992e+02, percent-clipped=1.0 2023-10-09 17:26:14,920 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=78936.66666666667, ans=0.125 2023-10-09 17:26:18,556 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.68 vs. 
limit=10.0 2023-10-09 17:26:21,127 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=78983.33333333333, ans=0.0 2023-10-09 17:26:33,237 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=79030.0, ans=0.0 2023-10-09 17:26:34,235 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=79030.0, ans=0.1 2023-10-09 17:26:47,384 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=79076.66666666667, ans=0.125 2023-10-09 17:26:55,856 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=79123.33333333333, ans=0.125 2023-10-09 17:27:35,389 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=79263.33333333333, ans=0.125 2023-10-09 17:27:40,794 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=79310.0, ans=0.0 2023-10-09 17:27:49,981 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=79356.66666666667, ans=0.2 2023-10-09 17:27:55,116 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.674e+02 2.132e+02 2.448e+02 2.724e+02 4.044e+02, threshold=4.896e+02, percent-clipped=0.0 2023-10-09 17:28:00,559 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=79403.33333333333, ans=0.125 2023-10-09 17:28:09,631 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.96 vs. limit=22.5 2023-10-09 17:28:24,309 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=79496.66666666667, ans=0.125 2023-10-09 17:28:31,678 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=79496.66666666667, ans=0.1 2023-10-09 17:28:44,773 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=79543.33333333333, ans=0.0 2023-10-09 17:28:52,470 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=79590.0, ans=0.125 2023-10-09 17:29:17,288 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.87 vs. limit=15.0 2023-10-09 17:29:18,101 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=79683.33333333333, ans=0.125 2023-10-09 17:29:18,477 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.80 vs. 
limit=15.0 2023-10-09 17:29:25,839 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=79730.0, ans=0.125 2023-10-09 17:29:29,739 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=79730.0, ans=0.125 2023-10-09 17:29:52,240 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.554e+02 2.122e+02 2.519e+02 2.921e+02 4.332e+02, threshold=5.037e+02, percent-clipped=0.0 2023-10-09 17:30:01,955 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=79870.0, ans=0.0 2023-10-09 17:30:04,643 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.42 vs. limit=10.0 2023-10-09 17:30:21,035 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.18 vs. limit=15.0 2023-10-09 17:30:21,911 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=79963.33333333333, ans=0.0 2023-10-09 17:30:26,553 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=79963.33333333333, ans=0.125 2023-10-09 17:30:40,257 INFO [train.py:1031] (0/4) Epoch 2, batch 3500, loss[loss=0.3095, simple_loss=0.378, pruned_loss=0.1205, over 16944.00 frames. ], tot_loss[loss=0.311, simple_loss=0.3737, pruned_loss=0.1242, over 27103938.06 frames. ], batch size: 165, lr: 2.39e-02, grad_scale: 32.0 2023-10-09 17:30:41,879 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.94 vs. limit=15.0 2023-10-09 17:30:57,553 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=80103.33333333333, ans=10.0 2023-10-09 17:31:46,041 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.717e+02 2.213e+02 2.449e+02 2.841e+02 4.307e+02, threshold=4.898e+02, percent-clipped=0.0 2023-10-09 17:32:04,129 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=80383.33333333333, ans=0.2 2023-10-09 17:32:09,051 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.73 vs. 
limit=10.0 2023-10-09 17:32:30,221 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=80476.66666666667, ans=0.125 2023-10-09 17:32:38,737 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=80476.66666666667, ans=0.125 2023-10-09 17:32:46,843 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=80523.33333333333, ans=0.0 2023-10-09 17:32:59,446 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=80570.0, ans=0.0 2023-10-09 17:32:59,502 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=80570.0, ans=0.09899494936611666 2023-10-09 17:33:03,771 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=80616.66666666667, ans=0.07 2023-10-09 17:33:31,564 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=80710.0, ans=0.1 2023-10-09 17:33:46,642 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.538e+02 2.111e+02 2.346e+02 2.701e+02 4.177e+02, threshold=4.692e+02, percent-clipped=0.0 2023-10-09 17:33:52,989 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=80803.33333333333, ans=0.2 2023-10-09 17:34:01,903 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=80850.0, ans=0.1 2023-10-09 17:34:22,895 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.72 vs. limit=22.5 2023-10-09 17:34:25,908 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=80943.33333333333, ans=0.125 2023-10-09 17:34:53,397 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=81036.66666666667, ans=0.0 2023-10-09 17:34:59,491 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=81083.33333333333, ans=0.0 2023-10-09 17:35:16,231 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=81130.0, ans=0.125 2023-10-09 17:35:18,926 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=81130.0, ans=0.1 2023-10-09 17:35:43,230 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=81223.33333333333, ans=0.09899494936611666 2023-10-09 17:35:43,444 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.79 vs. 
limit=6.0 2023-10-09 17:35:44,812 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.648e+02 2.023e+02 2.283e+02 2.727e+02 4.343e+02, threshold=4.565e+02, percent-clipped=0.0 2023-10-09 17:35:59,495 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=81316.66666666667, ans=0.2 2023-10-09 17:36:13,015 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=81363.33333333333, ans=0.125 2023-10-09 17:36:23,142 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=81410.0, ans=0.2 2023-10-09 17:36:30,966 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.68 vs. limit=15.0 2023-10-09 17:37:01,535 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=81550.0, ans=0.2 2023-10-09 17:37:02,633 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=81550.0, ans=0.0 2023-10-09 17:37:06,707 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.89 vs. limit=6.0 2023-10-09 17:37:14,762 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=81596.66666666667, ans=0.125 2023-10-09 17:37:17,157 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=81596.66666666667, ans=0.125 2023-10-09 17:37:21,159 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=4.67 vs. limit=15.0 2023-10-09 17:37:26,055 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-09 17:37:30,415 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=81690.0, ans=0.1 2023-10-09 17:37:38,016 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.661e+02 2.095e+02 2.292e+02 2.573e+02 3.685e+02, threshold=4.584e+02, percent-clipped=0.0 2023-10-09 17:37:38,396 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=81690.0, ans=0.09899494936611666 2023-10-09 17:37:47,494 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=81736.66666666667, ans=0.0 2023-10-09 17:37:59,529 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=81783.33333333333, ans=0.1 2023-10-09 17:38:12,952 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.80 vs. limit=22.5 2023-10-09 17:38:12,996 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.45 vs. 
limit=15.0 2023-10-09 17:39:02,949 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=82063.33333333333, ans=0.0 2023-10-09 17:39:18,707 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=82110.0, ans=0.0 2023-10-09 17:39:30,048 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.584e+02 2.084e+02 2.510e+02 2.904e+02 5.296e+02, threshold=5.020e+02, percent-clipped=4.0 2023-10-09 17:39:32,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=82203.33333333333, ans=0.125 2023-10-09 17:39:44,856 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=82250.0, ans=0.035 2023-10-09 17:39:50,811 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=82250.0, ans=0.07 2023-10-09 17:39:58,480 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.52 vs. limit=15.0 2023-10-09 17:40:18,362 INFO [train.py:1031] (0/4) Epoch 2, batch 4000, loss[loss=0.3266, simple_loss=0.3988, pruned_loss=0.1272, over 16591.00 frames. ], tot_loss[loss=0.3095, simple_loss=0.3722, pruned_loss=0.1234, over 28326990.02 frames. ], batch size: 219, lr: 2.37e-02, grad_scale: 32.0 2023-10-09 17:40:33,881 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=82436.66666666667, ans=0.0 2023-10-09 17:41:07,387 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=82576.66666666667, ans=0.0 2023-10-09 17:41:15,767 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=82623.33333333333, ans=0.0 2023-10-09 17:41:25,365 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.535e+02 2.183e+02 2.508e+02 2.978e+02 4.155e+02, threshold=5.017e+02, percent-clipped=0.0 2023-10-09 17:41:28,683 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=82670.0, ans=0.1 2023-10-09 17:42:36,566 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=82950.0, ans=0.125 2023-10-09 17:42:40,053 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=82950.0, ans=0.125 2023-10-09 17:42:53,529 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=82996.66666666667, ans=0.125 2023-10-09 17:43:13,539 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=83090.0, ans=0.0 2023-10-09 17:43:17,351 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.731e+02 2.276e+02 2.662e+02 3.204e+02 4.781e+02, threshold=5.324e+02, percent-clipped=0.0 2023-10-09 17:43:29,893 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=83136.66666666667, ans=0.0 2023-10-09 17:43:32,779 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 17:43:59,370 INFO [scaling.py:199] (0/4) 
ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=83230.0, ans=0.125 2023-10-09 17:44:02,970 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=83230.0, ans=0.125 2023-10-09 17:44:02,984 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=83230.0, ans=0.125 2023-10-09 17:44:03,162 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.56 vs. limit=15.0 2023-10-09 17:44:24,961 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=83323.33333333333, ans=0.0 2023-10-09 17:44:31,703 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=83370.0, ans=0.125 2023-10-09 17:44:35,124 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=83370.0, ans=0.125 2023-10-09 17:44:44,632 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=83370.0, ans=0.2 2023-10-09 17:44:45,595 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=83370.0, ans=0.125 2023-10-09 17:45:15,116 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=83510.0, ans=0.2 2023-10-09 17:45:16,772 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=83510.0, ans=0.125 2023-10-09 17:45:21,627 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=83510.0, ans=0.125 2023-10-09 17:45:37,680 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.533e+02 2.118e+02 2.537e+02 2.829e+02 3.904e+02, threshold=5.074e+02, percent-clipped=0.0 2023-10-09 17:45:40,350 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.89 vs. limit=15.0 2023-10-09 17:46:04,016 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=83696.66666666667, ans=0.1 2023-10-09 17:46:31,595 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=83790.0, ans=0.0 2023-10-09 17:46:35,630 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=83836.66666666667, ans=0.125 2023-10-09 17:46:42,712 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=83836.66666666667, ans=0.125 2023-10-09 17:47:20,855 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.50 vs. 
limit=15.0 2023-10-09 17:47:29,587 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.732e+02 2.163e+02 2.368e+02 2.753e+02 3.959e+02, threshold=4.737e+02, percent-clipped=0.0 2023-10-09 17:47:34,810 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=84070.0, ans=0.0 2023-10-09 17:47:44,563 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=84116.66666666667, ans=0.125 2023-10-09 17:48:04,687 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=84163.33333333333, ans=0.0 2023-10-09 17:48:42,689 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=84350.0, ans=0.0 2023-10-09 17:49:33,472 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.735e+02 2.247e+02 2.592e+02 2.795e+02 4.563e+02, threshold=5.184e+02, percent-clipped=0.0 2023-10-09 17:49:37,163 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=84536.66666666667, ans=0.0 2023-10-09 17:50:11,499 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=84676.66666666667, ans=0.125 2023-10-09 17:50:21,321 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=84676.66666666667, ans=0.125 2023-10-09 17:50:21,392 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=84676.66666666667, ans=0.025 2023-10-09 17:50:21,716 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.27 vs. limit=22.5 2023-10-09 17:50:23,739 INFO [train.py:1031] (0/4) Epoch 2, batch 4500, loss[loss=0.2883, simple_loss=0.3594, pruned_loss=0.1086, over 16755.00 frames. ], tot_loss[loss=0.3089, simple_loss=0.3721, pruned_loss=0.1228, over 29314309.96 frames. 
], batch size: 81, lr: 2.34e-02, grad_scale: 32.0 2023-10-09 17:50:30,730 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=84723.33333333333, ans=0.125 2023-10-09 17:50:45,328 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=84816.66666666667, ans=0.125 2023-10-09 17:50:53,554 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=84816.66666666667, ans=0.0 2023-10-09 17:50:53,592 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=84816.66666666667, ans=0.125 2023-10-09 17:50:58,068 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=84863.33333333333, ans=0.125 2023-10-09 17:51:01,611 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=84863.33333333333, ans=0.0 2023-10-09 17:51:16,745 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=84910.0, ans=0.025 2023-10-09 17:51:29,528 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.521e+02 1.981e+02 2.305e+02 2.900e+02 5.422e+02, threshold=4.610e+02, percent-clipped=3.0 2023-10-09 17:51:38,893 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=85003.33333333333, ans=0.125 2023-10-09 17:51:41,582 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=85050.0, ans=0.2 2023-10-09 17:51:49,857 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=85050.0, ans=0.125 2023-10-09 17:51:52,611 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=85096.66666666667, ans=0.2 2023-10-09 17:51:55,874 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=85096.66666666667, ans=0.5 2023-10-09 17:52:27,655 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=85236.66666666667, ans=0.2 2023-10-09 17:52:50,564 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=85330.0, ans=0.2 2023-10-09 17:53:02,955 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=85376.66666666667, ans=0.0 2023-10-09 17:53:06,743 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=85423.33333333333, ans=0.015 2023-10-09 17:53:09,010 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=85423.33333333333, ans=0.125 2023-10-09 17:53:15,958 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.648e+02 2.103e+02 2.328e+02 2.713e+02 4.473e+02, threshold=4.656e+02, percent-clipped=0.0 2023-10-09 17:53:22,312 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=85470.0, ans=0.125 2023-10-09 17:53:23,573 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, 
metric=4.21 vs. limit=12.0 2023-10-09 17:53:28,718 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.51 vs. limit=6.0 2023-10-09 17:53:45,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=85563.33333333333, ans=0.0 2023-10-09 17:53:56,123 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=85610.0, ans=0.1 2023-10-09 17:53:58,153 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=85610.0, ans=0.125 2023-10-09 17:54:08,406 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=7.81 vs. limit=12.0 2023-10-09 17:54:37,027 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=85750.0, ans=0.05 2023-10-09 17:54:41,973 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=85796.66666666667, ans=0.1 2023-10-09 17:54:45,633 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=85796.66666666667, ans=0.125 2023-10-09 17:54:53,320 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=85843.33333333333, ans=0.125 2023-10-09 17:54:56,730 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=85843.33333333333, ans=0.0 2023-10-09 17:55:00,164 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.37 vs. limit=15.0 2023-10-09 17:55:10,648 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.650e+02 2.318e+02 2.569e+02 3.235e+02 4.756e+02, threshold=5.137e+02, percent-clipped=1.0 2023-10-09 17:55:11,787 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=85936.66666666667, ans=0.125 2023-10-09 17:55:22,923 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=85983.33333333333, ans=0.035 2023-10-09 17:55:52,818 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.54 vs. limit=15.0 2023-10-09 17:55:54,699 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.51 vs. limit=15.0 2023-10-09 17:56:13,609 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=86170.0, ans=0.125 2023-10-09 17:56:33,484 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=86263.33333333333, ans=0.025 2023-10-09 17:56:42,002 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.82 vs. 
limit=6.0 2023-10-09 17:56:46,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=86310.0, ans=0.125 2023-10-09 17:57:02,919 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.745e+02 2.250e+02 2.637e+02 3.037e+02 4.822e+02, threshold=5.274e+02, percent-clipped=0.0 2023-10-09 17:57:15,497 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=86450.0, ans=0.2 2023-10-09 17:57:32,541 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=86496.66666666667, ans=0.1 2023-10-09 17:57:35,404 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=6.90 vs. limit=15.0 2023-10-09 17:57:47,645 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=86590.0, ans=0.125 2023-10-09 17:57:50,522 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=86590.0, ans=0.0 2023-10-09 17:57:50,667 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=86590.0, ans=0.125 2023-10-09 17:57:58,163 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=86590.0, ans=0.125 2023-10-09 17:58:06,218 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=86636.66666666667, ans=0.125 2023-10-09 17:58:08,547 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=86636.66666666667, ans=0.0 2023-10-09 17:58:14,815 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0 2023-10-09 17:58:54,654 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=86823.33333333333, ans=0.0 2023-10-09 17:58:57,914 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.655e+02 2.002e+02 2.329e+02 2.614e+02 3.537e+02, threshold=4.658e+02, percent-clipped=0.0 2023-10-09 17:59:00,521 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=86870.0, ans=0.125 2023-10-09 17:59:06,792 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=86870.0, ans=0.125 2023-10-09 17:59:10,080 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=86916.66666666667, ans=0.125 2023-10-09 17:59:28,462 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=86963.33333333333, ans=0.0 2023-10-09 17:59:28,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=86963.33333333333, ans=0.0 2023-10-09 17:59:43,099 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=87010.0, ans=0.125 2023-10-09 17:59:44,639 INFO [train.py:1031] (0/4) Epoch 2, batch 5000, loss[loss=0.3525, simple_loss=0.3977, pruned_loss=0.1536, over 16479.00 frames. 
], tot_loss[loss=0.3077, simple_loss=0.371, pruned_loss=0.1222, over 30074125.66 frames. ], batch size: 266, lr: 2.31e-02, grad_scale: 32.0 2023-10-09 17:59:51,067 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=87056.66666666667, ans=0.2 2023-10-09 17:59:58,074 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=87103.33333333333, ans=0.2 2023-10-09 18:00:04,559 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=87103.33333333333, ans=0.5 2023-10-09 18:00:06,411 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=87103.33333333333, ans=0.125 2023-10-09 18:00:20,903 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.47 vs. limit=15.0 2023-10-09 18:00:24,105 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=87196.66666666667, ans=0.0 2023-10-09 18:00:25,003 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=87196.66666666667, ans=0.95 2023-10-09 18:00:42,653 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=87290.0, ans=0.0 2023-10-09 18:00:51,275 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.634e+02 2.135e+02 2.489e+02 2.870e+02 4.390e+02, threshold=4.978e+02, percent-clipped=0.0 2023-10-09 18:00:52,591 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=87336.66666666667, ans=0.125 2023-10-09 18:00:53,480 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=87336.66666666667, ans=0.125 2023-10-09 18:01:03,442 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.70 vs. 
limit=22.5 2023-10-09 18:01:20,134 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=87430.0, ans=0.1 2023-10-09 18:01:21,049 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=87430.0, ans=0.125 2023-10-09 18:01:21,081 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=87430.0, ans=0.2 2023-10-09 18:02:06,847 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=87616.66666666667, ans=0.0 2023-10-09 18:02:16,032 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=87663.33333333333, ans=0.2 2023-10-09 18:02:23,824 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 18:02:28,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=87710.0, ans=0.09899494936611666 2023-10-09 18:02:48,252 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=87756.66666666667, ans=0.125 2023-10-09 18:02:49,729 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.706e+02 2.067e+02 2.325e+02 2.720e+02 4.062e+02, threshold=4.650e+02, percent-clipped=0.0 2023-10-09 18:02:50,070 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=87803.33333333333, ans=0.0 2023-10-09 18:02:51,434 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.81 vs. limit=12.0 2023-10-09 18:03:11,332 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.82 vs. limit=15.0 2023-10-09 18:03:15,758 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=87896.66666666667, ans=0.125 2023-10-09 18:03:29,238 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.14 vs. limit=22.5 2023-10-09 18:03:36,116 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=87990.0, ans=0.125 2023-10-09 18:03:43,674 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=87990.0, ans=10.0 2023-10-09 18:04:37,816 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.97 vs. 
limit=22.5 2023-10-09 18:04:40,744 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.647e+02 2.120e+02 2.354e+02 2.833e+02 4.221e+02, threshold=4.708e+02, percent-clipped=0.0 2023-10-09 18:04:45,658 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=88270.0, ans=0.0 2023-10-09 18:05:02,970 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=88316.66666666667, ans=0.1 2023-10-09 18:06:03,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=88550.0, ans=0.0 2023-10-09 18:06:04,703 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=88550.0, ans=0.125 2023-10-09 18:06:25,912 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=88643.33333333333, ans=0.2 2023-10-09 18:06:27,606 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=88643.33333333333, ans=0.125 2023-10-09 18:06:33,329 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=88643.33333333333, ans=0.2 2023-10-09 18:06:37,540 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=88690.0, ans=0.125 2023-10-09 18:06:44,234 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=88690.0, ans=0.125 2023-10-09 18:06:49,977 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.669e+02 2.019e+02 2.227e+02 2.783e+02 4.113e+02, threshold=4.454e+02, percent-clipped=0.0 2023-10-09 18:07:00,616 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=88783.33333333333, ans=0.2 2023-10-09 18:07:17,678 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=88830.0, ans=0.0 2023-10-09 18:07:27,506 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=88876.66666666667, ans=0.0 2023-10-09 18:07:32,069 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=88876.66666666667, ans=0.125 2023-10-09 18:07:34,138 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=88923.33333333333, ans=0.0 2023-10-09 18:07:38,211 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=88923.33333333333, ans=0.125 2023-10-09 18:07:59,889 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=89016.66666666667, ans=0.125 2023-10-09 18:08:09,749 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=89016.66666666667, ans=0.125 2023-10-09 18:08:25,214 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=89110.0, ans=0.125 2023-10-09 18:08:39,194 INFO [scaling.py:199] (0/4) ScheduledFloat: 
2023-10-09 18:08:39,194 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=89156.66666666667, ans=0.1
2023-10-09 18:08:46,774 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.537e+02 2.064e+02 2.291e+02 2.713e+02 5.102e+02, threshold=4.582e+02, percent-clipped=1.0
2023-10-09 18:08:57,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=89203.33333333333, ans=0.95
2023-10-09 18:09:11,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=89250.0, ans=0.0
2023-10-09 18:09:19,860 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=89296.66666666667, ans=0.125
2023-10-09 18:09:23,144 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=89296.66666666667, ans=0.125
2023-10-09 18:09:27,819 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=89343.33333333333, ans=0.0
2023-10-09 18:09:36,438 INFO [train.py:1031] (0/4) Epoch 2, batch 5500, loss[loss=0.278, simple_loss=0.3501, pruned_loss=0.103, over 15888.00 frames. ], tot_loss[loss=0.306, simple_loss=0.3699, pruned_loss=0.1211, over 30680556.46 frames. ], batch size: 43, lr: 2.28e-02, grad_scale: 32.0
2023-10-09 18:09:44,090 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=89390.0, ans=0.0
2023-10-09 18:09:49,332 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=89436.66666666667, ans=0.0
2023-10-09 18:10:03,709 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=89483.33333333333, ans=0.0
2023-10-09 18:10:29,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=89623.33333333333, ans=10.0
2023-10-09 18:10:33,378 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.05 vs. limit=15.0
2023-10-09 18:10:36,661 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=89623.33333333333, ans=0.125
2023-10-09 18:10:36,891 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.47 vs. limit=22.5
2023-10-09 18:10:37,297 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.604e+02 2.043e+02 2.375e+02 2.896e+02 4.278e+02, threshold=4.750e+02, percent-clipped=0.0
2023-10-09 18:11:29,258 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=89856.66666666667, ans=0.5
2023-10-09 18:11:33,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=89903.33333333333, ans=0.125
2023-10-09 18:11:36,188 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=89903.33333333333, ans=0.0
2023-10-09 18:11:37,709 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.45 vs. limit=15.0
2023-10-09 18:11:46,766 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-09 18:12:23,619 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=90090.0, ans=0.125
2023-10-09 18:12:27,008 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.521e+02 2.222e+02 2.575e+02 3.125e+02 5.556e+02, threshold=5.150e+02, percent-clipped=3.0
2023-10-09 18:12:30,105 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=90136.66666666667, ans=0.125
2023-10-09 18:12:31,271 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=90136.66666666667, ans=0.2
2023-10-09 18:12:49,695 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.65 vs. limit=15.0
2023-10-09 18:13:20,271 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=90323.33333333333, ans=0.125
2023-10-09 18:13:31,726 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=90370.0, ans=0.0
2023-10-09 18:13:33,662 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=90370.0, ans=0.0
2023-10-09 18:13:39,881 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=90416.66666666667, ans=0.1
2023-10-09 18:13:40,825 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=90416.66666666667, ans=0.125
2023-10-09 18:14:02,378 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=90510.0, ans=0.0
2023-10-09 18:14:13,132 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=90556.66666666667, ans=0.0
2023-10-09 18:14:22,379 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.602e+02 2.220e+02 2.533e+02 2.773e+02 4.529e+02, threshold=5.067e+02, percent-clipped=0.0
2023-10-09 18:14:27,880 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=90603.33333333333, ans=0.1
2023-10-09 18:14:30,427 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=90603.33333333333, ans=0.125
2023-10-09 18:14:30,559 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.11 vs. limit=12.0
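Each Whitening entry compares a measured whiteness statistic of a module's activations against a scheduled limit; values above the limit trigger a corrective gradient term. A hedged sketch of one such statistic follows: the ratio below equals 1.0 when the channel covariance is a multiple of the identity and grows with anisotropy. This is a stand-in diagnostic, not the actual scaling.py Whiten module.

```python
# Whiteness diagnostic in the spirit of the "metric=X vs. limit=Y" lines:
# 1.0 for perfectly "white" features, larger as channels become correlated
# or unevenly scaled. Illustrative; the real Whiten module differs.
import torch

def whiteness_metric(feats: torch.Tensor) -> float:
    # feats: (num_frames, num_channels)
    feats = feats - feats.mean(dim=0, keepdim=True)
    cov = (feats.T @ feats) / feats.shape[0]   # channel covariance
    d = cov.shape[0]
    # d * tr(C^2) / tr(C)^2 == E[lambda^2] / E[lambda]^2 >= 1,
    # with equality iff all eigenvalues are equal.
    return (d * torch.trace(cov @ cov) / torch.trace(cov) ** 2).item()

x_white = torch.randn(1000, 384)
x_skewed = x_white * torch.linspace(0.1, 3.0, 384)  # anisotropic channels
print(whiteness_metric(x_white))    # close to 1
print(whiteness_metric(x_skewed))   # well above 1, i.e. metric >> limit
```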
2023-10-09 18:14:30,716 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.28 vs. limit=15.0
2023-10-09 18:14:31,331 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=90603.33333333333, ans=0.125
2023-10-09 18:14:32,078 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=90603.33333333333, ans=0.125
2023-10-09 18:14:51,873 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=90696.66666666667, ans=0.125
2023-10-09 18:15:17,067 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=90790.0, ans=0.1
2023-10-09 18:15:44,499 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=90930.0, ans=0.125
2023-10-09 18:15:49,097 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=90930.0, ans=0.125
2023-10-09 18:15:58,493 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=90976.66666666667, ans=0.2
2023-10-09 18:15:59,314 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=90976.66666666667, ans=0.0
2023-10-09 18:16:07,624 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=91023.33333333333, ans=0.125
2023-10-09 18:16:15,096 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=91023.33333333333, ans=0.0
2023-10-09 18:16:15,705 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 2.025e+02 2.330e+02 2.746e+02 4.329e+02, threshold=4.659e+02, percent-clipped=0.0
2023-10-09 18:16:21,800 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=91070.0, ans=0.125
2023-10-09 18:16:28,286 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=91116.66666666667, ans=0.0
2023-10-09 18:16:33,862 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=91116.66666666667, ans=0.2
2023-10-09 18:16:53,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=91210.0, ans=0.1
2023-10-09 18:16:53,160 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=91210.0, ans=0.2
2023-10-09 18:17:07,686 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.22 vs. limit=15.0
2023-10-09 18:17:24,016 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=91303.33333333333, ans=0.125
2023-10-09 18:17:32,269 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=91350.0, ans=0.125
2023-10-09 18:17:46,573 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=91396.66666666667, ans=0.0
2023-10-09 18:17:52,487 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=91396.66666666667, ans=0.125
2023-10-09 18:17:52,948 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.28 vs. limit=22.5
2023-10-09 18:18:15,187 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.716e+02 2.119e+02 2.482e+02 2.997e+02 4.498e+02, threshold=4.963e+02, percent-clipped=0.0
2023-10-09 18:18:37,614 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.00 vs. limit=22.5
2023-10-09 18:18:47,266 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=91630.0, ans=0.1
2023-10-09 18:18:48,246 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=91676.66666666667, ans=0.125
2023-10-09 18:18:54,031 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-09 18:18:57,843 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.33 vs. limit=22.5
2023-10-09 18:19:00,598 INFO [train.py:1031] (0/4) Epoch 2, batch 6000, loss[loss=0.3511, simple_loss=0.3953, pruned_loss=0.1535, over 16418.00 frames. ], tot_loss[loss=0.3056, simple_loss=0.3696, pruned_loss=0.1208, over 31162003.07 frames. ], batch size: 266, lr: 2.26e-02, grad_scale: 32.0
2023-10-09 18:19:06,123 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=91723.33333333333, ans=0.125
2023-10-09 18:19:07,064 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=91723.33333333333, ans=0.2
2023-10-09 18:19:37,734 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=91863.33333333333, ans=0.07
2023-10-09 18:19:43,407 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.94 vs. limit=22.5
2023-10-09 18:19:58,507 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.05 vs. limit=15.0
2023-10-09 18:20:09,379 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.752e+02 2.201e+02 2.469e+02 2.948e+02 4.395e+02, threshold=4.938e+02, percent-clipped=0.0
2023-10-09 18:20:47,249 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=92143.33333333333, ans=0.125
2023-10-09 18:21:29,040 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.91 vs. limit=15.0
2023-10-09 18:21:40,977 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=92330.0, ans=0.125
2023-10-09 18:21:45,718 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=92376.66666666667, ans=0.0
2023-10-09 18:21:55,346 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=92423.33333333333, ans=0.1
2023-10-09 18:21:56,309 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=92423.33333333333, ans=0.125
2023-10-09 18:22:05,251 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.625e+02 2.170e+02 2.435e+02 2.810e+02 4.259e+02, threshold=4.871e+02, percent-clipped=0.0
2023-10-09 18:22:09,300 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.15 vs. limit=15.0
2023-10-09 18:22:13,697 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=92470.0, ans=0.0
2023-10-09 18:22:16,342 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=92470.0, ans=0.125
2023-10-09 18:22:30,410 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=92516.66666666667, ans=0.125
2023-10-09 18:22:40,278 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.56 vs. limit=10.0
2023-10-09 18:22:52,412 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=92610.0, ans=0.2
2023-10-09 18:22:53,371 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=92610.0, ans=0.0
2023-10-09 18:22:53,504 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=92610.0, ans=0.2
2023-10-09 18:22:59,355 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. limit=6.0
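The optim.py entries summarize the recent distribution of gradient norms as five quantiles (min, 25%, median, 75%, max), a clipping threshold derived from them, and the percentage of batches clipped. A hedged sketch of threshold-from-statistics clipping follows; the window size, the scale factor applied to the median, and the class name are assumptions, not the actual optim.py logic.

```python
# Sketch of adaptive gradient clipping driven by recent grad-norm
# statistics, mirroring the "grad-norm quartiles ... threshold=..."
# lines. Illustrative; icefall's optimizer derives its threshold
# differently.
from collections import deque
import torch

class QuartileClipper:
    def __init__(self, clipping_scale: float = 2.0, window: int = 100):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)
        self.num_clipped = 0
        self.num_seen = 0

    def clip_(self, parameters) -> float:
        params = [p for p in parameters if p.grad is not None]
        norm = torch.norm(torch.stack([p.grad.norm() for p in params]))
        self.norms.append(norm.item())
        self.num_seen += 1
        q = torch.quantile(
            torch.tensor(list(self.norms)),
            torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]),
        )  # the five values printed as "grad-norm quartiles"
        threshold = self.clipping_scale * q[2].item()  # scale x median
        if norm.item() > threshold:
            self.num_clipped += 1
            for p in params:
                p.grad.mul_(threshold / norm)
        return 100.0 * self.num_clipped / self.num_seen  # percent-clipped
```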
2023-10-09 18:23:04,342 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.82 vs. limit=15.0
2023-10-09 18:23:17,481 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=92703.33333333333, ans=0.0
2023-10-09 18:23:52,848 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=92843.33333333333, ans=0.04949747468305833
2023-10-09 18:24:01,687 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=92890.0, ans=0.0
2023-10-09 18:24:03,831 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.684e+02 2.089e+02 2.305e+02 2.729e+02 3.889e+02, threshold=4.611e+02, percent-clipped=0.0
2023-10-09 18:24:11,860 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=92936.66666666667, ans=0.0
2023-10-09 18:24:20,735 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.whiten.whitening_limit, batch_count=92983.33333333333, ans=12.0
2023-10-09 18:24:24,627 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=92983.33333333333, ans=0.0
2023-10-09 18:24:27,823 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.26 vs. limit=6.0
2023-10-09 18:25:19,182 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=93170.0, ans=0.0
2023-10-09 18:25:19,183 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=93170.0, ans=0.125
2023-10-09 18:25:20,204 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=93170.0, ans=0.125
2023-10-09 18:25:40,124 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=93263.33333333333, ans=0.1
2023-10-09 18:25:44,177 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=93263.33333333333, ans=0.1
2023-10-09 18:25:44,336 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=93263.33333333333, ans=0.05
2023-10-09 18:25:45,239 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=93263.33333333333, ans=0.125
2023-10-09 18:25:50,741 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=93310.0, ans=0.125
2023-10-09 18:26:03,480 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.whiten.whitening_limit, batch_count=93356.66666666667, ans=12.0
2023-10-09 18:26:12,644 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.840e+02 2.217e+02 2.714e+02 3.194e+02 4.455e+02, threshold=5.429e+02, percent-clipped=0.0
2023-10-09 18:26:15,854 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=93403.33333333333, ans=0.125
2023-10-09 18:26:24,591 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=93403.33333333333, ans=15.0
2023-10-09 18:26:29,785 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.59 vs. limit=15.0
2023-10-09 18:26:45,650 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=93496.66666666667, ans=0.125
2023-10-09 18:26:59,758 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=93543.33333333333, ans=0.125
2023-10-09 18:27:05,241 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=93590.0, ans=0.125
2023-10-09 18:27:32,454 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=93683.33333333333, ans=0.125
2023-10-09 18:27:53,802 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=93776.66666666667, ans=0.125
2023-10-09 18:28:10,307 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.537e+02 2.048e+02 2.318e+02 2.828e+02 3.945e+02, threshold=4.636e+02, percent-clipped=0.0
2023-10-09 18:28:16,670 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.26 vs. limit=22.5
2023-10-09 18:28:19,086 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=93870.0, ans=0.125
2023-10-09 18:28:27,459 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=93916.66666666667, ans=0.0
2023-10-09 18:28:28,730 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=93916.66666666667, ans=15.0
2023-10-09 18:28:44,601 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.23 vs. limit=15.0
2023-10-09 18:28:58,067 INFO [train.py:1031] (0/4) Epoch 2, batch 6500, loss[loss=0.3344, simple_loss=0.3939, pruned_loss=0.1375, over 16536.00 frames. ], tot_loss[loss=0.305, simple_loss=0.3694, pruned_loss=0.1203, over 31558533.44 frames. ], batch size: 219, lr: 2.23e-02, grad_scale: 32.0
2023-10-09 18:29:02,797 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=94056.66666666667, ans=0.125
2023-10-09 18:29:19,660 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.81 vs. limit=5.0
2023-10-09 18:29:31,699 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=94150.0, ans=0.125
2023-10-09 18:29:59,977 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=94243.33333333333, ans=0.07
2023-10-09 18:30:05,371 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=94243.33333333333, ans=0.0
2023-10-09 18:30:11,161 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=94290.0, ans=0.125
2023-10-09 18:30:21,620 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.781e+02 2.124e+02 2.381e+02 2.650e+02 4.551e+02, threshold=4.762e+02, percent-clipped=0.0
2023-10-09 18:30:22,147 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=94336.66666666667, ans=0.125
2023-10-09 18:30:27,964 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=94336.66666666667, ans=0.0
2023-10-09 18:30:37,103 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=94383.33333333333, ans=0.125
2023-10-09 18:30:59,227 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=94476.66666666667, ans=0.125
2023-10-09 18:32:01,559 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.57 vs. limit=15.0
2023-10-09 18:32:07,389 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=94710.0, ans=0.125
2023-10-09 18:32:28,032 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.604e+02 2.216e+02 2.410e+02 3.051e+02 5.962e+02, threshold=4.819e+02, percent-clipped=3.0
2023-10-09 18:32:51,530 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.82 vs. limit=10.0
2023-10-09 18:33:17,809 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=94990.0, ans=0.0
2023-10-09 18:33:54,854 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=95130.0, ans=0.125
2023-10-09 18:34:02,846 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.19 vs. limit=6.0
2023-10-09 18:34:16,906 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=95223.33333333333, ans=0.125
2023-10-09 18:34:19,555 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=95223.33333333333, ans=22.5
2023-10-09 18:34:22,471 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=95223.33333333333, ans=0.1
2023-10-09 18:34:25,947 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.627e+02 2.045e+02 2.320e+02 2.751e+02 4.314e+02, threshold=4.640e+02, percent-clipped=0.0
2023-10-09 18:34:33,151 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-09 18:34:40,277 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=95316.66666666667, ans=0.125
2023-10-09 18:35:02,857 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.50 vs. limit=15.0
2023-10-09 18:35:13,567 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=95410.0, ans=0.125
2023-10-09 18:35:23,126 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=95456.66666666667, ans=0.125
2023-10-09 18:35:28,430 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.71 vs. limit=22.5
2023-10-09 18:35:33,032 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=95503.33333333333, ans=0.0
2023-10-09 18:35:41,704 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=95503.33333333333, ans=0.2
2023-10-09 18:35:56,184 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=95550.0, ans=0.125
2023-10-09 18:36:01,714 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.46 vs. limit=22.5
2023-10-09 18:36:38,814 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.417e+02 1.877e+02 2.135e+02 2.470e+02 4.497e+02, threshold=4.270e+02, percent-clipped=0.0
2023-10-09 18:37:33,280 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=95923.33333333333, ans=0.2
2023-10-09 18:37:35,850 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
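WithLoss entries report the accumulated auxiliary loss attached to an attention-weights tensor; loss-sum=0.000e+00 means the penalty was inactive on that batch. A hedged sketch of the general mechanism follows, passing a tensor through unchanged while injecting an extra gradient term; the quadratic penalty and the aux_scale value are assumptions, not the actual scaling.py WithLoss.

```python
# Sketch of attaching an auxiliary penalty to an intermediate tensor:
# forward is the identity, backward adds the penalty's gradient.
# Illustrative only; the penalty form and scale are assumed.
import torch

class WithAuxLoss(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x: torch.Tensor, aux_scale: float) -> torch.Tensor:
        ctx.save_for_backward(x)
        ctx.aux_scale = aux_scale
        return x

    @staticmethod
    def backward(ctx, grad_out: torch.Tensor):
        (x,) = ctx.saved_tensors
        # gradient of aux_scale * 0.5 * x^2, added to the main gradient
        return grad_out + ctx.aux_scale * x, None

attn_weights = torch.randn(4, 16, requires_grad=True)
out = WithAuxLoss.apply(attn_weights, 1e-4)
out.sum().backward()  # attn_weights.grad now includes the penalty term
```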
2023-10-09 18:37:40,734 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.01 vs. limit=22.5
2023-10-09 18:37:46,134 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=95970.0, ans=0.125
2023-10-09 18:38:12,350 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=96063.33333333333, ans=0.125
2023-10-09 18:38:13,465 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.74 vs. limit=15.0
2023-10-09 18:38:20,644 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=96110.0, ans=0.125
2023-10-09 18:38:24,470 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=96110.0, ans=0.0
2023-10-09 18:38:38,323 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.04 vs. limit=15.0
2023-10-09 18:38:43,218 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.697e+02 2.184e+02 2.413e+02 2.807e+02 3.783e+02, threshold=4.827e+02, percent-clipped=0.0
2023-10-09 18:38:43,684 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-09 18:38:49,908 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.68 vs. limit=15.0
2023-10-09 18:38:58,063 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=96250.0, ans=0.125
2023-10-09 18:38:58,981 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=96250.0, ans=0.0
2023-10-09 18:39:00,927 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=96250.0, ans=0.125
2023-10-09 18:39:02,965 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.86 vs. limit=15.0
2023-10-09 18:39:18,424 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.86 vs. limit=15.0
2023-10-09 18:39:20,836 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=96343.33333333333, ans=0.05
2023-10-09 18:39:28,372 INFO [train.py:1031] (0/4) Epoch 2, batch 7000, loss[loss=0.3459, simple_loss=0.4034, pruned_loss=0.1442, over 16066.00 frames. ], tot_loss[loss=0.3034, simple_loss=0.3687, pruned_loss=0.1191, over 31851427.12 frames. ], batch size: 43, lr: 2.21e-02, grad_scale: 32.0
2023-10-09 18:39:28,732 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=96390.0, ans=0.125
2023-10-09 18:40:12,787 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=96483.33333333333, ans=0.125
2023-10-09 18:40:15,798 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=96530.0, ans=0.1
2023-10-09 18:40:18,068 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=96530.0, ans=0.125
2023-10-09 18:40:28,290 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=96576.66666666667, ans=0.125
2023-10-09 18:40:42,726 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.71 vs. limit=12.0
2023-10-09 18:40:45,917 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.724e+02 2.087e+02 2.382e+02 2.743e+02 4.279e+02, threshold=4.765e+02, percent-clipped=0.0
2023-10-09 18:40:51,743 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=96670.0, ans=10.0
2023-10-09 18:41:04,097 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=96716.66666666667, ans=0.125
2023-10-09 18:41:07,187 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=96763.33333333333, ans=0.025
2023-10-09 18:41:07,321 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=96763.33333333333, ans=0.2
2023-10-09 18:41:08,969 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=96763.33333333333, ans=0.05
2023-10-09 18:41:11,837 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=96763.33333333333, ans=0.0
2023-10-09 18:41:32,312 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=96856.66666666667, ans=0.125
2023-10-09 18:41:36,773 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=96856.66666666667, ans=0.1
2023-10-09 18:41:42,327 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.94 vs. limit=15.0
2023-10-09 18:41:44,209 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=96903.33333333333, ans=0.0
2023-10-09 18:42:00,506 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=96950.0, ans=0.0
2023-10-09 18:42:04,996 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=96996.66666666667, ans=0.0
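The batch summaries report three numbers per batch: simple_loss from the cheap linear lattice, pruned_loss from the pruned lattice, and their combination. The printed values are consistent with the simple term entering at half weight and the pruned term at full weight; the check below verifies that against the batch 6000 and batch 7000 summaries (the two scale factors are inferred from the printed numbers, not read from this run's configuration).

```python
# loss = 0.5 * simple_loss + 1.0 * pruned_loss reproduces the printed
# combined losses; the scale factors below are inferred from the log.
def combined_loss(simple_loss: float, pruned_loss: float,
                  simple_scale: float = 0.5,
                  pruned_scale: float = 1.0) -> float:
    return simple_scale * simple_loss + pruned_scale * pruned_loss

assert abs(combined_loss(0.3953, 0.1535) - 0.3511) < 1e-3  # batch 6000
assert abs(combined_loss(0.4034, 0.1442) - 0.3459) < 1e-3  # batch 7000
assert abs(combined_loss(0.3687, 0.1191) - 0.3034) < 1e-3  # tot_loss at 7000
```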
2023-10-09 18:42:13,828 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=24.40 vs. limit=15.0
2023-10-09 18:42:14,480 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-09 18:42:26,682 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=97043.33333333333, ans=0.1
2023-10-09 18:42:41,742 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.651e+02 2.102e+02 2.329e+02 2.621e+02 4.367e+02, threshold=4.657e+02, percent-clipped=0.0
2023-10-09 18:42:52,268 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=97183.33333333333, ans=0.125
2023-10-09 18:43:09,469 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=97230.0, ans=0.1
2023-10-09 18:43:10,242 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=97230.0, ans=0.1
2023-10-09 18:43:11,307 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=97230.0, ans=0.1
2023-10-09 18:43:47,378 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=97370.0, ans=0.0
2023-10-09 18:43:54,101 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=97370.0, ans=0.025
2023-10-09 18:44:31,608 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=97510.0, ans=0.05
2023-10-09 18:44:32,479 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=97510.0, ans=0.125
2023-10-09 18:44:32,575 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=97510.0, ans=0.0
2023-10-09 18:44:34,579 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=97510.0, ans=0.125
2023-10-09 18:44:46,748 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=97556.66666666667, ans=0.125
2023-10-09 18:44:51,291 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.539e+02 2.121e+02 2.342e+02 2.687e+02 3.873e+02, threshold=4.685e+02, percent-clipped=0.0
2023-10-09 18:44:55,925 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2023-10-09 18:45:02,821 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=97650.0, ans=0.125
2023-10-09 18:45:03,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=97650.0, ans=0.0
2023-10-09 18:45:03,984 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=97650.0, ans=0.09899494936611666
2023-10-09 18:45:05,041 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=1.038e-02
2023-10-09 18:45:17,340 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=97696.66666666667, ans=0.035
2023-10-09 18:45:25,337 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=97743.33333333333, ans=0.125
2023-10-09 18:45:37,461 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-09 18:46:02,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=97883.33333333333, ans=0.125
2023-10-09 18:46:05,618 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=97883.33333333333, ans=0.0
2023-10-09 18:46:10,035 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=97883.33333333333, ans=0.0
2023-10-09 18:46:49,804 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=2.546e-03
2023-10-09 18:46:51,516 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.968e+02 2.283e+02 2.806e+02 4.340e+02, threshold=4.567e+02, percent-clipped=0.0
2023-10-09 18:47:01,852 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.whiten.whitening_limit, batch_count=98070.0, ans=15.0
2023-10-09 18:47:39,916 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=98256.66666666667, ans=0.125
2023-10-09 18:47:48,991 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=98303.33333333333, ans=0.0
2023-10-09 18:47:51,172 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.76 vs. limit=6.0
2023-10-09 18:47:57,912 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.79 vs. limit=15.0
2023-10-09 18:48:08,826 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=98396.66666666667, ans=0.125
2023-10-09 18:48:12,675 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys.whitening_limit, batch_count=98396.66666666667, ans=6.0
2023-10-09 18:48:40,257 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.588e+02 2.086e+02 2.361e+02 2.719e+02 4.040e+02, threshold=4.721e+02, percent-clipped=0.0
2023-10-09 18:48:54,689 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=98583.33333333333, ans=0.1
2023-10-09 18:49:10,850 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=98630.0, ans=0.1
2023-10-09 18:49:12,684 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=98676.66666666667, ans=0.125
2023-10-09 18:49:24,324 INFO [train.py:1031] (0/4) Epoch 2, batch 7500, loss[loss=0.3432, simple_loss=0.3957, pruned_loss=0.1454, over 16071.00 frames. ], tot_loss[loss=0.3027, simple_loss=0.3679, pruned_loss=0.1187, over 32046161.58 frames. ], batch size: 297, lr: 2.19e-02, grad_scale: 32.0
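Across the batch summaries the batch size swings widely (43, 266, 219, 297 cuts) while per-batch frame counts stay comparable: batches are packed to a total-duration budget rather than a fixed number of utterances. A hedged sketch of duration-budgeted batching follows; the budget value is illustrative, and lhotse's DynamicBucketingSampler is considerably more elaborate.

```python
# Greedy duration-budgeted batching: a batch is closed when adding the
# next cut would exceed the duration budget, so short-cut batches hold
# many utterances and long-cut batches hold few. Sketch only.
def make_batches(cut_durations, max_duration: float = 600.0):
    batches, current, total = [], [], 0.0
    for dur in sorted(cut_durations):  # sorting approximates bucketing
        if current and total + dur > max_duration:
            batches.append(current)
            current, total = [], 0.0
        current.append(dur)
        total += dur
    if current:
        batches.append(current)
    return batches

sizes = [len(b) for b in make_batches([1.5, 2.0, 8.5, 12.0, 25.0, 30.0] * 40)]
print(min(sizes), max(sizes))  # few long cuts vs. many short cuts per batch
```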
2023-10-09 18:50:02,104 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.05 vs. limit=10.0
2023-10-09 18:50:25,538 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=98956.66666666667, ans=0.125
2023-10-09 18:50:29,237 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=98956.66666666667, ans=0.125
2023-10-09 18:50:35,119 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.741e+02 2.126e+02 2.486e+02 3.068e+02 5.361e+02, threshold=4.972e+02, percent-clipped=5.0
2023-10-09 18:50:39,186 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=99003.33333333333, ans=0.1
2023-10-09 18:50:50,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=99050.0, ans=0.07
2023-10-09 18:50:57,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=99096.66666666667, ans=0.125
2023-10-09 18:50:59,402 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0
2023-10-09 18:51:00,233 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.95 vs. limit=22.5
2023-10-09 18:51:05,586 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-09 18:51:31,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=99236.66666666667, ans=0.125
2023-10-09 18:51:36,156 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=99236.66666666667, ans=0.125
2023-10-09 18:52:03,417 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.68 vs. limit=22.5
2023-10-09 18:52:03,649 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.61 vs. limit=22.5
2023-10-09 18:52:37,256 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.540e+02 2.029e+02 2.260e+02 2.569e+02 4.157e+02, threshold=4.520e+02, percent-clipped=0.0
2023-10-09 18:52:39,723 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=99470.0, ans=0.1
2023-10-09 18:52:56,078 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=99516.66666666667, ans=0.0
2023-10-09 18:53:02,312 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-09 18:53:02,471 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=99563.33333333333, ans=0.0
2023-10-09 18:53:03,605 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.54 vs. limit=15.0
2023-10-09 18:53:13,241 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.96 vs. limit=15.0
2023-10-09 18:53:46,153 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.52 vs. limit=15.0
2023-10-09 18:54:06,350 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=99843.33333333333, ans=0.125
2023-10-09 18:54:14,104 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=99843.33333333333, ans=0.125
2023-10-09 18:54:28,727 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 1.932e+02 2.184e+02 2.548e+02 4.032e+02, threshold=4.368e+02, percent-clipped=0.0
2023-10-09 18:54:29,021 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=99936.66666666667, ans=0.0
2023-10-09 18:54:41,023 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.16 vs. limit=15.0
2023-10-09 18:54:53,795 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=99983.33333333333, ans=0.0
2023-10-09 18:54:58,660 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=100030.0, ans=0.125
2023-10-09 18:55:07,028 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=100076.66666666667, ans=0.125
2023-10-09 18:55:17,635 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=100123.33333333333, ans=0.0
2023-10-09 18:55:20,513 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-10-09 18:55:47,206 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=100216.66666666667, ans=0.125
2023-10-09 18:55:48,526 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.73 vs. limit=12.0
2023-10-09 18:55:57,805 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=100263.33333333333, ans=0.0
2023-10-09 18:56:00,293 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.19 vs. limit=10.0
2023-10-09 18:56:01,488 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=100263.33333333333, ans=0.0
2023-10-09 18:56:08,241 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-09 18:56:32,503 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.712e+02 2.242e+02 2.671e+02 3.272e+02 4.658e+02, threshold=5.342e+02, percent-clipped=4.0
2023-10-09 18:56:38,712 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=100403.33333333333, ans=0.125
2023-10-09 18:56:49,632 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.23 vs. limit=10.0
2023-10-09 18:57:08,562 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=100543.33333333333, ans=0.125
2023-10-09 18:57:22,987 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=100590.0, ans=0.125
2023-10-09 18:57:29,996 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.80 vs. limit=22.5
2023-10-09 18:57:51,689 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.55 vs. limit=15.0
2023-10-09 18:58:03,306 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=100776.66666666667, ans=0.125
2023-10-09 18:58:18,703 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-09 18:58:26,750 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.506e+02 1.879e+02 2.190e+02 2.546e+02 4.210e+02, threshold=4.380e+02, percent-clipped=0.0
2023-10-09 18:58:59,274 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=100963.33333333333, ans=0.0
2023-10-09 18:59:03,835 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-09 18:59:17,706 INFO [train.py:1031] (0/4) Epoch 2, batch 8000, loss[loss=0.2712, simple_loss=0.3502, pruned_loss=0.09607, over 16888.00 frames. ], tot_loss[loss=0.3004, simple_loss=0.3663, pruned_loss=0.1172, over 32216625.45 frames. ], batch size: 77, lr: 2.16e-02, grad_scale: 32.0
2023-10-09 18:59:29,561 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=101103.33333333333, ans=0.1
2023-10-09 18:59:40,948 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=101150.0, ans=0.125
2023-10-09 18:59:58,360 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=101196.66666666667, ans=0.125
2023-10-09 18:59:59,488 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=101196.66666666667, ans=0.125
2023-10-09 19:00:04,011 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.86 vs. limit=15.0
2023-10-09 19:00:19,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=101290.0, ans=0.1
2023-10-09 19:00:24,584 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.530e+02 1.970e+02 2.222e+02 2.628e+02 4.102e+02, threshold=4.444e+02, percent-clipped=0.0
2023-10-09 19:00:29,559 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=101336.66666666667, ans=0.125
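The grad_scale value in each batch summary is the current loss-scale of fp16 mixed-precision training: gradients are computed on a scaled loss to avoid underflow, unscaled before the optimizer step, and the scale is raised or lowered over time. A minimal sketch using PyTorch's own AMP utilities follows; the surrounding training loop and the function names are assumptions, not taken from train.py.

```python
# Minimal fp16 training step with dynamic loss scaling; grad_scale in
# the log corresponds to the scaler's current scale. Sketch only.
import torch

scaler = torch.cuda.amp.GradScaler(init_scale=32.0)

def train_step(model, optimizer, features, targets, loss_fn):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(features), targets)
    scaler.scale(loss).backward()  # scaled backward pass
    scaler.step(optimizer)         # unscales grads; skips step on inf/nan
    scaler.update()                # grows/shrinks the scale over time
    return loss.detach()
```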
2023-10-09 19:00:31,677 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.75 vs. limit=22.5
2023-10-09 19:00:36,799 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-09 19:00:41,981 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.92 vs. limit=12.0
2023-10-09 19:00:47,379 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=101430.0, ans=0.125
2023-10-09 19:00:50,793 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=101430.0, ans=0.125
2023-10-09 19:00:51,610 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=101430.0, ans=0.0
2023-10-09 19:01:04,956 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=101476.66666666667, ans=0.125
2023-10-09 19:01:20,047 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=101570.0, ans=0.0
2023-10-09 19:01:24,322 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=101570.0, ans=0.0
2023-10-09 19:01:54,383 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=101710.0, ans=0.0
2023-10-09 19:02:21,860 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.573e+02 2.056e+02 2.325e+02 2.662e+02 3.495e+02, threshold=4.651e+02, percent-clipped=0.0
2023-10-09 19:02:39,012 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.36 vs. limit=12.0
2023-10-09 19:02:49,284 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=101896.66666666667, ans=0.125
2023-10-09 19:03:29,989 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=102036.66666666667, ans=0.0
2023-10-09 19:03:32,758 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=102036.66666666667, ans=0.05
2023-10-09 19:03:36,793 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=102036.66666666667, ans=0.1
2023-10-09 19:03:51,616 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=102083.33333333333, ans=0.1
2023-10-09 19:03:51,674 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=102083.33333333333, ans=0.0
2023-10-09 19:03:59,728 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=102130.0, ans=0.95
2023-10-09 19:04:08,676 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.87 vs. limit=22.5
2023-10-09 19:04:17,059 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.00 vs. limit=15.0
2023-10-09 19:04:27,438 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.534e+02 2.008e+02 2.236e+02 2.608e+02 4.426e+02, threshold=4.472e+02, percent-clipped=0.0
2023-10-09 19:04:45,867 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=102316.66666666667, ans=0.2
2023-10-09 19:04:47,178 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.09 vs. limit=22.5
2023-10-09 19:04:52,679 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=102363.33333333333, ans=0.1
2023-10-09 19:05:00,343 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=102363.33333333333, ans=0.125
2023-10-09 19:05:07,174 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.69 vs. limit=15.0
2023-10-09 19:05:15,692 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=102456.66666666667, ans=0.125
2023-10-09 19:05:36,706 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=102503.33333333333, ans=0.0
2023-10-09 19:05:41,768 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=102550.0, ans=0.0
2023-10-09 19:05:53,229 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=102596.66666666667, ans=0.015
2023-10-09 19:06:22,731 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.663e+02 1.978e+02 2.282e+02 2.706e+02 4.029e+02, threshold=4.564e+02, percent-clipped=0.0
2023-10-09 19:06:23,027 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=102736.66666666667, ans=0.125
2023-10-09 19:06:33,071 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.76 vs. limit=15.0
2023-10-09 19:06:39,459 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=102783.33333333333, ans=0.0
2023-10-09 19:06:58,751 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=102876.66666666667, ans=0.0
2023-10-09 19:06:59,615 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.36 vs. limit=12.0
2023-10-09 19:07:03,541 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=102876.66666666667, ans=0.2
2023-10-09 19:07:06,802 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.08 vs. limit=22.5
2023-10-09 19:07:13,110 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=102923.33333333333, ans=0.1
2023-10-09 19:07:21,489 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=102970.0, ans=0.125
2023-10-09 19:07:23,911 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_ff3.min_abs, batch_count=102970.0, ans=0.2
2023-10-09 19:07:26,720 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=102970.0, ans=0.125
2023-10-09 19:07:31,403 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=102970.0, ans=0.0
2023-10-09 19:07:37,754 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=103016.66666666667, ans=0.0
2023-10-09 19:07:40,587 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.88 vs. limit=10.0
2023-10-09 19:08:00,217 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.88 vs. limit=15.0
2023-10-09 19:08:02,384 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-09 19:08:04,140 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=103110.0, ans=0.125
2023-10-09 19:08:19,272 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.635e+02 2.094e+02 2.434e+02 2.810e+02 4.004e+02, threshold=4.869e+02, percent-clipped=0.0
2023-10-09 19:08:24,906 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=103203.33333333333, ans=0.1
2023-10-09 19:08:31,365 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=103250.0, ans=10.0
2023-10-09 19:08:44,551 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=103296.66666666667, ans=0.025
2023-10-09 19:08:48,014 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=103296.66666666667, ans=0.0
2023-10-09 19:08:52,592 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=103296.66666666667, ans=0.0
2023-10-09 19:08:55,726 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.42 vs. limit=22.5
2023-10-09 19:08:58,442 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=103343.33333333333, ans=0.5
2023-10-09 19:09:07,673 INFO [train.py:1031] (0/4) Epoch 2, batch 8500, loss[loss=0.3611, simple_loss=0.4059, pruned_loss=0.1581, over 16693.00 frames. ], tot_loss[loss=0.2992, simple_loss=0.3657, pruned_loss=0.1163, over 32357938.48 frames. ], batch size: 202, lr: 2.14e-02, grad_scale: 32.0
2023-10-09 19:09:26,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=103436.66666666667, ans=0.1
2023-10-09 19:09:29,972 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=103483.33333333333, ans=0.0
2023-10-09 19:09:30,967 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=103483.33333333333, ans=0.2
2023-10-09 19:09:41,710 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=103530.0, ans=0.125
2023-10-09 19:09:48,055 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=103530.0, ans=0.125
2023-10-09 19:09:50,967 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=103576.66666666667, ans=0.125
2023-10-09 19:09:55,626 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=103576.66666666667, ans=0.2
2023-10-09 19:10:10,872 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.672e+02 2.183e+02 2.488e+02 2.913e+02 3.994e+02, threshold=4.977e+02, percent-clipped=0.0
2023-10-09 19:10:14,174 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.81 vs. limit=15.0
2023-10-09 19:10:14,253 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.84 vs. limit=15.0
2023-10-09 19:10:14,324 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.74 vs. limit=10.0
2023-10-09 19:10:15,342 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=103670.0, ans=0.07
2023-10-09 19:10:30,826 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=103716.66666666667, ans=0.0
2023-10-09 19:10:36,631 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=103763.33333333333, ans=0.1
2023-10-09 19:11:12,873 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.72 vs.
limit=15.0 2023-10-09 19:11:21,306 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=103903.33333333333, ans=0.1 2023-10-09 19:11:34,926 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=103950.0, ans=0.0 2023-10-09 19:11:46,666 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=104043.33333333333, ans=0.0 2023-10-09 19:11:55,223 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=104043.33333333333, ans=0.125 2023-10-09 19:12:15,135 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=104090.0, ans=0.1 2023-10-09 19:12:15,155 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=104090.0, ans=0.1 2023-10-09 19:12:21,005 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=104136.66666666667, ans=0.0 2023-10-09 19:12:21,512 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.904e+02 2.220e+02 2.528e+02 3.906e+02, threshold=4.440e+02, percent-clipped=0.0 2023-10-09 19:13:04,586 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.62 vs. limit=15.0 2023-10-09 19:13:11,511 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=104276.66666666667, ans=0.0 2023-10-09 19:13:30,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=104370.0, ans=0.125 2023-10-09 19:13:32,633 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=104370.0, ans=0.125 2023-10-09 19:13:33,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=104416.66666666667, ans=0.04949747468305833 2023-10-09 19:13:39,212 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=104416.66666666667, ans=0.2 2023-10-09 19:13:47,431 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=104463.33333333333, ans=0.0 2023-10-09 19:13:48,661 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=104463.33333333333, ans=0.2 2023-10-09 19:13:48,849 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.35 vs. limit=10.0 2023-10-09 19:14:04,288 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=104510.0, ans=0.125 2023-10-09 19:14:12,954 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=104556.66666666667, ans=0.1 2023-10-09 19:14:23,074 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.04 vs. 
limit=15.0 2023-10-09 19:14:23,203 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.593e+02 2.086e+02 2.326e+02 2.682e+02 3.738e+02, threshold=4.653e+02, percent-clipped=0.0 2023-10-09 19:14:23,954 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=104603.33333333333, ans=0.1 2023-10-09 19:14:23,982 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=104603.33333333333, ans=0.1 2023-10-09 19:14:27,485 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=104603.33333333333, ans=0.0 2023-10-09 19:14:40,113 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=104650.0, ans=0.1 2023-10-09 19:15:02,197 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=104743.33333333333, ans=0.05 2023-10-09 19:15:02,494 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.43 vs. limit=22.5 2023-10-09 19:15:28,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=104836.66666666667, ans=0.125 2023-10-09 19:15:40,606 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=104883.33333333333, ans=0.1 2023-10-09 19:15:43,406 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=104883.33333333333, ans=0.125 2023-10-09 19:15:47,470 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.67 vs. limit=22.5 2023-10-09 19:15:55,068 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=104930.0, ans=0.125 2023-10-09 19:16:11,011 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 19:16:18,712 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.462e+02 2.027e+02 2.311e+02 2.736e+02 3.819e+02, threshold=4.621e+02, percent-clipped=0.0 2023-10-09 19:16:24,705 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.87 vs. limit=12.0 2023-10-09 19:16:53,843 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=105210.0, ans=0.125 2023-10-09 19:17:07,619 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=105256.66666666667, ans=0.1 2023-10-09 19:17:12,426 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.73 vs. 
limit=22.5 2023-10-09 19:17:43,995 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=105396.66666666667, ans=0.0 2023-10-09 19:18:02,386 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=105490.0, ans=0.125 2023-10-09 19:18:02,424 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=105490.0, ans=0.5 2023-10-09 19:18:05,186 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.72 vs. limit=6.0 2023-10-09 19:18:05,539 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=105490.0, ans=0.1 2023-10-09 19:18:08,747 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.590e+02 2.011e+02 2.291e+02 2.529e+02 4.186e+02, threshold=4.582e+02, percent-clipped=0.0 2023-10-09 19:18:10,905 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=105536.66666666667, ans=0.2 2023-10-09 19:18:11,808 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=105536.66666666667, ans=0.0 2023-10-09 19:18:30,309 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=105583.33333333333, ans=0.0 2023-10-09 19:18:30,389 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=105583.33333333333, ans=0.1 2023-10-09 19:18:52,491 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=105676.66666666667, ans=0.1 2023-10-09 19:18:55,802 INFO [train.py:1031] (0/4) Epoch 2, batch 9000, loss[loss=0.3145, simple_loss=0.3893, pruned_loss=0.1199, over 16784.00 frames. ], tot_loss[loss=0.2977, simple_loss=0.3644, pruned_loss=0.1155, over 32474133.92 frames. ], batch size: 188, lr: 2.12e-02, grad_scale: 64.0 2023-10-09 19:18:56,892 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=105723.33333333333, ans=0.125 2023-10-09 19:18:58,357 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=105723.33333333333, ans=0.125 2023-10-09 19:19:32,733 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.72 vs. limit=15.0 2023-10-09 19:19:32,999 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.03 vs. limit=15.0 2023-10-09 19:19:40,462 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.09 vs. limit=15.0 2023-10-09 19:19:51,316 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=105956.66666666667, ans=0.125 2023-10-09 19:19:52,683 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.13 vs. 
limit=6.0 2023-10-09 19:20:02,051 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.636e+02 2.083e+02 2.370e+02 2.904e+02 4.740e+02, threshold=4.740e+02, percent-clipped=1.0 2023-10-09 19:20:04,773 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=106003.33333333333, ans=0.0 2023-10-09 19:20:18,668 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=106050.0, ans=0.1 2023-10-09 19:20:33,919 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=106143.33333333333, ans=0.0 2023-10-09 19:20:39,413 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=106143.33333333333, ans=0.0 2023-10-09 19:20:53,948 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=106190.0, ans=0.05 2023-10-09 19:21:05,286 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.18 vs. limit=15.0 2023-10-09 19:21:05,807 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=106283.33333333333, ans=0.125 2023-10-09 19:21:15,284 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=106283.33333333333, ans=0.0 2023-10-09 19:21:16,219 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=106283.33333333333, ans=0.125 2023-10-09 19:21:25,050 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=106330.0, ans=0.09899494936611666 2023-10-09 19:21:25,121 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=106330.0, ans=0.125 2023-10-09 19:21:51,784 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.516e+02 2.116e+02 2.519e+02 3.022e+02 5.639e+02, threshold=5.039e+02, percent-clipped=2.0 2023-10-09 19:21:57,066 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=106470.0, ans=0.125 2023-10-09 19:21:59,049 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=106470.0, ans=0.025 2023-10-09 19:22:02,706 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=106516.66666666667, ans=0.04949747468305833 2023-10-09 19:22:12,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=106563.33333333333, ans=0.125 2023-10-09 19:22:13,870 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=106563.33333333333, ans=0.0 2023-10-09 19:22:23,962 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=106610.0, ans=0.125 2023-10-09 19:22:26,559 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=106610.0, ans=0.2 2023-10-09 19:22:43,658 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=106703.33333333333, ans=0.1 2023-10-09 19:22:48,693 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=106703.33333333333, ans=0.125 2023-10-09 19:22:49,691 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=106703.33333333333, ans=0.125 2023-10-09 19:22:57,681 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=106750.0, ans=0.125 2023-10-09 19:23:07,857 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=106796.66666666667, ans=0.125 2023-10-09 19:23:40,956 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.508e+02 2.018e+02 2.303e+02 2.777e+02 4.119e+02, threshold=4.606e+02, percent-clipped=0.0 2023-10-09 19:23:55,938 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.91 vs. limit=10.0 2023-10-09 19:23:56,521 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=106983.33333333333, ans=0.125 2023-10-09 19:23:57,238 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=106983.33333333333, ans=0.125 2023-10-09 19:23:58,492 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=106983.33333333333, ans=0.2 2023-10-09 19:24:04,617 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=107030.0, ans=0.1 2023-10-09 19:24:23,361 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=107123.33333333333, ans=0.1 2023-10-09 19:24:40,504 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=107170.0, ans=0.125 2023-10-09 19:24:42,208 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=107170.0, ans=0.0 2023-10-09 19:24:45,159 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=107216.66666666667, ans=0.2 2023-10-09 19:24:47,854 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=107216.66666666667, ans=0.125 2023-10-09 19:24:56,727 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=107263.33333333333, ans=0.0 2023-10-09 19:24:57,053 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.05 vs. limit=10.0 2023-10-09 19:25:15,199 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=107310.0, ans=0.125 2023-10-09 19:25:21,252 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=107310.0, ans=0.125 2023-10-09 19:25:25,521 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.96 vs. 
limit=15.0 2023-10-09 19:25:36,410 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.65 vs. limit=15.0 2023-10-09 19:25:38,758 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.517e+02 2.245e+02 2.529e+02 2.814e+02 4.044e+02, threshold=5.058e+02, percent-clipped=0.0 2023-10-09 19:25:51,357 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=107403.33333333333, ans=0.1 2023-10-09 19:25:58,651 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.54 vs. limit=6.0 2023-10-09 19:26:01,152 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=107450.0, ans=0.1 2023-10-09 19:26:16,849 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.79 vs. limit=10.0 2023-10-09 19:26:23,243 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=107543.33333333333, ans=0.125 2023-10-09 19:26:35,894 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=107590.0, ans=0.125 2023-10-09 19:26:44,084 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=107636.66666666667, ans=0.2 2023-10-09 19:26:52,724 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=107683.33333333333, ans=0.0 2023-10-09 19:27:21,674 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=107776.66666666667, ans=0.09899494936611666 2023-10-09 19:27:26,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=107776.66666666667, ans=0.1 2023-10-09 19:27:39,449 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=107870.0, ans=0.1 2023-10-09 19:27:40,089 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.490e+02 1.999e+02 2.255e+02 2.541e+02 3.520e+02, threshold=4.511e+02, percent-clipped=0.0 2023-10-09 19:28:00,998 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=107916.66666666667, ans=0.125 2023-10-09 19:28:19,047 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.47 vs. limit=15.0 2023-10-09 19:28:28,675 INFO [train.py:1031] (0/4) Epoch 2, batch 9500, loss[loss=0.2855, simple_loss=0.3642, pruned_loss=0.1034, over 16819.00 frames. ], tot_loss[loss=0.2976, simple_loss=0.3645, pruned_loss=0.1154, over 32541804.31 frames. ], batch size: 146, lr: 2.10e-02, grad_scale: 32.0
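
Editor's note: the train.py:1031 report just above prints three per-frame quantities. Throughout this excerpt the printed loss is consistent with 0.5 * simple_loss + pruned_loss (for batch 9500: 0.5 * 0.3642 + 0.1034 = 0.2855, and for the running totals 0.5 * 0.3645 + 0.1154 = 0.2977, matching the printed 0.2976 up to display rounding), and tot_loss[... over N frames] behaves like a frame-weighted running average. Below is a minimal sketch of that arithmetic; the 0.5 weighting is inferred from the printed numbers, since the actual combination is computed in train.py, which this log does not show.

```python
# Minimal sketch, assuming loss = 0.5 * simple_loss + pruned_loss as inferred
# from the printed numbers in this excerpt (the real combination lives in
# train.py and is not shown in this log).
SIMPLE_LOSS_SCALE = 0.5  # inferred from the log, not read from any config

def combined_loss(simple_loss: float, pruned_loss: float) -> float:
    """Per-frame loss as printed in the loss[...] field."""
    return SIMPLE_LOSS_SCALE * simple_loss + pruned_loss

def update_tot_loss(avg: float, frames: float,
                    batch_avg: float, batch_frames: float) -> tuple[float, float]:
    """Frame-weighted running average; one plausible reading of the
    tot_loss[... over N frames] field."""
    total = frames + batch_frames
    return (avg * frames + batch_avg * batch_frames) / total, total

# Batch 9500 report from the entry above:
assert round(combined_loss(0.3642, 0.1034), 4) == 0.2855
```
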
2023-10-09 19:28:53,224 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=108150.0, ans=0.1 2023-10-09 19:29:38,567 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 2.174e+02 2.595e+02 3.130e+02 5.286e+02, threshold=5.190e+02, percent-clipped=1.0 2023-10-09 19:29:42,631 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=108336.66666666667, ans=0.125 2023-10-09 19:30:03,550 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.38 vs. limit=15.0 2023-10-09 19:30:35,327 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=108570.0, ans=0.0 2023-10-09 19:30:50,811 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=108616.66666666667, ans=0.0 2023-10-09 19:30:57,982 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=108663.33333333333, ans=0.125 2023-10-09 19:31:00,183 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.75 vs. limit=15.0 2023-10-09 19:31:08,214 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=108710.0, ans=0.125 2023-10-09 19:31:14,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=108710.0, ans=0.1 2023-10-09 19:31:16,694 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.51 vs.
limit=15.0 2023-10-09 19:31:22,677 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=108756.66666666667, ans=0.125 2023-10-09 19:31:31,814 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.522e+02 2.017e+02 2.247e+02 2.693e+02 3.556e+02, threshold=4.494e+02, percent-clipped=0.0 2023-10-09 19:31:32,315 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=108803.33333333333, ans=0.0 2023-10-09 19:31:40,487 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=108803.33333333333, ans=0.0 2023-10-09 19:31:51,225 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=108850.0, ans=0.125 2023-10-09 19:31:59,858 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=108896.66666666667, ans=0.2 2023-10-09 19:32:13,896 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=108943.33333333333, ans=0.0 2023-10-09 19:32:28,630 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=109036.66666666667, ans=0.1 2023-10-09 19:32:32,416 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=109036.66666666667, ans=0.2 2023-10-09 19:32:34,445 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=109036.66666666667, ans=0.0 2023-10-09 19:32:52,623 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.41 vs. limit=22.5 2023-10-09 19:33:04,884 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.46 vs. limit=15.0 2023-10-09 19:33:06,587 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=109176.66666666667, ans=0.025 2023-10-09 19:33:22,921 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.578e+02 1.995e+02 2.294e+02 2.600e+02 3.373e+02, threshold=4.588e+02, percent-clipped=0.0 2023-10-09 19:34:21,393 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.01 vs. limit=15.0 2023-10-09 19:34:31,427 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=109550.0, ans=0.125 2023-10-09 19:34:31,672 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.14 vs. 
limit=22.5 2023-10-09 19:34:36,939 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=109550.0, ans=0.07 2023-10-09 19:34:50,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=109596.66666666667, ans=0.125 2023-10-09 19:34:54,666 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.92 vs. limit=12.0 2023-10-09 19:34:57,280 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=109643.33333333333, ans=0.125 2023-10-09 19:35:15,852 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.626e+02 2.071e+02 2.274e+02 2.604e+02 3.820e+02, threshold=4.549e+02, percent-clipped=0.0 2023-10-09 19:35:23,575 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-09 19:35:30,047 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=109783.33333333333, ans=0.0 2023-10-09 19:35:35,579 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=109830.0, ans=0.125 2023-10-09 19:35:36,128 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.85 vs. limit=12.0 2023-10-09 19:35:53,302 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=109876.66666666667, ans=0.125 2023-10-09 19:35:57,339 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=109876.66666666667, ans=0.0 2023-10-09 19:36:01,189 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=109923.33333333333, ans=0.125 2023-10-09 19:36:09,906 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=109923.33333333333, ans=0.2 2023-10-09 19:36:16,313 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=109970.0, ans=0.125 2023-10-09 19:36:17,237 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=109970.0, ans=0.125 2023-10-09 19:36:21,227 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.14 vs. limit=15.0 2023-10-09 19:36:33,901 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.99 vs. 
limit=15.0 2023-10-09 19:36:43,666 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=110110.0, ans=0.1 2023-10-09 19:36:57,800 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=110156.66666666667, ans=0.5 2023-10-09 19:37:07,038 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.566e+02 1.892e+02 2.133e+02 2.333e+02 3.209e+02, threshold=4.267e+02, percent-clipped=0.0 2023-10-09 19:37:09,737 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=110203.33333333333, ans=0.0 2023-10-09 19:37:16,258 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=110250.0, ans=0.125 2023-10-09 19:37:43,759 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=110343.33333333333, ans=0.0 2023-10-09 19:37:48,357 INFO [train.py:1031] (0/4) Epoch 2, batch 10000, loss[loss=0.3117, simple_loss=0.3683, pruned_loss=0.1276, over 16800.00 frames. ], tot_loss[loss=0.2954, simple_loss=0.3624, pruned_loss=0.1142, over 32551294.90 frames. ], batch size: 146, lr: 2.08e-02, grad_scale: 32.0 2023-10-09 19:38:23,620 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=110530.0, ans=0.0 2023-10-09 19:38:52,839 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=110670.0, ans=0.0 2023-10-09 19:38:56,078 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.705e+02 2.126e+02 2.323e+02 2.635e+02 3.788e+02, threshold=4.646e+02, percent-clipped=0.0 2023-10-09 19:39:16,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=110716.66666666667, ans=0.0 2023-10-09 19:39:23,086 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=110763.33333333333, ans=0.125 2023-10-09 19:39:26,231 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=110763.33333333333, ans=0.125 2023-10-09 19:39:44,171 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=110856.66666666667, ans=0.125 2023-10-09 19:40:12,239 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=110950.0, ans=15.0 2023-10-09 19:40:14,546 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=110950.0, ans=0.125 2023-10-09 19:40:33,581 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=111043.33333333333, ans=0.125 2023-10-09 19:40:36,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=111043.33333333333, ans=0.0 2023-10-09 19:40:46,907 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_na.min_abs, batch_count=111090.0, ans=0.02 2023-10-09 19:40:50,253 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.609e+02 2.043e+02 2.430e+02 2.863e+02 4.874e+02, threshold=4.859e+02, percent-clipped=1.0
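
Editor's note: the optim.py:471 entries log the quartiles (min/25%/50%/75%/max) of recent gradient norms together with the clipping threshold. Across this excerpt the threshold tracks Clipping_scale times the median norm (in the entry just above, 2.0 * 2.430e+02 = 4.860e+02, matching the printed 4.859e+02 up to rounding of the displayed quartiles), and percent-clipped is the fraction of recent updates whose norm exceeded it. A sketch under that observed relationship follows; it is illustrative only, since the actual logic lives in icefall's optim.py and may differ in detail.

```python
# Sketch of median-based gradient clipping consistent with the logged numbers;
# not the actual icefall implementation.
import torch

def clip_by_median(grad: torch.Tensor, recent_norms: torch.Tensor,
                   clipping_scale: float = 2.0) -> torch.Tensor:
    # Quartiles of recently observed gradient norms, as printed in the log.
    quartiles = torch.quantile(recent_norms,
                               torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * quartiles[2]  # scale * median grad norm
    norm = grad.norm()
    if norm > threshold:                       # counted toward percent-clipped
        grad = grad * (threshold / norm)       # rescale down to the threshold
    return grad
```
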
2023-10-09 19:41:04,104 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=111183.33333333333, ans=0.0 2023-10-09 19:41:21,129 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.41 vs. limit=15.0 2023-10-09 19:41:31,057 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff2.min_abs, batch_count=111323.33333333333, ans=0.1 2023-10-09 19:41:40,224 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=111323.33333333333, ans=0.125 2023-10-09 19:41:43,655 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=111323.33333333333, ans=0.1 2023-10-09 19:41:58,884 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=34.67 vs. limit=15.0 2023-10-09 19:42:05,755 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=111416.66666666667, ans=0.0 2023-10-09 19:42:07,355 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=111416.66666666667, ans=0.125 2023-10-09 19:42:10,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=111463.33333333333, ans=0.09899494936611666 2023-10-09 19:42:19,719 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=111510.0, ans=0.1 2023-10-09 19:42:44,226 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.644e+02 2.080e+02 2.350e+02 2.749e+02 4.092e+02, threshold=4.700e+02, percent-clipped=0.0 2023-10-09 19:42:44,804 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.90 vs. limit=22.5 2023-10-09 19:43:02,493 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.71 vs. limit=6.0 2023-10-09 19:43:18,226 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=111743.33333333333, ans=0.0 2023-10-09 19:43:21,335 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.99 vs. limit=15.0 2023-10-09 19:44:08,634 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=111930.0, ans=0.2 2023-10-09 19:44:20,679 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-24000.pt 2023-10-09 19:44:24,120 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=111976.66666666667, ans=0.2 2023-10-09 19:44:38,464 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.25 vs.
limit=22.5 2023-10-09 19:44:40,173 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=112023.33333333333, ans=0.0 2023-10-09 19:44:44,747 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.624e+02 1.935e+02 2.314e+02 2.574e+02 4.222e+02, threshold=4.628e+02, percent-clipped=0.0 2023-10-09 19:44:45,216 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=112070.0, ans=0.5 2023-10-09 19:45:10,587 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=112163.33333333333, ans=0.2 2023-10-09 19:45:20,455 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=112210.0, ans=0.125 2023-10-09 19:45:38,010 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=112256.66666666667, ans=0.125 2023-10-09 19:45:45,539 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=112303.33333333333, ans=0.0 2023-10-09 19:45:52,612 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=112350.0, ans=0.125 2023-10-09 19:45:57,484 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-09 19:46:32,173 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=112490.0, ans=0.125 2023-10-09 19:46:41,704 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.637e+02 2.102e+02 2.406e+02 2.971e+02 5.102e+02, threshold=4.812e+02, percent-clipped=2.0 2023-10-09 19:46:50,087 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=112583.33333333333, ans=0.0 2023-10-09 19:47:02,099 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=112630.0, ans=0.0 2023-10-09 19:47:08,188 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=112630.0, ans=0.0 2023-10-09 19:47:20,435 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=112676.66666666667, ans=0.0 2023-10-09 19:47:22,777 INFO [train.py:1031] (0/4) Epoch 2, batch 10500, loss[loss=0.2887, simple_loss=0.3665, pruned_loss=0.1054, over 16865.00 frames. ], tot_loss[loss=0.2947, simple_loss=0.362, pruned_loss=0.1137, over 32553130.12 frames. ], batch size: 104, lr: 2.06e-02, grad_scale: 32.0 2023-10-09 19:47:40,651 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=112770.0, ans=0.0 2023-10-09 19:47:41,826 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.06 vs. 
limit=22.5 2023-10-09 19:48:07,868 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=112910.0, ans=0.015 2023-10-09 19:48:12,130 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=112910.0, ans=0.07 2023-10-09 19:48:13,902 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=112910.0, ans=0.0 2023-10-09 19:48:16,506 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=112956.66666666667, ans=15.0 2023-10-09 19:48:20,465 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.36 vs. limit=15.0 2023-10-09 19:48:28,999 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.69 vs. limit=22.5 2023-10-09 19:48:30,987 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.559e+02 2.013e+02 2.301e+02 2.879e+02 4.795e+02, threshold=4.602e+02, percent-clipped=0.0 2023-10-09 19:48:31,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=113003.33333333333, ans=0.1 2023-10-09 19:48:34,468 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=113003.33333333333, ans=0.1 2023-10-09 19:48:59,602 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.42 vs. limit=22.5 2023-10-09 19:49:03,275 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.50 vs. limit=15.0 2023-10-09 19:49:09,396 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.16 vs. limit=15.0 2023-10-09 19:49:17,497 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=113143.33333333333, ans=0.125 2023-10-09 19:49:30,374 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=113236.66666666667, ans=0.1 2023-10-09 19:49:37,246 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=113236.66666666667, ans=0.125 2023-10-09 19:49:58,555 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=113330.0, ans=0.0 2023-10-09 19:50:11,652 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=113376.66666666667, ans=0.025 2023-10-09 19:50:16,931 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.25 vs. 
limit=22.5 2023-10-09 19:50:19,532 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=113423.33333333333, ans=0.2 2023-10-09 19:50:28,271 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.586e+02 2.095e+02 2.369e+02 2.692e+02 5.169e+02, threshold=4.738e+02, percent-clipped=1.0 2023-10-09 19:50:42,758 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=113516.66666666667, ans=0.125 2023-10-09 19:50:46,505 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.52 vs. limit=15.0 2023-10-09 19:50:54,730 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=113563.33333333333, ans=0.0 2023-10-09 19:50:59,742 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=113563.33333333333, ans=0.125 2023-10-09 19:51:13,978 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.84 vs. limit=15.0 2023-10-09 19:51:32,400 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.15 vs. limit=12.0 2023-10-09 19:51:43,134 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=113750.0, ans=0.0 2023-10-09 19:51:47,339 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=113796.66666666667, ans=0.07 2023-10-09 19:51:54,357 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.00 vs. limit=15.0 2023-10-09 19:51:59,534 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=113843.33333333333, ans=0.0 2023-10-09 19:52:03,987 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=113843.33333333333, ans=0.125 2023-10-09 19:52:06,896 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=113843.33333333333, ans=0.125 2023-10-09 19:52:21,750 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.656e+02 2.067e+02 2.328e+02 2.705e+02 6.123e+02, threshold=4.656e+02, percent-clipped=1.0 2023-10-09 19:52:38,316 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.27 vs. limit=15.0 2023-10-09 19:52:39,183 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.52 vs. 
limit=15.0 2023-10-09 19:52:44,662 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=114030.0, ans=0.1 2023-10-09 19:53:29,161 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=114216.66666666667, ans=0.125 2023-10-09 19:53:51,954 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=114310.0, ans=0.125 2023-10-09 19:53:56,658 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.50 vs. limit=15.0 2023-10-09 19:53:57,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=114356.66666666667, ans=6.0 2023-10-09 19:54:04,059 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=114356.66666666667, ans=0.2 2023-10-09 19:54:09,543 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.704e+02 2.018e+02 2.376e+02 2.691e+02 4.537e+02, threshold=4.752e+02, percent-clipped=0.0 2023-10-09 19:54:11,484 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=114403.33333333333, ans=0.0 2023-10-09 19:54:27,760 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.38 vs. limit=15.0 2023-10-09 19:54:39,205 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=114496.66666666667, ans=0.1 2023-10-09 19:55:06,006 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.57 vs. limit=15.0 2023-10-09 19:55:45,208 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=114823.33333333333, ans=0.125 2023-10-09 19:55:53,450 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=114823.33333333333, ans=0.125 2023-10-09 19:55:57,351 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.536e+02 1.956e+02 2.232e+02 2.602e+02 3.731e+02, threshold=4.465e+02, percent-clipped=0.0 2023-10-09 19:56:00,360 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.32 vs. limit=15.0 2023-10-09 19:56:15,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=114963.33333333333, ans=0.0 2023-10-09 19:56:19,599 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=114963.33333333333, ans=10.0 2023-10-09 19:56:33,188 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=115010.0, ans=0.0 2023-10-09 19:56:37,464 INFO [train.py:1031] (0/4) Epoch 2, batch 11000, loss[loss=0.3134, simple_loss=0.3691, pruned_loss=0.1289, over 15418.00 frames. ], tot_loss[loss=0.2941, simple_loss=0.3616, pruned_loss=0.1133, over 32604090.13 frames. 
], batch size: 35, lr: 2.04e-02, grad_scale: 32.0 2023-10-09 19:57:37,901 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.14 vs. limit=15.0 2023-10-09 19:57:45,343 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=115336.66666666667, ans=0.0 2023-10-09 19:57:49,634 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.803e+02 2.272e+02 2.627e+02 3.107e+02 4.248e+02, threshold=5.254e+02, percent-clipped=0.0 2023-10-09 19:57:57,186 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=115383.33333333333, ans=0.125 2023-10-09 19:58:02,937 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=115383.33333333333, ans=0.125 2023-10-09 19:58:10,408 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=115430.0, ans=0.0 2023-10-09 19:58:40,525 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=115523.33333333333, ans=0.1 2023-10-09 19:58:51,298 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=115570.0, ans=0.05 2023-10-09 19:59:01,243 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=115616.66666666667, ans=0.125 2023-10-09 19:59:01,352 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=115616.66666666667, ans=0.125 2023-10-09 19:59:04,437 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.31 vs. limit=15.0 2023-10-09 19:59:09,820 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=115663.33333333333, ans=0.1 2023-10-09 19:59:39,591 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.95 vs. 
limit=22.5 2023-10-09 19:59:49,311 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=115803.33333333333, ans=0.125 2023-10-09 19:59:49,870 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.515e+02 2.048e+02 2.276e+02 2.586e+02 3.586e+02, threshold=4.553e+02, percent-clipped=0.0 2023-10-09 19:59:56,134 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=115850.0, ans=0.125 2023-10-09 20:00:00,630 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 20:00:00,734 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=115850.0, ans=0.125 2023-10-09 20:00:05,813 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=115850.0, ans=0.2 2023-10-09 20:00:20,202 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=115943.33333333333, ans=0.125 2023-10-09 20:00:23,482 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=115943.33333333333, ans=0.0 2023-10-09 20:00:24,916 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=24.18 vs. limit=22.5 2023-10-09 20:00:25,600 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=16.86 vs. limit=15.0 2023-10-09 20:00:37,727 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=115990.0, ans=0.0 2023-10-09 20:00:47,982 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=116036.66666666667, ans=0.0 2023-10-09 20:00:58,736 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=116083.33333333333, ans=0.125 2023-10-09 20:01:01,592 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=116130.0, ans=0.125 2023-10-09 20:01:26,009 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=116223.33333333333, ans=0.05 2023-10-09 20:01:36,853 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.567e+02 2.054e+02 2.344e+02 2.826e+02 4.170e+02, threshold=4.688e+02, percent-clipped=0.0 2023-10-09 20:01:37,672 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=116270.0, ans=0.125 2023-10-09 20:01:53,616 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=116316.66666666667, ans=0.125 2023-10-09 20:02:02,325 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=116363.33333333333, ans=0.125 2023-10-09 20:02:08,956 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=116363.33333333333, ans=0.0 2023-10-09 20:02:15,726 INFO [scaling.py:199] (0/4) 
ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=116410.0, ans=0.2 2023-10-09 20:02:20,805 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=116410.0, ans=0.1 2023-10-09 20:02:23,065 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 20:02:31,675 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=116456.66666666667, ans=0.0 2023-10-09 20:02:35,056 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=116456.66666666667, ans=0.125 2023-10-09 20:02:44,604 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.64 vs. limit=6.0 2023-10-09 20:02:47,640 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=116550.0, ans=0.125 2023-10-09 20:02:57,055 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=116550.0, ans=0.1 2023-10-09 20:03:15,111 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.16 vs. limit=12.0 2023-10-09 20:03:17,098 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=116643.33333333333, ans=0.125 2023-10-09 20:03:18,315 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=116643.33333333333, ans=0.2 2023-10-09 20:03:31,340 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 20:03:37,026 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=116736.66666666667, ans=0.0 2023-10-09 20:03:37,686 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.591e+02 1.968e+02 2.234e+02 2.630e+02 3.849e+02, threshold=4.469e+02, percent-clipped=0.0 2023-10-09 20:03:52,234 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=116783.33333333333, ans=0.125 2023-10-09 20:03:52,244 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=116783.33333333333, ans=0.0 2023-10-09 20:03:56,203 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=116830.0, ans=0.125 2023-10-09 20:03:56,289 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=116830.0, ans=0.125 2023-10-09 20:03:58,242 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=116830.0, ans=0.125 2023-10-09 20:03:58,591 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.38 vs. 
limit=10.0 2023-10-09 20:05:08,521 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=117110.0, ans=0.0 2023-10-09 20:05:16,614 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=117110.0, ans=0.05 2023-10-09 20:05:26,509 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.81 vs. limit=12.0 2023-10-09 20:05:29,489 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=117156.66666666667, ans=10.0 2023-10-09 20:05:33,139 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=117203.33333333333, ans=0.0 2023-10-09 20:05:34,705 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.743e+02 2.222e+02 2.512e+02 2.981e+02 4.436e+02, threshold=5.024e+02, percent-clipped=0.0 2023-10-09 20:05:36,042 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=117203.33333333333, ans=0.1 2023-10-09 20:05:58,369 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.96 vs. limit=15.0 2023-10-09 20:06:05,383 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=117296.66666666667, ans=15.0 2023-10-09 20:06:07,070 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=117343.33333333333, ans=0.125 2023-10-09 20:06:07,167 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=117343.33333333333, ans=0.1 2023-10-09 20:06:08,374 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=117343.33333333333, ans=0.0 2023-10-09 20:06:21,496 INFO [train.py:1031] (0/4) Epoch 2, batch 11500, loss[loss=0.2739, simple_loss=0.3503, pruned_loss=0.09877, over 16547.00 frames. ], tot_loss[loss=0.2928, simple_loss=0.3606, pruned_loss=0.1125, over 32673078.21 frames. ], batch size: 56, lr: 2.02e-02, grad_scale: 16.0 2023-10-09 20:06:24,665 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=117390.0, ans=0.125 2023-10-09 20:06:25,678 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=117390.0, ans=0.0 2023-10-09 20:06:26,695 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=117390.0, ans=0.1 2023-10-09 20:07:18,395 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.74 vs. 
limit=15.0 2023-10-09 20:07:21,110 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=117576.66666666667, ans=0.125 2023-10-09 20:07:29,465 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=117623.33333333333, ans=0.125 2023-10-09 20:07:34,519 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.46 vs. limit=15.0 2023-10-09 20:07:39,145 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.538e+02 2.136e+02 2.363e+02 2.771e+02 4.950e+02, threshold=4.727e+02, percent-clipped=0.0 2023-10-09 20:07:43,839 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=117670.0, ans=0.07 2023-10-09 20:07:47,535 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=117716.66666666667, ans=0.1 2023-10-09 20:08:12,564 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=117810.0, ans=0.0 2023-10-09 20:08:45,137 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=117903.33333333333, ans=0.0 2023-10-09 20:08:52,066 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.27 vs. limit=6.0 2023-10-09 20:09:07,627 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=117996.66666666667, ans=0.125 2023-10-09 20:09:21,430 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=118090.0, ans=0.0 2023-10-09 20:09:35,829 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.553e+02 1.974e+02 2.221e+02 2.547e+02 3.826e+02, threshold=4.443e+02, percent-clipped=0.0 2023-10-09 20:09:36,510 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=118136.66666666667, ans=0.2 2023-10-09 20:09:42,959 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=118183.33333333333, ans=0.125 2023-10-09 20:09:45,062 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=118183.33333333333, ans=0.0 2023-10-09 20:09:49,292 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=118183.33333333333, ans=0.0 2023-10-09 20:09:50,256 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=118183.33333333333, ans=0.0 2023-10-09 20:09:51,385 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.17 vs. 
limit=15.0 2023-10-09 20:10:05,274 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=118276.66666666667, ans=0.0 2023-10-09 20:10:14,788 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=118323.33333333333, ans=0.5 2023-10-09 20:10:14,926 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=118323.33333333333, ans=0.125 2023-10-09 20:10:23,395 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=118323.33333333333, ans=0.125 2023-10-09 20:10:29,882 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=118370.0, ans=0.1 2023-10-09 20:10:30,723 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=118370.0, ans=0.1 2023-10-09 20:10:51,471 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=118463.33333333333, ans=0.2 2023-10-09 20:11:08,447 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=118510.0, ans=15.0 2023-10-09 20:11:18,205 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=118556.66666666667, ans=0.125 2023-10-09 20:11:21,631 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=118556.66666666667, ans=0.09899494936611666 2023-10-09 20:11:23,028 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.32 vs. limit=15.0 2023-10-09 20:11:37,892 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.667e+02 2.082e+02 2.308e+02 2.745e+02 4.017e+02, threshold=4.616e+02, percent-clipped=0.0 2023-10-09 20:12:00,417 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 20:12:15,040 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.49 vs. limit=22.5 2023-10-09 20:13:10,414 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=118930.0, ans=0.125 2023-10-09 20:13:12,072 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=118930.0, ans=0.125 2023-10-09 20:13:16,735 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=118976.66666666667, ans=0.05 2023-10-09 20:13:18,781 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=118976.66666666667, ans=0.0 2023-10-09 20:13:33,515 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.07 vs. 
limit=15.0 2023-10-09 20:13:44,223 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.538e+02 2.042e+02 2.316e+02 2.818e+02 4.963e+02, threshold=4.633e+02, percent-clipped=2.0 2023-10-09 20:14:03,643 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.71 vs. limit=15.0 2023-10-09 20:14:04,477 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=119163.33333333333, ans=0.125 2023-10-09 20:14:06,136 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=119163.33333333333, ans=0.1 2023-10-09 20:14:14,545 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=119163.33333333333, ans=0.2 2023-10-09 20:14:21,733 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.06 vs. limit=22.5 2023-10-09 20:14:31,911 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=119256.66666666667, ans=0.04949747468305833 2023-10-09 20:15:01,662 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=119350.0, ans=0.2 2023-10-09 20:15:04,548 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=119396.66666666667, ans=0.0 2023-10-09 20:15:13,132 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=119396.66666666667, ans=0.125 2023-10-09 20:15:31,948 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=119490.0, ans=0.0 2023-10-09 20:15:31,984 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=119490.0, ans=0.0 2023-10-09 20:15:38,339 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=119490.0, ans=0.125 2023-10-09 20:15:44,902 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 2.049e+02 2.330e+02 2.616e+02 3.748e+02, threshold=4.659e+02, percent-clipped=0.0 2023-10-09 20:15:50,910 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=119583.33333333333, ans=0.2 2023-10-09 20:15:59,294 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=119583.33333333333, ans=0.1 2023-10-09 20:16:24,578 INFO [train.py:1031] (0/4) Epoch 2, batch 12000, loss[loss=0.3025, simple_loss=0.3755, pruned_loss=0.1148, over 16913.00 frames. ], tot_loss[loss=0.292, simple_loss=0.3602, pruned_loss=0.1119, over 32701338.75 frames. ], batch size: 138, lr: 2.00e-02, grad_scale: 32.0 2023-10-09 20:16:35,573 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=119723.33333333333, ans=0.125 2023-10-09 20:16:37,876 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.33 vs. 
limit=12.0 2023-10-09 20:16:57,235 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 20:17:07,287 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 20:17:09,919 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=119910.0, ans=0.125 2023-10-09 20:17:14,304 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=119910.0, ans=0.1 2023-10-09 20:17:21,816 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=119910.0, ans=0.0 2023-10-09 20:17:38,748 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.66 vs. limit=15.0 2023-10-09 20:17:39,064 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.553e+02 2.021e+02 2.208e+02 2.660e+02 4.127e+02, threshold=4.416e+02, percent-clipped=0.0 2023-10-09 20:17:51,794 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-10-09 20:18:06,535 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=120096.66666666667, ans=0.0 2023-10-09 20:18:09,981 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=120143.33333333333, ans=0.125 2023-10-09 20:18:30,337 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=120190.0, ans=0.125 2023-10-09 20:18:32,333 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=120236.66666666667, ans=0.07 2023-10-09 20:18:42,782 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=120236.66666666667, ans=0.1 2023-10-09 20:18:43,782 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=120283.33333333333, ans=0.125 2023-10-09 20:18:50,492 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.65 vs. limit=15.0 2023-10-09 20:18:59,903 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=120330.0, ans=0.04949747468305833 2023-10-09 20:19:00,900 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.07 vs. 
limit=15.0 2023-10-09 20:19:35,662 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.634e+02 1.934e+02 2.225e+02 2.573e+02 3.904e+02, threshold=4.450e+02, percent-clipped=0.0 2023-10-09 20:19:58,056 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=120563.33333333333, ans=0.07 2023-10-09 20:19:59,080 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=120563.33333333333, ans=0.125 2023-10-09 20:20:19,090 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=120656.66666666667, ans=0.09899494936611666 2023-10-09 20:20:20,909 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=120656.66666666667, ans=0.0 2023-10-09 20:20:22,968 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=120656.66666666667, ans=0.0 2023-10-09 20:20:25,612 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=120703.33333333333, ans=0.125 2023-10-09 20:20:44,303 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.14 vs. limit=15.0 2023-10-09 20:20:45,962 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=120750.0, ans=0.125 2023-10-09 20:20:46,969 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=120750.0, ans=0.07 2023-10-09 20:20:57,480 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.95 vs. limit=22.5 2023-10-09 20:21:27,211 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.671e+02 2.151e+02 2.507e+02 3.017e+02 4.295e+02, threshold=5.014e+02, percent-clipped=0.0 2023-10-09 20:21:30,637 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.01 vs. limit=15.0 2023-10-09 20:21:31,042 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=120936.66666666667, ans=0.125 2023-10-09 20:21:52,848 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=121030.0, ans=0.0 2023-10-09 20:22:10,400 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=121123.33333333333, ans=0.0 2023-10-09 20:22:16,901 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 20:22:29,453 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.90 vs. 
limit=15.0 2023-10-09 20:22:40,226 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=121263.33333333333, ans=0.125 2023-10-09 20:22:41,048 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=121263.33333333333, ans=0.05 2023-10-09 20:23:10,379 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=121356.66666666667, ans=0.125 2023-10-09 20:23:14,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=121356.66666666667, ans=0.95 2023-10-09 20:23:19,728 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=121403.33333333333, ans=0.0 2023-10-09 20:23:23,557 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.667e+02 1.964e+02 2.147e+02 2.424e+02 3.507e+02, threshold=4.294e+02, percent-clipped=0.0 2023-10-09 20:23:43,487 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=121496.66666666667, ans=0.1 2023-10-09 20:23:49,914 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=121496.66666666667, ans=0.125 2023-10-09 20:24:06,591 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=121590.0, ans=0.0 2023-10-09 20:24:07,551 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-09 20:24:30,016 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=121683.33333333333, ans=0.125 2023-10-09 20:24:54,972 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-10-09 20:24:56,018 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.29 vs. limit=12.0 2023-10-09 20:24:57,291 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.89 vs. limit=22.5 2023-10-09 20:25:00,988 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=121776.66666666667, ans=0.5 2023-10-09 20:25:10,408 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.64 vs. 
limit=10.0 2023-10-09 20:25:17,090 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=121870.0, ans=0.0 2023-10-09 20:25:19,648 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=121870.0, ans=0.1 2023-10-09 20:25:22,857 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.609e+02 2.118e+02 2.431e+02 2.818e+02 3.970e+02, threshold=4.862e+02, percent-clipped=0.0 2023-10-09 20:25:40,523 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=121963.33333333333, ans=0.0 2023-10-09 20:25:43,999 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=121963.33333333333, ans=0.125 2023-10-09 20:25:45,398 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.15 vs. limit=10.0 2023-10-09 20:26:04,607 INFO [train.py:1031] (0/4) Epoch 2, batch 12500, loss[loss=0.2989, simple_loss=0.3671, pruned_loss=0.1154, over 16896.00 frames. ], tot_loss[loss=0.2909, simple_loss=0.3593, pruned_loss=0.1113, over 32738836.50 frames. ], batch size: 72, lr: 1.99e-02, grad_scale: 32.0 2023-10-09 20:26:18,390 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=122103.33333333333, ans=0.015 2023-10-09 20:26:41,381 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.28 vs. limit=12.0 2023-10-09 20:27:14,098 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.585e+02 1.988e+02 2.265e+02 2.583e+02 3.799e+02, threshold=4.530e+02, percent-clipped=0.0 2023-10-09 20:27:16,715 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.34 vs. 
limit=15.0 2023-10-09 20:27:26,315 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=122383.33333333333, ans=0.0 2023-10-09 20:27:53,502 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 20:27:53,529 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=122476.66666666667, ans=0.125 2023-10-09 20:28:38,428 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=122663.33333333333, ans=0.0 2023-10-09 20:28:55,428 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=122756.66666666667, ans=0.1 2023-10-09 20:29:02,848 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=122756.66666666667, ans=0.0 2023-10-09 20:29:05,768 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=122756.66666666667, ans=0.2 2023-10-09 20:29:09,000 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=122803.33333333333, ans=0.0 2023-10-09 20:29:13,066 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.586e+02 1.990e+02 2.305e+02 2.591e+02 3.767e+02, threshold=4.610e+02, percent-clipped=0.0 2023-10-09 20:29:20,842 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=122850.0, ans=0.125 2023-10-09 20:29:34,203 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=122896.66666666667, ans=0.125 2023-10-09 20:29:47,838 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=122943.33333333333, ans=0.95 2023-10-09 20:30:01,314 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.15 vs. limit=22.5 2023-10-09 20:30:04,469 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=123036.66666666667, ans=0.0 2023-10-09 20:30:06,683 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 20:30:20,461 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.09 vs. 
limit=22.5 2023-10-09 20:30:27,112 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=123130.0, ans=0.125 2023-10-09 20:30:36,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=123130.0, ans=0.07 2023-10-09 20:30:36,241 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=123130.0, ans=0.1 2023-10-09 20:31:01,753 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=123223.33333333333, ans=0.2 2023-10-09 20:31:08,107 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.636e+02 2.040e+02 2.323e+02 2.740e+02 4.359e+02, threshold=4.647e+02, percent-clipped=0.0 2023-10-09 20:31:52,457 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=123456.66666666667, ans=0.025 2023-10-09 20:31:53,219 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=123456.66666666667, ans=0.0 2023-10-09 20:31:58,896 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=123456.66666666667, ans=0.2 2023-10-09 20:32:00,695 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=123503.33333333333, ans=0.0 2023-10-09 20:32:06,327 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 20:32:15,819 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.45 vs. limit=15.0 2023-10-09 20:32:44,353 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=123643.33333333333, ans=0.125 2023-10-09 20:32:52,920 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=123690.0, ans=0.125 2023-10-09 20:33:03,623 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.75 vs. limit=15.0 2023-10-09 20:33:08,175 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.646e+02 2.059e+02 2.332e+02 2.748e+02 3.985e+02, threshold=4.664e+02, percent-clipped=0.0 2023-10-09 20:33:08,561 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=123736.66666666667, ans=0.0 2023-10-09 20:33:20,896 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.81 vs. limit=6.0 2023-10-09 20:33:31,227 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.max_abs, batch_count=123830.0, ans=10.0 2023-10-09 20:33:51,975 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=123923.33333333333, ans=0.125 2023-10-09 20:34:03,905 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.57 vs. 
limit=15.0 2023-10-09 20:34:08,999 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=123970.0, ans=0.1 2023-10-09 20:34:20,584 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.05 vs. limit=22.5 2023-10-09 20:34:38,953 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=124110.0, ans=0.125 2023-10-09 20:34:44,594 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.40 vs. limit=10.0 2023-10-09 20:35:00,840 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=124156.66666666667, ans=0.0 2023-10-09 20:35:04,264 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 20:35:09,316 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.606e+02 2.043e+02 2.308e+02 2.671e+02 3.646e+02, threshold=4.616e+02, percent-clipped=0.0 2023-10-09 20:35:12,731 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-09 20:35:15,149 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.19 vs. limit=12.0 2023-10-09 20:35:16,902 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=124250.0, ans=0.125 2023-10-09 20:35:40,412 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=124343.33333333333, ans=0.125 2023-10-09 20:35:50,393 INFO [train.py:1031] (0/4) Epoch 2, batch 13000, loss[loss=0.2633, simple_loss=0.3442, pruned_loss=0.09122, over 16852.00 frames. ], tot_loss[loss=0.2905, simple_loss=0.3594, pruned_loss=0.1108, over 32771793.66 frames. ], batch size: 98, lr: 1.97e-02, grad_scale: 32.0 2023-10-09 20:36:06,423 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.61 vs. limit=15.0 2023-10-09 20:36:24,890 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=124483.33333333333, ans=0.125 2023-10-09 20:36:33,509 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=124530.0, ans=0.2 2023-10-09 20:36:44,761 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=124576.66666666667, ans=0.0 2023-10-09 20:36:50,381 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=124576.66666666667, ans=0.125 2023-10-09 20:37:14,129 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.64 vs. 
limit=22.5 2023-10-09 20:37:18,081 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.516e+02 1.942e+02 2.218e+02 2.568e+02 3.708e+02, threshold=4.436e+02, percent-clipped=0.0 2023-10-09 20:37:24,364 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=124670.0, ans=0.0 2023-10-09 20:38:03,546 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.82 vs. limit=15.0 2023-10-09 20:38:17,656 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.53 vs. limit=15.0 2023-10-09 20:38:26,034 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 20:38:38,059 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=124950.0, ans=0.125 2023-10-09 20:38:40,767 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=124996.66666666667, ans=0.125 2023-10-09 20:39:18,784 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.592e+02 2.039e+02 2.307e+02 2.809e+02 4.569e+02, threshold=4.613e+02, percent-clipped=1.0 2023-10-09 20:39:26,860 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=125183.33333333333, ans=0.125 2023-10-09 20:39:57,817 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.30 vs. limit=15.0 2023-10-09 20:40:05,608 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=125323.33333333333, ans=0.0 2023-10-09 20:40:09,931 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=125323.33333333333, ans=0.2 2023-10-09 20:40:25,083 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=125370.0, ans=0.125 2023-10-09 20:40:32,513 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=125416.66666666667, ans=0.2 2023-10-09 20:40:32,596 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.83 vs. 
limit=15.0 2023-10-09 20:40:41,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=125463.33333333333, ans=0.125 2023-10-09 20:41:10,542 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=125603.33333333333, ans=0.125 2023-10-09 20:41:16,190 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.555e+02 2.138e+02 2.441e+02 2.956e+02 4.858e+02, threshold=4.881e+02, percent-clipped=1.0 2023-10-09 20:41:23,876 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=125650.0, ans=0.1 2023-10-09 20:41:23,982 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=125650.0, ans=0.1 2023-10-09 20:41:25,767 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.68 vs. limit=22.5 2023-10-09 20:41:28,711 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.29 vs. limit=6.0 2023-10-09 20:41:48,564 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=125743.33333333333, ans=0.2 2023-10-09 20:41:54,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=125743.33333333333, ans=0.125 2023-10-09 20:42:15,939 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.91 vs. limit=6.0 2023-10-09 20:42:29,016 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.50 vs. limit=22.5 2023-10-09 20:42:30,989 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.08 vs. 
limit=12.0 2023-10-09 20:42:36,976 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=125930.0, ans=0.125 2023-10-09 20:42:38,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2.whitening_limit, batch_count=125930.0, ans=15.0 2023-10-09 20:42:40,599 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=125976.66666666667, ans=0.0 2023-10-09 20:42:51,674 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=126023.33333333333, ans=0.1 2023-10-09 20:43:06,158 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.546e+02 2.039e+02 2.341e+02 2.618e+02 4.220e+02, threshold=4.682e+02, percent-clipped=0.0 2023-10-09 20:43:10,896 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=126116.66666666667, ans=0.125 2023-10-09 20:43:16,939 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=126116.66666666667, ans=0.125 2023-10-09 20:43:30,790 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=126163.33333333333, ans=0.05 2023-10-09 20:43:35,607 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=126210.0, ans=0.125 2023-10-09 20:43:58,993 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=126303.33333333333, ans=0.125 2023-10-09 20:44:05,592 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.68 vs. 
limit=15.0 2023-10-09 20:44:20,688 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=126396.66666666667, ans=0.04949747468305833 2023-10-09 20:44:26,425 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=126396.66666666667, ans=0.125 2023-10-09 20:44:29,794 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=126443.33333333333, ans=0.0 2023-10-09 20:44:33,634 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=126443.33333333333, ans=0.2 2023-10-09 20:44:42,148 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=126490.0, ans=0.0 2023-10-09 20:44:47,404 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=126490.0, ans=0.09899494936611666 2023-10-09 20:44:49,900 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=126536.66666666667, ans=0.1 2023-10-09 20:44:51,990 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=126536.66666666667, ans=0.125 2023-10-09 20:44:55,150 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.421e+02 1.995e+02 2.342e+02 2.736e+02 3.888e+02, threshold=4.684e+02, percent-clipped=0.0 2023-10-09 20:45:02,887 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=126583.33333333333, ans=0.07 2023-10-09 20:45:31,982 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=126676.66666666667, ans=0.0 2023-10-09 20:45:34,110 INFO [train.py:1031] (0/4) Epoch 2, batch 13500, loss[loss=0.2926, simple_loss=0.3529, pruned_loss=0.1162, over 16653.00 frames. ], tot_loss[loss=0.2893, simple_loss=0.3582, pruned_loss=0.1103, over 32769250.25 frames. ], batch size: 56, lr: 1.95e-02, grad_scale: 64.0 2023-10-09 20:46:06,621 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=126816.66666666667, ans=0.125 2023-10-09 20:46:29,529 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=126956.66666666667, ans=0.1 2023-10-09 20:46:40,877 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.61 vs. limit=15.0 2023-10-09 20:46:41,852 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.16 vs. 
limit=15.0 2023-10-09 20:46:42,632 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 20:46:43,466 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=127003.33333333333, ans=0.0 2023-10-09 20:46:46,117 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.596e+02 2.076e+02 2.424e+02 2.712e+02 4.296e+02, threshold=4.848e+02, percent-clipped=0.0 2023-10-09 20:46:51,392 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=127050.0, ans=15.0 2023-10-09 20:47:16,343 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=127143.33333333333, ans=0.1 2023-10-09 20:47:16,419 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=127143.33333333333, ans=0.125 2023-10-09 20:47:18,782 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=127143.33333333333, ans=0.125 2023-10-09 20:47:21,545 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=127143.33333333333, ans=0.0 2023-10-09 20:47:26,896 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=127190.0, ans=0.125 2023-10-09 20:47:39,776 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.90 vs. limit=15.0 2023-10-09 20:47:44,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=127236.66666666667, ans=0.125 2023-10-09 20:48:03,279 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.82 vs. limit=15.0 2023-10-09 20:48:19,234 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/epoch-2.pt 2023-10-09 20:48:49,784 INFO [train.py:1031] (0/4) Epoch 3, batch 0, loss[loss=0.2552, simple_loss=0.3305, pruned_loss=0.08995, over 16395.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3305, pruned_loss=0.08995, over 16395.00 frames. ], batch size: 50, lr: 1.55e-02, grad_scale: 32.0 2023-10-09 20:48:49,785 INFO [train.py:1054] (0/4) Computing validation loss 2023-10-09 20:49:03,949 INFO [train.py:1063] (0/4) Epoch 3, validation: loss=0.2699, simple_loss=0.3526, pruned_loss=0.09359, over 1020973.00 frames. 2023-10-09 20:49:03,950 INFO [train.py:1064] (0/4) Maximum memory allocated so far is 17165MB 2023-10-09 20:49:18,695 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.809e+02 2.087e+02 2.261e+02 2.637e+02 4.213e+02, threshold=4.522e+02, percent-clipped=0.0 2023-10-09 20:49:46,066 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=127586.66666666667, ans=0.125 2023-10-09 20:49:51,194 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=127586.66666666667, ans=0.1 2023-10-09 20:49:54,164 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.05 vs. 
limit=15.0 2023-10-09 20:50:19,674 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=5.35 vs. limit=15.0 2023-10-09 20:50:26,369 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=127726.66666666667, ans=0.125 2023-10-09 20:50:26,471 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=127726.66666666667, ans=0.04949747468305833 2023-10-09 20:50:51,931 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=127866.66666666667, ans=0.2 2023-10-09 20:51:06,937 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.77 vs. limit=15.0 2023-10-09 20:51:12,944 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.547e+02 1.859e+02 2.073e+02 2.445e+02 3.112e+02, threshold=4.147e+02, percent-clipped=0.0 2023-10-09 20:51:17,429 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=127960.0, ans=0.1 2023-10-09 20:51:22,929 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=127960.0, ans=0.125 2023-10-09 20:51:32,099 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.85 vs. limit=15.0 2023-10-09 20:52:09,956 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=128193.33333333333, ans=0.0 2023-10-09 20:52:11,028 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=128193.33333333333, ans=0.125 2023-10-09 20:52:23,573 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=128240.0, ans=0.125 2023-10-09 20:52:28,431 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=128240.0, ans=0.0 2023-10-09 20:52:55,774 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=128380.0, ans=0.125 2023-10-09 20:53:04,225 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.538e+02 1.883e+02 2.159e+02 2.558e+02 4.049e+02, threshold=4.317e+02, percent-clipped=0.0 2023-10-09 20:53:05,434 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=128426.66666666667, ans=0.125 2023-10-09 20:53:17,861 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.61 vs. limit=15.0 2023-10-09 20:53:33,767 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=128520.0, ans=0.2 2023-10-09 20:53:33,931 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.44 vs. limit=15.0 2023-10-09 20:53:43,560 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.90 vs. 
limit=15.0 2023-10-09 20:53:47,698 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=128566.66666666667, ans=0.0 2023-10-09 20:54:03,505 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.98 vs. limit=22.5 2023-10-09 20:54:07,651 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.07 vs. limit=15.0 2023-10-09 20:54:15,612 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=128706.66666666667, ans=0.0 2023-10-09 20:54:27,279 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=128753.33333333333, ans=0.0 2023-10-09 20:54:28,891 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=128753.33333333333, ans=0.1 2023-10-09 20:54:34,090 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=128753.33333333333, ans=0.0 2023-10-09 20:54:46,519 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=128800.0, ans=0.125 2023-10-09 20:54:54,803 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=128846.66666666667, ans=0.125 2023-10-09 20:54:54,918 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=128846.66666666667, ans=0.1 2023-10-09 20:55:00,015 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.873e+02 2.047e+02 2.304e+02 3.922e+02, threshold=4.094e+02, percent-clipped=0.0 2023-10-09 20:55:01,463 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=128893.33333333333, ans=0.1 2023-10-09 20:55:39,620 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=129033.33333333333, ans=0.1 2023-10-09 20:55:41,836 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.54 vs. limit=15.0 2023-10-09 20:56:09,631 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=129173.33333333333, ans=0.2 2023-10-09 20:56:09,753 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=129173.33333333333, ans=0.125 2023-10-09 20:56:11,984 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.70 vs. limit=15.0 2023-10-09 20:56:26,734 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=1.99 vs. 
limit=15.0 2023-10-09 20:56:37,451 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=129266.66666666667, ans=0.125 2023-10-09 20:56:43,867 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=129313.33333333333, ans=0.125 2023-10-09 20:56:48,873 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=129313.33333333333, ans=0.1 2023-10-09 20:56:53,090 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=129313.33333333333, ans=0.125 2023-10-09 20:56:54,629 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.438e+02 1.916e+02 2.154e+02 2.409e+02 3.347e+02, threshold=4.309e+02, percent-clipped=0.0 2023-10-09 20:56:59,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=129360.0, ans=0.1 2023-10-09 20:57:22,226 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.59 vs. limit=6.0 2023-10-09 20:57:22,881 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=129453.33333333333, ans=0.09899494936611666 2023-10-09 20:57:36,673 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=129500.0, ans=0.0 2023-10-09 20:57:46,126 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=129546.66666666667, ans=0.0 2023-10-09 20:58:00,926 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=129593.33333333333, ans=0.125 2023-10-09 20:58:23,019 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=129686.66666666667, ans=0.09899494936611666 2023-10-09 20:58:36,944 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=129733.33333333333, ans=0.1 2023-10-09 20:58:41,167 INFO [train.py:1031] (0/4) Epoch 3, batch 500, loss[loss=0.2379, simple_loss=0.322, pruned_loss=0.07684, over 16814.00 frames. ], tot_loss[loss=0.2782, simple_loss=0.3498, pruned_loss=0.1033, over 7269442.02 frames. ], batch size: 175, lr: 1.54e-02, grad_scale: 32.0 2023-10-09 20:58:51,790 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.490e+02 1.875e+02 2.153e+02 2.502e+02 4.004e+02, threshold=4.305e+02, percent-clipped=0.0 2023-10-09 20:59:00,664 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.71 vs. limit=6.0 2023-10-09 20:59:14,918 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.97 vs. limit=15.0 2023-10-09 20:59:18,991 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.26 vs. limit=15.0 2023-10-09 20:59:20,503 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.97 vs. 
limit=22.5 2023-10-09 20:59:31,928 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=129966.66666666667, ans=0.125 2023-10-09 20:59:40,207 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.19 vs. limit=15.0 2023-10-09 21:00:05,863 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.14 vs. limit=6.0 2023-10-09 21:00:15,399 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=130153.33333333333, ans=0.2 2023-10-09 21:00:23,170 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.60 vs. limit=15.0 2023-10-09 21:00:23,178 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten.whitening_limit, batch_count=130200.0, ans=15.0 2023-10-09 21:00:24,165 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.44 vs. limit=6.0 2023-10-09 21:00:33,793 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=130246.66666666667, ans=0.125 2023-10-09 21:00:42,874 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.617e+02 1.868e+02 2.123e+02 2.425e+02 4.212e+02, threshold=4.246e+02, percent-clipped=0.0 2023-10-09 21:00:43,551 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.77 vs. limit=15.0 2023-10-09 21:00:50,987 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=130293.33333333333, ans=0.125 2023-10-09 21:01:12,017 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=130386.66666666667, ans=0.1 2023-10-09 21:01:20,700 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.42 vs. limit=15.0 2023-10-09 21:01:48,213 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=130526.66666666667, ans=0.0 2023-10-09 21:01:48,519 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=5.12 vs. limit=12.0 2023-10-09 21:02:02,425 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.31 vs. 
limit=22.5 2023-10-09 21:02:15,015 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=130666.66666666667, ans=0.125 2023-10-09 21:02:16,873 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=130666.66666666667, ans=0.0 2023-10-09 21:02:19,867 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=130666.66666666667, ans=0.1 2023-10-09 21:02:26,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=130713.33333333333, ans=0.1 2023-10-09 21:02:37,913 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.570e+02 1.928e+02 2.179e+02 2.588e+02 4.693e+02, threshold=4.359e+02, percent-clipped=2.0 2023-10-09 21:02:45,948 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=130760.0, ans=0.1 2023-10-09 21:02:55,265 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=130806.66666666667, ans=0.5 2023-10-09 21:03:01,650 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=130853.33333333333, ans=0.0 2023-10-09 21:03:01,801 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=130853.33333333333, ans=0.125 2023-10-09 21:03:22,853 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=130900.0, ans=0.1 2023-10-09 21:03:40,063 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=130993.33333333333, ans=0.1 2023-10-09 21:03:46,579 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=130993.33333333333, ans=0.125 2023-10-09 21:03:56,929 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten.whitening_limit, batch_count=131040.0, ans=15.0 2023-10-09 21:03:59,606 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=131086.66666666666, ans=0.0 2023-10-09 21:04:03,113 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=131086.66666666666, ans=0.1 2023-10-09 21:04:10,725 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=131086.66666666666, ans=0.1 2023-10-09 21:04:13,639 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=131133.33333333334, ans=0.0 2023-10-09 21:04:25,841 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.12 vs. 
limit=6.0 2023-10-09 21:04:33,721 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.369e+02 1.813e+02 2.082e+02 2.326e+02 3.136e+02, threshold=4.164e+02, percent-clipped=0.0 2023-10-09 21:04:40,713 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=131226.66666666666, ans=0.125 2023-10-09 21:04:48,790 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=131273.33333333334, ans=0.0 2023-10-09 21:04:53,873 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.60 vs. limit=15.0 2023-10-09 21:04:58,450 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=131273.33333333334, ans=0.1 2023-10-09 21:05:34,943 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=15.0 2023-10-09 21:06:35,551 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=131646.66666666666, ans=0.125 2023-10-09 21:06:38,472 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=131693.33333333334, ans=0.0 2023-10-09 21:06:39,223 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.954e+02 2.283e+02 2.543e+02 4.639e+02, threshold=4.567e+02, percent-clipped=1.0 2023-10-09 21:06:40,735 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.74 vs. limit=22.5 2023-10-09 21:07:18,929 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=131833.33333333334, ans=0.0 2023-10-09 21:07:26,421 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.45 vs. limit=22.5 2023-10-09 21:07:30,300 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.47 vs. limit=15.0 2023-10-09 21:07:37,224 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=131880.0, ans=0.0 2023-10-09 21:07:48,607 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.37 vs. limit=15.0 2023-10-09 21:08:20,446 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=15.0 2023-10-09 21:08:26,283 INFO [train.py:1031] (0/4) Epoch 3, batch 1000, loss[loss=0.2828, simple_loss=0.3516, pruned_loss=0.107, over 16741.00 frames. ], tot_loss[loss=0.2761, simple_loss=0.3488, pruned_loss=0.1017, over 12962377.68 frames. ], batch size: 56, lr: 1.52e-02, grad_scale: 32.0 2023-10-09 21:08:26,830 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.95 vs. limit=15.0 2023-10-09 21:08:34,589 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.63 vs. 
limit=15.0 2023-10-09 21:08:39,754 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.471e+02 1.798e+02 2.081e+02 2.413e+02 4.199e+02, threshold=4.163e+02, percent-clipped=0.0 2023-10-09 21:08:50,369 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=132206.66666666666, ans=0.0 2023-10-09 21:08:54,633 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.74 vs. limit=12.0 2023-10-09 21:08:56,975 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=132206.66666666666, ans=0.035 2023-10-09 21:09:02,525 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=132253.33333333334, ans=6.0 2023-10-09 21:09:11,367 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.26 vs. limit=15.0 2023-10-09 21:09:23,309 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=132346.66666666666, ans=0.125 2023-10-09 21:09:42,413 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.58 vs. limit=15.0 2023-10-09 21:09:48,056 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=132440.0, ans=0.125 2023-10-09 21:10:00,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=132486.66666666666, ans=0.125 2023-10-09 21:10:33,767 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=132580.0, ans=0.125 2023-10-09 21:10:38,700 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.820e+02 2.013e+02 2.251e+02 3.213e+02, threshold=4.025e+02, percent-clipped=0.0 2023-10-09 21:11:11,623 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=132720.0, ans=0.125 2023-10-09 21:11:19,368 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.42 vs. limit=22.5 2023-10-09 21:12:01,463 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.25 vs. limit=6.0 2023-10-09 21:12:05,913 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=132906.66666666666, ans=0.125 2023-10-09 21:12:12,004 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=132906.66666666666, ans=0.0 2023-10-09 21:12:36,641 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=133000.0, ans=0.125 2023-10-09 21:12:50,545 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.77 vs. 
limit=15.0 2023-10-09 21:12:50,777 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.382e+02 1.853e+02 2.065e+02 2.454e+02 3.688e+02, threshold=4.130e+02, percent-clipped=0.0 2023-10-09 21:12:51,456 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=8.18 vs. limit=15.0 2023-10-09 21:13:02,548 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=133140.0, ans=0.125 2023-10-09 21:13:19,301 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=133186.66666666666, ans=0.0 2023-10-09 21:13:39,157 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.46 vs. limit=10.0 2023-10-09 21:13:41,364 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=133280.0, ans=0.125 2023-10-09 21:14:25,879 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=133466.66666666666, ans=0.0 2023-10-09 21:14:31,620 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=133466.66666666666, ans=0.125 2023-10-09 21:14:47,126 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.422e+02 1.859e+02 2.175e+02 2.571e+02 4.066e+02, threshold=4.350e+02, percent-clipped=0.0 2023-10-09 21:15:06,686 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=133606.66666666666, ans=0.125 2023-10-09 21:15:36,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=133746.66666666666, ans=0.125 2023-10-09 21:15:42,912 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.65 vs. limit=15.0 2023-10-09 21:15:46,994 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=133793.33333333334, ans=0.125 2023-10-09 21:15:48,033 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=133793.33333333334, ans=0.125 2023-10-09 21:16:01,599 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=133840.0, ans=0.125 2023-10-09 21:16:20,713 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.91 vs. 
limit=6.0 2023-10-09 21:16:33,527 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=133933.33333333334, ans=0.09899494936611666 2023-10-09 21:16:43,081 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=133980.0, ans=0.125 2023-10-09 21:16:48,468 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=134026.66666666666, ans=0.025 2023-10-09 21:16:50,284 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.471e+02 1.835e+02 2.094e+02 2.277e+02 3.295e+02, threshold=4.189e+02, percent-clipped=0.0 2023-10-09 21:17:04,934 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=134073.33333333334, ans=0.125 2023-10-09 21:17:06,730 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=134073.33333333334, ans=0.125 2023-10-09 21:17:07,726 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=134073.33333333334, ans=0.0 2023-10-09 21:17:22,001 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.09 vs. limit=22.5 2023-10-09 21:17:43,682 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=134213.33333333334, ans=0.0 2023-10-09 21:17:46,293 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=134213.33333333334, ans=0.0 2023-10-09 21:18:01,666 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=134306.66666666666, ans=0.1 2023-10-09 21:18:13,389 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=134353.33333333334, ans=0.125 2023-10-09 21:18:25,173 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=134400.0, ans=0.2 2023-10-09 21:18:29,653 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=134400.0, ans=0.1 2023-10-09 21:18:38,864 INFO [train.py:1031] (0/4) Epoch 3, batch 1500, loss[loss=0.2898, simple_loss=0.358, pruned_loss=0.1108, over 16335.00 frames. ], tot_loss[loss=0.2735, simple_loss=0.3464, pruned_loss=0.1003, over 17354197.50 frames. 
], batch size: 50, lr: 1.51e-02, grad_scale: 16.0 2023-10-09 21:18:40,977 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=134446.66666666666, ans=0.125 2023-10-09 21:18:51,700 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.804e+02 2.060e+02 2.368e+02 3.438e+02, threshold=4.120e+02, percent-clipped=0.0 2023-10-09 21:18:51,968 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=134493.33333333334, ans=0.0 2023-10-09 21:19:38,247 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=134680.0, ans=0.125 2023-10-09 21:19:49,644 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=134726.66666666666, ans=0.0 2023-10-09 21:20:30,873 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=134913.33333333334, ans=0.025 2023-10-09 21:20:33,809 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=134913.33333333334, ans=0.125 2023-10-09 21:20:38,704 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=134913.33333333334, ans=0.04949747468305833 2023-10-09 21:20:39,096 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-10-09 21:20:39,825 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=134913.33333333334, ans=0.0 2023-10-09 21:20:44,121 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.401e+02 1.868e+02 2.118e+02 2.410e+02 3.223e+02, threshold=4.236e+02, percent-clipped=0.0 2023-10-09 21:21:09,294 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=135053.33333333334, ans=0.125 2023-10-09 21:21:16,855 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-09 21:21:40,571 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=135146.66666666666, ans=0.2 2023-10-09 21:21:47,815 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=135193.33333333334, ans=0.125 2023-10-09 21:22:11,130 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=135286.66666666666, ans=0.0 2023-10-09 21:22:18,374 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.70 vs. 
limit=15.0 2023-10-09 21:22:41,033 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=135380.0, ans=0.125 2023-10-09 21:22:45,702 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.541e+02 1.954e+02 2.268e+02 2.619e+02 4.378e+02, threshold=4.536e+02, percent-clipped=1.0 2023-10-09 21:23:09,491 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=135520.0, ans=0.0 2023-10-09 21:23:24,265 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=135566.66666666666, ans=0.1 2023-10-09 21:23:27,358 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=135566.66666666666, ans=0.0 2023-10-09 21:23:36,439 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=135613.33333333334, ans=0.0 2023-10-09 21:23:49,397 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=135660.0, ans=0.125 2023-10-09 21:23:51,900 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=135706.66666666666, ans=0.07 2023-10-09 21:23:55,361 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=135706.66666666666, ans=0.125 2023-10-09 21:24:32,867 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=135846.66666666666, ans=0.125 2023-10-09 21:24:46,395 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.421e+02 1.886e+02 2.087e+02 2.417e+02 3.075e+02, threshold=4.173e+02, percent-clipped=0.0 2023-10-09 21:25:28,403 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=136033.33333333334, ans=0.0 2023-10-09 21:25:32,855 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=136080.0, ans=0.125 2023-10-09 21:25:35,100 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=136080.0, ans=0.125 2023-10-09 21:25:38,446 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.71 vs. 
limit=10.0 2023-10-09 21:25:40,047 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=136080.0, ans=0.1 2023-10-09 21:25:42,075 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=136080.0, ans=0.1 2023-10-09 21:25:59,715 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=136173.33333333334, ans=0.0 2023-10-09 21:26:13,074 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 21:26:17,546 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=136220.0, ans=0.0 2023-10-09 21:26:28,208 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=136266.66666666666, ans=0.0 2023-10-09 21:26:44,819 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.886e+02 2.147e+02 2.562e+02 4.596e+02, threshold=4.294e+02, percent-clipped=1.0 2023-10-09 21:27:01,782 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.min_positive, batch_count=136406.66666666666, ans=0.025 2023-10-09 21:27:10,011 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=136453.33333333334, ans=0.2 2023-10-09 21:27:49,313 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=136593.33333333334, ans=0.07 2023-10-09 21:28:30,160 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.70 vs. limit=15.0 2023-10-09 21:28:31,720 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.51 vs. limit=15.0 2023-10-09 21:28:37,468 INFO [train.py:1031] (0/4) Epoch 3, batch 2000, loss[loss=0.3063, simple_loss=0.3735, pruned_loss=0.1195, over 16526.00 frames. ], tot_loss[loss=0.2734, simple_loss=0.3467, pruned_loss=0.1001, over 20775663.21 frames. ], batch size: 266, lr: 1.50e-02, grad_scale: 32.0 2023-10-09 21:28:39,519 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 21:28:45,592 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.78 vs. 
limit=12.0 2023-10-09 21:28:47,020 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=136826.66666666666, ans=0.025 2023-10-09 21:28:50,722 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.517e+02 1.938e+02 2.325e+02 2.669e+02 4.307e+02, threshold=4.649e+02, percent-clipped=1.0 2023-10-09 21:28:53,183 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=136826.66666666666, ans=0.015 2023-10-09 21:29:11,024 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=136873.33333333334, ans=0.125 2023-10-09 21:29:12,310 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=136873.33333333334, ans=0.125 2023-10-09 21:29:29,786 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=136920.0, ans=0.125 2023-10-09 21:29:35,178 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=136966.66666666666, ans=0.125 2023-10-09 21:29:56,051 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=137060.0, ans=0.125 2023-10-09 21:30:00,647 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=137060.0, ans=0.125 2023-10-09 21:30:26,823 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.62 vs. limit=15.0 2023-10-09 21:30:51,606 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=137246.66666666666, ans=0.125 2023-10-09 21:30:53,802 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=137246.66666666666, ans=0.125 2023-10-09 21:31:07,132 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=137293.33333333334, ans=10.0 2023-10-09 21:31:07,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=137293.33333333334, ans=0.04949747468305833 2023-10-09 21:31:07,777 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.344e+02 1.905e+02 2.125e+02 2.441e+02 3.596e+02, threshold=4.250e+02, percent-clipped=0.0 2023-10-09 21:31:16,812 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=137293.33333333334, ans=0.125 2023-10-09 21:31:18,117 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=137293.33333333334, ans=0.125 2023-10-09 21:31:31,981 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=137340.0, ans=0.1 2023-10-09 21:31:34,843 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.92 vs. 
limit=22.5 2023-10-09 21:31:51,194 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=137433.33333333334, ans=0.2 2023-10-09 21:32:02,684 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=137480.0, ans=0.125 2023-10-09 21:32:28,376 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=137573.33333333334, ans=0.2 2023-10-09 21:32:51,433 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=137666.66666666666, ans=0.2 2023-10-09 21:33:02,722 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=137713.33333333334, ans=0.2 2023-10-09 21:33:12,342 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=137713.33333333334, ans=0.125 2023-10-09 21:33:15,723 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.560e+02 1.915e+02 2.105e+02 2.534e+02 3.792e+02, threshold=4.210e+02, percent-clipped=0.0 2023-10-09 21:33:25,310 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=137806.66666666666, ans=10.0 2023-10-09 21:33:46,624 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=137900.0, ans=0.2 2023-10-09 21:33:55,973 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=137946.66666666666, ans=0.1 2023-10-09 21:33:56,775 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=137946.66666666666, ans=0.0 2023-10-09 21:34:09,157 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.54 vs. limit=5.0 2023-10-09 21:34:13,596 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.27 vs. limit=22.5 2023-10-09 21:34:57,310 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.04 vs. 
limit=15.0 2023-10-09 21:34:57,860 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=138180.0, ans=0.125 2023-10-09 21:35:02,156 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.565e+02 1.965e+02 2.340e+02 2.742e+02 3.516e+02, threshold=4.680e+02, percent-clipped=0.0 2023-10-09 21:35:04,402 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=138226.66666666666, ans=0.0 2023-10-09 21:35:27,626 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=138320.0, ans=0.0 2023-10-09 21:35:43,239 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=138366.66666666666, ans=0.125 2023-10-09 21:35:47,539 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=138413.33333333334, ans=0.0 2023-10-09 21:36:18,088 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=138553.33333333334, ans=0.125 2023-10-09 21:36:18,979 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=138553.33333333334, ans=0.1 2023-10-09 21:36:19,855 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=138553.33333333334, ans=0.125 2023-10-09 21:36:29,210 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.67 vs. limit=12.0 2023-10-09 21:36:36,350 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=138600.0, ans=0.125 2023-10-09 21:36:54,371 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.425e+02 1.892e+02 2.050e+02 2.380e+02 3.780e+02, threshold=4.101e+02, percent-clipped=0.0 2023-10-09 21:37:00,097 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=138693.33333333334, ans=0.125 2023-10-09 21:37:01,279 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=138693.33333333334, ans=0.125 2023-10-09 21:37:07,011 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=138740.0, ans=0.2 2023-10-09 21:37:09,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=138740.0, ans=0.1 2023-10-09 21:37:17,558 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.76 vs. 
limit=10.0 2023-10-09 21:37:20,530 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=138786.66666666666, ans=0.125 2023-10-09 21:37:31,785 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=138833.33333333334, ans=0.125 2023-10-09 21:37:53,886 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=138973.33333333334, ans=0.0 2023-10-09 21:37:59,071 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=138973.33333333334, ans=0.0 2023-10-09 21:38:05,919 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=139020.0, ans=0.2 2023-10-09 21:38:06,780 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=139020.0, ans=0.125 2023-10-09 21:38:27,813 INFO [train.py:1031] (0/4) Epoch 3, batch 2500, loss[loss=0.3187, simple_loss=0.3607, pruned_loss=0.1384, over 15608.00 frames. ], tot_loss[loss=0.2733, simple_loss=0.3464, pruned_loss=0.1001, over 23430533.34 frames. ], batch size: 350, lr: 1.49e-02, grad_scale: 32.0 2023-10-09 21:38:34,499 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=139113.33333333334, ans=0.125 2023-10-09 21:38:40,032 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.611e+02 1.944e+02 2.156e+02 2.452e+02 4.104e+02, threshold=4.312e+02, percent-clipped=1.0 2023-10-09 21:39:16,004 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=6.77 vs. 
limit=12.0 2023-10-09 21:40:18,192 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=139580.0, ans=0.125 2023-10-09 21:40:25,805 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-09 21:40:29,029 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.573e+02 1.821e+02 2.217e+02 2.583e+02 3.986e+02, threshold=4.434e+02, percent-clipped=0.0 2023-10-09 21:40:36,721 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=139626.66666666666, ans=0.1 2023-10-09 21:40:59,897 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=139766.66666666666, ans=0.1 2023-10-09 21:41:25,023 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=139860.0, ans=0.0 2023-10-09 21:42:22,704 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.547e+02 1.998e+02 2.253e+02 2.659e+02 5.228e+02, threshold=4.506e+02, percent-clipped=1.0 2023-10-09 21:43:21,572 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=140326.66666666666, ans=0.1 2023-10-09 21:43:40,568 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=140373.33333333334, ans=0.05 2023-10-09 21:44:03,266 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=140466.66666666666, ans=0.125 2023-10-09 21:44:22,512 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.468e+02 1.829e+02 2.023e+02 2.246e+02 3.565e+02, threshold=4.046e+02, percent-clipped=0.0 2023-10-09 21:45:06,924 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=140700.0, ans=0.125 2023-10-09 21:45:08,370 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=140700.0, ans=0.0 2023-10-09 21:45:31,991 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-09 21:45:44,221 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=140840.0, ans=0.0 2023-10-09 21:45:48,424 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=140840.0, ans=0.2 2023-10-09 21:46:01,307 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=140886.66666666666, ans=0.2 2023-10-09 21:46:03,101 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=140886.66666666666, ans=0.125 2023-10-09 21:46:27,443 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.71 vs. 
limit=15.0 2023-10-09 21:46:32,942 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.621e+02 1.916e+02 2.204e+02 2.611e+02 3.602e+02, threshold=4.408e+02, percent-clipped=0.0 2023-10-09 21:46:34,433 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.33 vs. limit=15.0 2023-10-09 21:46:47,570 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.82 vs. limit=22.5 2023-10-09 21:47:00,371 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=141120.0, ans=0.125 2023-10-09 21:47:00,617 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.77 vs. limit=22.5 2023-10-09 21:47:01,217 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=141120.0, ans=0.0 2023-10-09 21:47:10,202 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=141166.66666666666, ans=0.1 2023-10-09 21:47:20,611 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.89 vs. limit=12.0 2023-10-09 21:47:26,036 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=141260.0, ans=0.125 2023-10-09 21:47:30,247 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=141260.0, ans=0.125 2023-10-09 21:47:35,573 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=141260.0, ans=0.1 2023-10-09 21:47:45,761 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=141306.66666666666, ans=0.1 2023-10-09 21:47:48,821 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.04 vs. limit=15.0 2023-10-09 21:47:52,583 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=141353.33333333334, ans=0.125 2023-10-09 21:47:52,593 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=141353.33333333334, ans=0.1 2023-10-09 21:47:54,367 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=141353.33333333334, ans=0.0 2023-10-09 21:48:09,253 INFO [train.py:1031] (0/4) Epoch 3, batch 3000, loss[loss=0.2661, simple_loss=0.3377, pruned_loss=0.09732, over 16604.00 frames. ], tot_loss[loss=0.272, simple_loss=0.3449, pruned_loss=0.0995, over 25516269.69 frames. 
], batch size: 50, lr: 1.47e-02, grad_scale: 32.0 2023-10-09 21:48:21,507 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.772e+02 2.067e+02 2.351e+02 4.015e+02, threshold=4.134e+02, percent-clipped=0.0 2023-10-09 21:48:23,543 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=141493.33333333334, ans=0.1 2023-10-09 21:48:25,544 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=141493.33333333334, ans=0.0 2023-10-09 21:48:43,837 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=141586.66666666666, ans=0.125 2023-10-09 21:48:49,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=141586.66666666666, ans=15.0 2023-10-09 21:49:23,766 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=141726.66666666666, ans=0.125 2023-10-09 21:49:39,081 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=141820.0, ans=0.125 2023-10-09 21:49:39,852 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=141820.0, ans=0.0 2023-10-09 21:49:41,042 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=141820.0, ans=0.2 2023-10-09 21:49:45,118 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff3.min_abs, batch_count=141820.0, ans=0.2 2023-10-09 21:49:45,276 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=141820.0, ans=0.125 2023-10-09 21:49:48,195 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 21:49:54,976 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=141866.66666666666, ans=0.0 2023-10-09 21:50:04,051 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=141913.33333333334, ans=0.125 2023-10-09 21:50:09,002 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=141913.33333333334, ans=0.07 2023-10-09 21:50:14,723 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.90 vs. 
limit=15.0 2023-10-09 21:50:17,950 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.508e+02 1.859e+02 2.097e+02 2.355e+02 3.302e+02, threshold=4.195e+02, percent-clipped=0.0 2023-10-09 21:50:25,206 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=141960.0, ans=0.125 2023-10-09 21:50:43,222 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=142053.33333333334, ans=0.0 2023-10-09 21:51:38,290 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=142286.66666666666, ans=0.125 2023-10-09 21:51:46,835 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=142333.33333333334, ans=0.0 2023-10-09 21:52:02,903 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.03 vs. limit=12.0 2023-10-09 21:52:11,835 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.874e+02 2.180e+02 2.546e+02 4.109e+02, threshold=4.360e+02, percent-clipped=0.0 2023-10-09 21:52:45,972 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.32 vs. limit=22.5 2023-10-09 21:52:50,507 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=142566.66666666666, ans=0.125 2023-10-09 21:52:52,082 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=142566.66666666666, ans=0.2 2023-10-09 21:53:25,004 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=142660.0, ans=0.125 2023-10-09 21:53:45,070 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=142753.33333333334, ans=0.2 2023-10-09 21:53:49,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=142753.33333333334, ans=0.2 2023-10-09 21:54:16,858 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.825e+02 2.054e+02 2.290e+02 3.558e+02, threshold=4.107e+02, percent-clipped=0.0 2023-10-09 21:54:53,231 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 21:54:56,487 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=143033.33333333334, ans=0.125 2023-10-09 21:55:27,485 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=143173.33333333334, ans=0.2 2023-10-09 21:55:30,832 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.60 vs. 
limit=22.5 2023-10-09 21:55:38,805 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=143220.0, ans=0.125 2023-10-09 21:55:43,316 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=143220.0, ans=0.2 2023-10-09 21:55:57,661 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=143313.33333333334, ans=0.2 2023-10-09 21:56:04,806 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=143313.33333333334, ans=0.2 2023-10-09 21:56:09,803 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=143360.0, ans=0.125 2023-10-09 21:56:10,457 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.515e+02 1.980e+02 2.273e+02 2.671e+02 4.418e+02, threshold=4.547e+02, percent-clipped=1.0 2023-10-09 21:56:23,223 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=143406.66666666666, ans=0.125 2023-10-09 21:56:24,128 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=143406.66666666666, ans=0.0 2023-10-09 21:56:33,268 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=143453.33333333334, ans=0.125 2023-10-09 21:56:53,096 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=143546.66666666666, ans=0.125 2023-10-09 21:56:58,848 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.54 vs. limit=6.0 2023-10-09 21:57:15,407 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=143640.0, ans=0.125 2023-10-09 21:57:39,768 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=143733.33333333334, ans=0.125 2023-10-09 21:57:47,850 INFO [train.py:1031] (0/4) Epoch 3, batch 3500, loss[loss=0.262, simple_loss=0.3357, pruned_loss=0.09414, over 16597.00 frames. ], tot_loss[loss=0.2716, simple_loss=0.3445, pruned_loss=0.09931, over 27126237.86 frames. ], batch size: 61, lr: 1.46e-02, grad_scale: 32.0 2023-10-09 21:57:56,034 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=143780.0, ans=0.125 2023-10-09 21:58:02,225 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.506e+02 1.900e+02 2.131e+02 2.507e+02 3.406e+02, threshold=4.262e+02, percent-clipped=0.0 2023-10-09 21:58:12,260 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=143873.33333333334, ans=0.125 2023-10-09 21:58:12,552 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.76 vs. 
limit=22.5 2023-10-09 21:58:16,222 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=143873.33333333334, ans=0.125 2023-10-09 21:59:06,076 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=144106.66666666666, ans=0.1 2023-10-09 21:59:22,507 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=144106.66666666666, ans=0.1 2023-10-09 21:59:31,689 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=144153.33333333334, ans=0.07 2023-10-09 22:00:03,712 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.868e+02 2.114e+02 2.596e+02 3.837e+02, threshold=4.227e+02, percent-clipped=0.0 2023-10-09 22:00:14,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=144340.0, ans=0.035 2023-10-09 22:00:21,339 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=144340.0, ans=0.125 2023-10-09 22:00:34,231 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=144386.66666666666, ans=0.1 2023-10-09 22:00:44,926 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=144433.33333333334, ans=0.0 2023-10-09 22:00:47,190 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.65 vs. limit=15.0 2023-10-09 22:00:55,576 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=144480.0, ans=0.1 2023-10-09 22:00:57,734 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.11 vs. limit=15.0 2023-10-09 22:01:07,500 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=144526.66666666666, ans=0.0 2023-10-09 22:01:07,813 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten.whitening_limit, batch_count=144526.66666666666, ans=22.5 2023-10-09 22:01:16,062 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.07 vs. limit=10.0 2023-10-09 22:01:30,116 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.22 vs. 
limit=15.0 2023-10-09 22:01:48,802 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=144713.33333333334, ans=0.125 2023-10-09 22:01:51,554 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=144713.33333333334, ans=0.125 2023-10-09 22:02:01,718 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.559e+02 1.929e+02 2.281e+02 2.631e+02 4.263e+02, threshold=4.563e+02, percent-clipped=1.0 2023-10-09 22:02:09,566 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=144760.0, ans=0.125 2023-10-09 22:02:41,806 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=144900.0, ans=0.125 2023-10-09 22:02:45,335 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=144900.0, ans=0.125 2023-10-09 22:02:51,931 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=144946.66666666666, ans=0.0 2023-10-09 22:03:08,956 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=144993.33333333334, ans=0.0 2023-10-09 22:03:08,991 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=144993.33333333334, ans=0.125 2023-10-09 22:03:11,076 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.88 vs. limit=6.0 2023-10-09 22:03:30,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=145086.66666666666, ans=0.0 2023-10-09 22:03:31,823 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=145086.66666666666, ans=0.125 2023-10-09 22:03:54,266 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=145180.0, ans=0.125 2023-10-09 22:04:06,334 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.480e+02 1.904e+02 2.136e+02 2.407e+02 3.141e+02, threshold=4.272e+02, percent-clipped=0.0 2023-10-09 22:04:21,496 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=145273.33333333334, ans=0.125 2023-10-09 22:04:27,766 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=145273.33333333334, ans=0.125 2023-10-09 22:04:36,914 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=145320.0, ans=0.125 2023-10-09 22:04:54,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=145413.33333333334, ans=0.0 2023-10-09 22:05:16,324 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=145506.66666666666, ans=0.07 2023-10-09 22:05:16,339 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=145506.66666666666, ans=0.1 2023-10-09 22:05:36,309 INFO [scaling.py:979] (0/4) Whitening: 
name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.94 vs. limit=6.0 2023-10-09 22:06:04,625 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.436e+02 1.872e+02 2.136e+02 2.488e+02 3.570e+02, threshold=4.272e+02, percent-clipped=0.0 2023-10-09 22:06:14,701 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=145740.0, ans=0.07 2023-10-09 22:06:22,583 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=145786.66666666666, ans=15.0 2023-10-09 22:06:25,258 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=145786.66666666666, ans=0.125 2023-10-09 22:06:26,614 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.73 vs. limit=10.0 2023-10-09 22:06:28,103 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=145786.66666666666, ans=0.125 2023-10-09 22:06:48,571 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=145880.0, ans=0.0 2023-10-09 22:07:28,394 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=146066.66666666666, ans=0.125 2023-10-09 22:07:36,288 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.07 vs. limit=15.0 2023-10-09 22:07:40,644 INFO [train.py:1031] (0/4) Epoch 3, batch 4000, loss[loss=0.2962, simple_loss=0.3719, pruned_loss=0.1103, over 16038.00 frames. ], tot_loss[loss=0.2701, simple_loss=0.3431, pruned_loss=0.09851, over 28393772.02 frames. ], batch size: 297, lr: 1.45e-02, grad_scale: 32.0 2023-10-09 22:07:59,575 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.436e+02 1.880e+02 2.112e+02 2.537e+02 3.409e+02, threshold=4.224e+02, percent-clipped=0.0 2023-10-09 22:08:03,495 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 22:08:17,845 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=146206.66666666666, ans=0.125 2023-10-09 22:09:09,716 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=146440.0, ans=0.07 2023-10-09 22:09:23,733 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=146486.66666666666, ans=0.125 2023-10-09 22:09:26,553 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=146486.66666666666, ans=0.125 2023-10-09 22:09:29,021 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.39 vs. limit=22.5 2023-10-09 22:09:54,737 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.935e+02 2.190e+02 2.491e+02 3.231e+02, threshold=4.381e+02, percent-clipped=0.0 2023-10-09 22:10:08,527 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.53 vs. 
limit=15.0 2023-10-09 22:10:09,364 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=146673.33333333334, ans=0.125 2023-10-09 22:10:14,690 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=146720.0, ans=0.1 2023-10-09 22:10:14,914 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.00 vs. limit=15.0 2023-10-09 22:10:16,302 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 22:10:31,748 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=146766.66666666666, ans=0.125 2023-10-09 22:10:40,442 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=146813.33333333334, ans=0.125 2023-10-09 22:12:05,808 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.882e+02 2.150e+02 2.744e+02 3.738e+02, threshold=4.300e+02, percent-clipped=0.0 2023-10-09 22:12:06,200 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=147093.33333333334, ans=0.2 2023-10-09 22:12:16,538 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=147140.0, ans=0.0 2023-10-09 22:12:19,011 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=147140.0, ans=0.0 2023-10-09 22:12:24,665 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.38 vs. 
limit=6.0 2023-10-09 22:12:31,429 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=147186.66666666666, ans=0.5 2023-10-09 22:12:51,427 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=147280.0, ans=0.1 2023-10-09 22:13:01,212 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=147326.66666666666, ans=0.0 2023-10-09 22:13:07,706 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=147326.66666666666, ans=0.125 2023-10-09 22:13:08,568 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=147326.66666666666, ans=0.1 2023-10-09 22:13:23,979 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-09 22:14:01,520 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.371e+02 1.894e+02 2.086e+02 2.449e+02 4.266e+02, threshold=4.172e+02, percent-clipped=0.0 2023-10-09 22:14:14,916 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=147606.66666666666, ans=0.125 2023-10-09 22:14:24,089 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=147653.33333333334, ans=0.0 2023-10-09 22:14:34,358 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=147700.0, ans=0.1 2023-10-09 22:14:37,827 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=147700.0, ans=0.0 2023-10-09 22:14:38,841 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=147700.0, ans=0.125 2023-10-09 22:14:40,161 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.57 vs. limit=15.0 2023-10-09 22:14:46,878 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=147746.66666666666, ans=0.125 2023-10-09 22:14:55,282 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=147793.33333333334, ans=0.125 2023-10-09 22:15:00,532 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=147793.33333333334, ans=0.0 2023-10-09 22:15:01,602 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.17 vs. 
limit=22.5 2023-10-09 22:15:05,313 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=147793.33333333334, ans=0.0 2023-10-09 22:15:08,134 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=147793.33333333334, ans=0.2 2023-10-09 22:15:09,133 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=147840.0, ans=0.0 2023-10-09 22:15:24,272 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=147886.66666666666, ans=0.0 2023-10-09 22:15:28,277 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.13 vs. limit=15.0 2023-10-09 22:15:38,224 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=7.80 vs. limit=15.0 2023-10-09 22:15:44,518 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=147933.33333333334, ans=0.0 2023-10-09 22:16:02,222 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.513e+02 2.131e+02 2.486e+02 2.958e+02 3.911e+02, threshold=4.973e+02, percent-clipped=0.0 2023-10-09 22:16:03,614 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=148026.66666666666, ans=0.125 2023-10-09 22:16:17,614 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=148073.33333333334, ans=0.125 2023-10-09 22:16:17,712 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=148073.33333333334, ans=0.0 2023-10-09 22:17:12,909 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.28 vs. limit=12.0 2023-10-09 22:17:18,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=148306.66666666666, ans=0.125 2023-10-09 22:17:19,327 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=148306.66666666666, ans=0.1 2023-10-09 22:17:24,994 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=148306.66666666666, ans=0.1 2023-10-09 22:17:36,410 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.72 vs. limit=15.0 2023-10-09 22:17:51,553 INFO [train.py:1031] (0/4) Epoch 3, batch 4500, loss[loss=0.2348, simple_loss=0.2876, pruned_loss=0.09098, over 12613.00 frames. ], tot_loss[loss=0.2695, simple_loss=0.343, pruned_loss=0.09804, over 29359604.37 frames. ], batch size: 440, lr: 1.44e-02, grad_scale: 32.0 2023-10-09 22:18:05,663 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.72 vs. 
limit=15.0 2023-10-09 22:18:06,811 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.781e+02 1.940e+02 2.181e+02 3.401e+02, threshold=3.880e+02, percent-clipped=0.0 2023-10-09 22:18:19,463 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=148540.0, ans=0.04949747468305833 2023-10-09 22:18:23,635 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=148586.66666666666, ans=0.0 2023-10-09 22:19:02,796 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=148726.66666666666, ans=0.1 2023-10-09 22:19:14,456 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.98 vs. limit=15.0 2023-10-09 22:19:18,141 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=148820.0, ans=0.125 2023-10-09 22:19:46,070 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.95 vs. limit=10.0 2023-10-09 22:19:47,687 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=148913.33333333334, ans=0.1 2023-10-09 22:19:55,881 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.908e+02 2.138e+02 2.461e+02 3.826e+02, threshold=4.275e+02, percent-clipped=0.0 2023-10-09 22:20:21,857 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.63 vs. limit=15.0 2023-10-09 22:20:29,908 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=149100.0, ans=0.2 2023-10-09 22:20:40,337 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=149146.66666666666, ans=0.125 2023-10-09 22:20:44,790 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=149146.66666666666, ans=0.1 2023-10-09 22:20:58,986 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.96 vs. limit=15.0 2023-10-09 22:21:15,911 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=149286.66666666666, ans=0.125 2023-10-09 22:21:17,750 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=149286.66666666666, ans=0.0 2023-10-09 22:21:20,114 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-32000.pt 2023-10-09 22:21:33,845 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=149380.0, ans=0.125 2023-10-09 22:21:42,681 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.49 vs. 
limit=22.5 2023-10-09 22:21:48,765 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=149426.66666666666, ans=0.125 2023-10-09 22:21:49,886 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.586e+02 1.900e+02 2.146e+02 2.556e+02 4.640e+02, threshold=4.292e+02, percent-clipped=1.0 2023-10-09 22:21:59,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=149473.33333333334, ans=0.0 2023-10-09 22:22:03,860 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=149473.33333333334, ans=0.0 2023-10-09 22:22:11,483 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=149520.0, ans=0.125 2023-10-09 22:22:26,664 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.14 vs. limit=22.5 2023-10-09 22:22:28,875 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.41 vs. limit=15.0 2023-10-09 22:22:37,472 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=149613.33333333334, ans=0.0 2023-10-09 22:23:00,416 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=149753.33333333334, ans=0.025 2023-10-09 22:23:08,483 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=149753.33333333334, ans=0.2 2023-10-09 22:23:14,007 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=149800.0, ans=0.125 2023-10-09 22:23:24,700 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 22:23:31,821 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=149846.66666666666, ans=0.0 2023-10-09 22:23:39,677 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=149893.33333333334, ans=0.09899494936611666 2023-10-09 22:23:41,975 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.585e+02 1.878e+02 2.109e+02 2.447e+02 3.980e+02, threshold=4.218e+02, percent-clipped=0.0 2023-10-09 22:23:49,214 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=149940.0, ans=0.0 2023-10-09 22:23:58,393 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=149940.0, ans=0.125 2023-10-09 22:24:13,502 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=150033.33333333334, ans=0.2 2023-10-09 22:24:13,556 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=150033.33333333334, ans=0.07 2023-10-09 22:24:20,025 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.46 vs. 
limit=6.0 2023-10-09 22:24:33,525 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=150080.0, ans=0.025 2023-10-09 22:24:39,836 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.78 vs. limit=15.0 2023-10-09 22:25:07,135 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.68 vs. limit=6.0 2023-10-09 22:25:14,757 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=150266.66666666666, ans=0.125 2023-10-09 22:25:28,676 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=150313.33333333334, ans=0.2 2023-10-09 22:25:33,202 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=150360.0, ans=0.2 2023-10-09 22:25:38,115 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.754e+02 1.888e+02 2.100e+02 3.041e+02, threshold=3.775e+02, percent-clipped=0.0 2023-10-09 22:25:50,707 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=150406.66666666666, ans=0.0 2023-10-09 22:26:21,738 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=150500.0, ans=0.1 2023-10-09 22:26:36,248 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=150593.33333333334, ans=0.125 2023-10-09 22:26:40,230 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=150593.33333333334, ans=0.1 2023-10-09 22:26:41,754 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=150593.33333333334, ans=0.2 2023-10-09 22:26:43,669 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=150593.33333333334, ans=0.0 2023-10-09 22:26:54,070 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=24.55 vs. limit=22.5 2023-10-09 22:27:03,161 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=150686.66666666666, ans=0.125 2023-10-09 22:27:22,669 INFO [train.py:1031] (0/4) Epoch 3, batch 5000, loss[loss=0.2752, simple_loss=0.3497, pruned_loss=0.1004, over 16852.00 frames. ], tot_loss[loss=0.269, simple_loss=0.3424, pruned_loss=0.09777, over 30124566.06 frames. ], batch size: 175, lr: 1.43e-02, grad_scale: 32.0 2023-10-09 22:27:26,560 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.25 vs. 
limit=15.0 2023-10-09 22:27:37,484 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 22:27:39,461 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.635e+02 1.945e+02 2.133e+02 2.539e+02 3.792e+02, threshold=4.266e+02, percent-clipped=1.0 2023-10-09 22:27:39,996 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=150826.66666666666, ans=0.05 2023-10-09 22:27:52,525 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.20 vs. limit=15.0 2023-10-09 22:28:08,065 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=150920.0, ans=0.125 2023-10-09 22:28:09,241 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=150966.66666666666, ans=0.125 2023-10-09 22:28:20,271 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=151013.33333333334, ans=0.1 2023-10-09 22:28:24,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=151013.33333333334, ans=0.125 2023-10-09 22:28:25,237 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=151013.33333333334, ans=0.0 2023-10-09 22:28:40,749 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=151060.0, ans=0.1 2023-10-09 22:28:52,816 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=151106.66666666666, ans=0.125 2023-10-09 22:28:52,855 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=151106.66666666666, ans=0.125 2023-10-09 22:28:59,629 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=151153.33333333334, ans=0.0 2023-10-09 22:29:02,398 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=151153.33333333334, ans=0.0 2023-10-09 22:29:11,344 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.50 vs. limit=15.0 2023-10-09 22:29:26,682 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.63 vs. limit=15.0 2023-10-09 22:29:27,329 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=151246.66666666666, ans=0.125 2023-10-09 22:29:30,022 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.66 vs. 
limit=22.5 2023-10-09 22:29:37,525 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.851e+02 2.076e+02 2.455e+02 3.818e+02, threshold=4.153e+02, percent-clipped=0.0 2023-10-09 22:29:46,834 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=151340.0, ans=0.0 2023-10-09 22:29:49,349 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=151340.0, ans=0.125 2023-10-09 22:30:06,521 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=151433.33333333334, ans=0.1 2023-10-09 22:30:07,430 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=151433.33333333334, ans=0.1 2023-10-09 22:30:08,294 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=151433.33333333334, ans=0.0 2023-10-09 22:30:15,719 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2023-10-09 22:30:27,154 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.41 vs. limit=22.5 2023-10-09 22:30:36,491 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=151526.66666666666, ans=0.07 2023-10-09 22:30:45,311 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=151573.33333333334, ans=0.0 2023-10-09 22:30:49,038 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=151620.0, ans=0.125 2023-10-09 22:30:54,157 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=151620.0, ans=0.0 2023-10-09 22:30:56,442 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=151620.0, ans=0.2 2023-10-09 22:31:20,021 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.10 vs. limit=10.0 2023-10-09 22:31:20,068 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.86 vs. 
limit=15.0 2023-10-09 22:31:23,985 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.510e+02 1.879e+02 2.163e+02 2.658e+02 3.647e+02, threshold=4.327e+02, percent-clipped=0.0 2023-10-09 22:31:44,307 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=151853.33333333334, ans=0.125 2023-10-09 22:32:20,287 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=151993.33333333334, ans=0.125 2023-10-09 22:32:23,432 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=151993.33333333334, ans=0.0 2023-10-09 22:32:26,646 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=151993.33333333334, ans=0.0 2023-10-09 22:32:29,236 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=152040.0, ans=0.05 2023-10-09 22:32:53,308 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=152133.33333333334, ans=0.0 2023-10-09 22:32:59,231 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=152133.33333333334, ans=0.1 2023-10-09 22:33:17,308 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.462e+02 1.879e+02 2.023e+02 2.336e+02 3.757e+02, threshold=4.045e+02, percent-clipped=0.0 2023-10-09 22:33:44,300 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=152320.0, ans=0.125 2023-10-09 22:33:56,410 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=152366.66666666666, ans=0.125 2023-10-09 22:33:57,233 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=152366.66666666666, ans=0.2 2023-10-09 22:34:27,830 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=152506.66666666666, ans=0.07 2023-10-09 22:34:30,754 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=152506.66666666666, ans=0.125 2023-10-09 22:34:34,341 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=4.58 vs. 
limit=12.0 2023-10-09 22:34:37,609 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=152553.33333333334, ans=0.125 2023-10-09 22:34:42,179 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=152600.0, ans=0.0 2023-10-09 22:34:43,397 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=152600.0, ans=0.025 2023-10-09 22:34:49,296 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=152600.0, ans=0.1 2023-10-09 22:34:49,405 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 22:34:50,290 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=152600.0, ans=0.125 2023-10-09 22:34:55,108 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=152646.66666666666, ans=0.0 2023-10-09 22:34:57,209 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=152646.66666666666, ans=0.125 2023-10-09 22:34:59,874 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.14 vs. limit=12.0 2023-10-09 22:35:02,863 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.72 vs. limit=12.0 2023-10-09 22:35:04,269 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=152693.33333333334, ans=0.2 2023-10-09 22:35:07,713 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.846e+02 2.041e+02 2.254e+02 3.753e+02, threshold=4.081e+02, percent-clipped=0.0 2023-10-09 22:35:14,239 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=152740.0, ans=0.025 2023-10-09 22:35:15,276 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=152740.0, ans=0.125 2023-10-09 22:35:46,635 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=152880.0, ans=0.125 2023-10-09 22:35:47,916 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=152880.0, ans=0.0 2023-10-09 22:36:18,126 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=152973.33333333334, ans=0.1 2023-10-09 22:36:21,895 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=153020.0, ans=0.0 2023-10-09 22:36:29,461 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=153066.66666666666, ans=0.125 2023-10-09 22:36:30,631 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1.whitening_limit, batch_count=153066.66666666666, ans=10.0 2023-10-09 22:36:35,178 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=153066.66666666666, ans=0.04949747468305833 2023-10-09 22:36:40,716 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=16.50 vs. limit=15.0 2023-10-09 22:36:41,132 INFO [train.py:1031] (0/4) Epoch 3, batch 5500, loss[loss=0.2713, simple_loss=0.3422, pruned_loss=0.1002, over 15518.00 frames. ], tot_loss[loss=0.2685, simple_loss=0.342, pruned_loss=0.09748, over 30701569.60 frames. ], batch size: 35, lr: 1.42e-02, grad_scale: 32.0 2023-10-09 22:36:47,835 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=153113.33333333334, ans=0.015 2023-10-09 22:36:55,347 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.816e+02 2.162e+02 2.499e+02 3.262e+02, threshold=4.324e+02, percent-clipped=0.0 2023-10-09 22:37:23,230 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.42 vs. limit=10.0 2023-10-09 22:37:50,645 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=153393.33333333334, ans=0.2 2023-10-09 22:37:52,595 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.15 vs. limit=15.0 2023-10-09 22:37:54,072 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=153393.33333333334, ans=0.125 2023-10-09 22:37:55,976 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=153440.0, ans=0.2 2023-10-09 22:37:57,871 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.59 vs. limit=15.0 2023-10-09 22:38:10,249 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=153486.66666666666, ans=0.2 2023-10-09 22:38:13,937 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.63 vs. limit=15.0 2023-10-09 22:38:42,553 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.575e+02 1.855e+02 2.043e+02 2.293e+02 3.025e+02, threshold=4.086e+02, percent-clipped=0.0 2023-10-09 22:38:43,150 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=153626.66666666666, ans=0.125 2023-10-09 22:38:43,459 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.54 vs. 
limit=15.0 2023-10-09 22:38:46,691 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=153626.66666666666, ans=0.125 2023-10-09 22:39:17,637 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=153766.66666666666, ans=0.0 2023-10-09 22:39:26,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=153813.33333333334, ans=0.125 2023-10-09 22:39:28,590 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=153813.33333333334, ans=0.05 2023-10-09 22:39:32,961 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=153860.0, ans=0.2 2023-10-09 22:39:32,971 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=153860.0, ans=0.0 2023-10-09 22:39:33,751 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=153860.0, ans=0.2 2023-10-09 22:39:41,558 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=153860.0, ans=0.2 2023-10-09 22:39:52,647 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=153906.66666666666, ans=0.125 2023-10-09 22:39:52,781 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=153906.66666666666, ans=0.2 2023-10-09 22:40:01,050 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=153953.33333333334, ans=0.125 2023-10-09 22:40:10,575 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=154000.0, ans=0.1 2023-10-09 22:40:18,136 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=154000.0, ans=0.125 2023-10-09 22:40:28,086 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.36 vs. 
limit=15.0 2023-10-09 22:40:28,696 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=154046.66666666666, ans=0.0 2023-10-09 22:40:37,260 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 1.841e+02 2.148e+02 2.341e+02 3.683e+02, threshold=4.296e+02, percent-clipped=0.0 2023-10-09 22:40:54,470 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-09 22:41:01,294 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=154186.66666666666, ans=0.125 2023-10-09 22:41:05,998 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 22:41:29,867 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=154326.66666666666, ans=0.07 2023-10-09 22:41:56,345 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=154420.0, ans=0.1 2023-10-09 22:42:00,571 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=154420.0, ans=0.2 2023-10-09 22:42:31,743 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=154560.0, ans=0.125 2023-10-09 22:42:31,833 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=154560.0, ans=0.2 2023-10-09 22:42:32,459 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.421e+02 1.846e+02 2.006e+02 2.394e+02 3.432e+02, threshold=4.012e+02, percent-clipped=0.0 2023-10-09 22:42:33,842 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=154560.0, ans=0.2 2023-10-09 22:42:52,351 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=154653.33333333334, ans=0.1 2023-10-09 22:43:06,461 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.09 vs. limit=22.5 2023-10-09 22:43:14,314 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=154746.66666666666, ans=0.125 2023-10-09 22:43:14,351 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=154746.66666666666, ans=0.09899494936611666 2023-10-09 22:43:27,010 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten.whitening_limit, batch_count=154793.33333333334, ans=15.0 2023-10-09 22:43:29,889 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.68 vs. limit=6.0 2023-10-09 22:43:30,744 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.94 vs. 
limit=15.0 2023-10-09 22:43:37,778 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=154840.0, ans=0.125 2023-10-09 22:43:48,133 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.02 vs. limit=6.0 2023-10-09 22:44:09,131 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=154933.33333333334, ans=0.0 2023-10-09 22:44:26,375 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.748e+02 1.963e+02 2.215e+02 3.130e+02, threshold=3.927e+02, percent-clipped=0.0 2023-10-09 22:44:28,821 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=155026.66666666666, ans=0.0 2023-10-09 22:44:35,981 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=155073.33333333334, ans=0.1 2023-10-09 22:44:48,027 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.96 vs. limit=15.0 2023-10-09 22:44:48,565 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 22:44:50,865 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.80 vs. limit=12.0 2023-10-09 22:45:06,686 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=155166.66666666666, ans=0.125 2023-10-09 22:45:37,310 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=155306.66666666666, ans=0.0 2023-10-09 22:45:55,966 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=155400.0, ans=0.0 2023-10-09 22:46:04,093 INFO [train.py:1031] (0/4) Epoch 3, batch 6000, loss[loss=0.2663, simple_loss=0.3455, pruned_loss=0.09355, over 16989.00 frames. ], tot_loss[loss=0.2677, simple_loss=0.3414, pruned_loss=0.09695, over 31167152.47 frames. 
], batch size: 123, lr: 1.41e-02, grad_scale: 64.0 2023-10-09 22:46:09,620 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=155446.66666666666, ans=0.125 2023-10-09 22:46:12,007 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=155446.66666666666, ans=0.125 2023-10-09 22:46:20,087 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.539e+02 1.913e+02 2.142e+02 2.403e+02 3.382e+02, threshold=4.283e+02, percent-clipped=0.0 2023-10-09 22:46:22,879 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=155493.33333333334, ans=0.04949747468305833 2023-10-09 22:46:22,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=155493.33333333334, ans=0.125 2023-10-09 22:46:24,180 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=155493.33333333334, ans=0.025 2023-10-09 22:46:38,725 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=155586.66666666666, ans=0.125 2023-10-09 22:46:38,820 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=155586.66666666666, ans=0.1 2023-10-09 22:47:24,401 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=155773.33333333334, ans=0.125 2023-10-09 22:47:35,794 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=155820.0, ans=0.1 2023-10-09 22:48:05,541 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=155960.0, ans=10.0 2023-10-09 22:48:08,423 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.425e+02 1.821e+02 2.040e+02 2.331e+02 3.202e+02, threshold=4.080e+02, percent-clipped=0.0 2023-10-09 22:48:17,593 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=156006.66666666666, ans=0.125 2023-10-09 22:48:26,389 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=156053.33333333334, ans=0.125 2023-10-09 22:48:33,088 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.92 vs. limit=15.0 2023-10-09 22:48:54,716 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.48 vs. 
limit=15.0 2023-10-09 22:49:17,616 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=156240.0, ans=0.0 2023-10-09 22:49:20,957 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=156286.66666666666, ans=0.1 2023-10-09 22:49:33,424 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=156333.33333333334, ans=0.0 2023-10-09 22:49:37,088 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 22:49:52,564 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=156380.0, ans=0.125 2023-10-09 22:49:58,768 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.594e+02 1.870e+02 2.048e+02 2.337e+02 3.272e+02, threshold=4.097e+02, percent-clipped=0.0 2023-10-09 22:50:00,214 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=156426.66666666666, ans=0.0 2023-10-09 22:50:13,192 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=156473.33333333334, ans=0.1 2023-10-09 22:50:13,415 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=16.74 vs. limit=15.0 2023-10-09 22:50:20,736 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.92 vs. limit=15.0 2023-10-09 22:50:29,609 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=156566.66666666666, ans=0.125 2023-10-09 22:50:48,225 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=156660.0, ans=0.125 2023-10-09 22:50:50,951 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=156660.0, ans=0.125 2023-10-09 22:50:51,908 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=156660.0, ans=0.125 2023-10-09 22:50:59,238 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=156660.0, ans=0.125 2023-10-09 22:51:01,874 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=156706.66666666666, ans=0.125 2023-10-09 22:51:06,882 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=156706.66666666666, ans=0.0 2023-10-09 22:51:21,522 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=156753.33333333334, ans=0.125 2023-10-09 22:51:47,945 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=19.07 vs. 
limit=15.0 2023-10-09 22:51:53,633 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.832e+02 2.055e+02 2.263e+02 4.156e+02, threshold=4.110e+02, percent-clipped=1.0 2023-10-09 22:51:56,187 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.27 vs. limit=12.0 2023-10-09 22:51:56,739 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=156893.33333333334, ans=0.0 2023-10-09 22:52:16,620 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.31 vs. limit=15.0 2023-10-09 22:52:23,028 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.21 vs. limit=15.0 2023-10-09 22:52:46,442 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=157126.66666666666, ans=0.0 2023-10-09 22:52:52,378 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.67 vs. limit=15.0 2023-10-09 22:52:58,970 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=157126.66666666666, ans=0.02 2023-10-09 22:53:06,437 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=157173.33333333334, ans=0.125 2023-10-09 22:53:21,437 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=157220.0, ans=0.1 2023-10-09 22:53:28,652 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=157266.66666666666, ans=0.125 2023-10-09 22:53:39,007 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=157313.33333333334, ans=10.0 2023-10-09 22:53:39,098 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=157313.33333333334, ans=0.0 2023-10-09 22:53:39,117 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=157313.33333333334, ans=0.125 2023-10-09 22:53:54,913 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.241e+02 1.853e+02 2.024e+02 2.504e+02 3.451e+02, threshold=4.047e+02, percent-clipped=0.0 2023-10-09 22:53:56,743 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=157360.0, ans=0.125 2023-10-09 22:53:56,985 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.08 vs. limit=12.0 2023-10-09 22:54:22,363 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.35 vs. limit=15.0 2023-10-09 22:54:33,212 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=157500.0, ans=0.0 2023-10-09 22:54:37,468 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.38 vs. 
limit=15.0 2023-10-09 22:54:43,746 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=157546.66666666666, ans=0.125 2023-10-09 22:54:43,847 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=157546.66666666666, ans=0.2 2023-10-09 22:54:51,259 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=157593.33333333334, ans=0.125 2023-10-09 22:55:10,903 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=157686.66666666666, ans=0.125 2023-10-09 22:55:16,960 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=157686.66666666666, ans=0.1 2023-10-09 22:55:21,197 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=157733.33333333334, ans=0.1 2023-10-09 22:55:25,007 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.29 vs. limit=15.0 2023-10-09 22:55:32,395 INFO [train.py:1031] (0/4) Epoch 3, batch 6500, loss[loss=0.2741, simple_loss=0.3532, pruned_loss=0.09752, over 16848.00 frames. ], tot_loss[loss=0.2679, simple_loss=0.3417, pruned_loss=0.09704, over 31505222.93 frames. ], batch size: 155, lr: 1.40e-02, grad_scale: 32.0 2023-10-09 22:55:38,541 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=157780.0, ans=0.0 2023-10-09 22:55:39,905 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=157780.0, ans=10.0 2023-10-09 22:55:52,446 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=157826.66666666666, ans=0.125 2023-10-09 22:55:53,361 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.486e+02 1.814e+02 2.030e+02 2.227e+02 3.213e+02, threshold=4.061e+02, percent-clipped=0.0 2023-10-09 22:56:10,300 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=157873.33333333334, ans=0.125 2023-10-09 22:56:23,775 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=157920.0, ans=0.0 2023-10-09 22:56:44,016 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=158013.33333333334, ans=0.125 2023-10-09 22:57:03,993 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=158106.66666666666, ans=0.125 2023-10-09 22:57:31,122 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=158200.0, ans=0.125 2023-10-09 22:57:33,196 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=158200.0, ans=0.0 2023-10-09 22:57:40,321 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=158246.66666666666, ans=0.125 2023-10-09 22:57:45,805 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=158246.66666666666, ans=0.125 2023-10-09 22:57:56,785 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=158293.33333333334, ans=0.125 2023-10-09 22:57:57,405 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 1.791e+02 2.039e+02 2.436e+02 4.084e+02, threshold=4.078e+02, percent-clipped=1.0 2023-10-09 22:57:59,534 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=158293.33333333334, ans=0.125 2023-10-09 22:58:35,976 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=158480.0, ans=0.125 2023-10-09 22:58:38,165 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.85 vs. limit=6.0 2023-10-09 22:58:38,890 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=158480.0, ans=0.0 2023-10-09 22:58:54,964 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=158573.33333333334, ans=0.125 2023-10-09 22:59:21,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=158666.66666666666, ans=0.2 2023-10-09 22:59:26,494 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=158713.33333333334, ans=0.0 2023-10-09 22:59:31,868 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=158713.33333333334, ans=0.0 2023-10-09 22:59:43,061 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.507e+02 1.877e+02 2.135e+02 2.495e+02 3.687e+02, threshold=4.269e+02, percent-clipped=0.0 2023-10-09 23:00:00,897 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.77 vs. limit=15.0 2023-10-09 23:00:13,210 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=158900.0, ans=0.125 2023-10-09 23:00:41,350 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=158993.33333333334, ans=0.0 2023-10-09 23:00:47,299 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.87 vs. limit=22.5 2023-10-09 23:00:53,342 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.21 vs. 
limit=15.0 2023-10-09 23:00:59,393 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=159086.66666666666, ans=0.0 2023-10-09 23:01:02,124 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=159086.66666666666, ans=0.0 2023-10-09 23:01:07,321 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=159086.66666666666, ans=0.125 2023-10-09 23:01:19,896 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=159133.33333333334, ans=0.2 2023-10-09 23:01:50,664 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.831e+02 2.052e+02 2.535e+02 4.887e+02, threshold=4.104e+02, percent-clipped=4.0 2023-10-09 23:01:56,218 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=159273.33333333334, ans=0.2 2023-10-09 23:02:00,719 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=159273.33333333334, ans=0.09899494936611666 2023-10-09 23:02:09,256 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.59 vs. limit=15.0 2023-10-09 23:02:22,107 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.83 vs. limit=10.0 2023-10-09 23:02:43,612 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=159460.0, ans=0.125 2023-10-09 23:02:44,621 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=159460.0, ans=0.1 2023-10-09 23:02:47,808 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=159460.0, ans=0.2 2023-10-09 23:02:51,078 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=159460.0, ans=0.125 2023-10-09 23:02:51,368 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten.whitening_limit, batch_count=159460.0, ans=22.5 2023-10-09 23:03:13,399 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.29 vs. limit=15.0 2023-10-09 23:03:14,546 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=159553.33333333334, ans=0.2 2023-10-09 23:03:21,919 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.54 vs. limit=6.0 2023-10-09 23:03:28,010 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.13 vs. 
limit=22.5 2023-10-09 23:03:35,550 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 23:03:42,108 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=159693.33333333334, ans=0.125 2023-10-09 23:03:44,308 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.730e+02 1.923e+02 2.193e+02 3.245e+02, threshold=3.846e+02, percent-clipped=0.0 2023-10-09 23:03:48,330 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=159693.33333333334, ans=0.0 2023-10-09 23:03:49,287 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=159740.0, ans=0.125 2023-10-09 23:04:02,517 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=159786.66666666666, ans=0.125 2023-10-09 23:04:05,396 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=159786.66666666666, ans=0.1 2023-10-09 23:04:11,754 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.39 vs. limit=15.0 2023-10-09 23:04:23,941 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=159880.0, ans=0.125 2023-10-09 23:04:25,930 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.27 vs. limit=15.0 2023-10-09 23:04:36,709 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.23 vs. limit=15.0 2023-10-09 23:04:39,759 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=159926.66666666666, ans=0.0 2023-10-09 23:05:00,773 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=160020.0, ans=0.125 2023-10-09 23:05:09,630 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=160066.66666666666, ans=0.1 2023-10-09 23:05:16,634 INFO [train.py:1031] (0/4) Epoch 3, batch 7000, loss[loss=0.2583, simple_loss=0.3349, pruned_loss=0.09084, over 16417.00 frames. ], tot_loss[loss=0.2672, simple_loss=0.3415, pruned_loss=0.0964, over 31788917.90 frames. ], batch size: 50, lr: 1.39e-02, grad_scale: 32.0 2023-10-09 23:05:18,694 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=160113.33333333334, ans=0.125 2023-10-09 23:05:36,564 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.543e+02 1.985e+02 2.202e+02 2.515e+02 3.660e+02, threshold=4.404e+02, percent-clipped=0.0 2023-10-09 23:05:42,288 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=160206.66666666666, ans=0.125 2023-10-09 23:05:43,943 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.54 vs. 
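The optim.py:471 records above report, for a recent window of batches, the min/25%/50%/75%/max of the gradient norm ("grad-norm quartiles"), a clipping threshold, and the percentage of batches clipped. In every record in this log the threshold equals Clipping_scale (2.0) times the median quartile (e.g. 2.0 x 2.055e+02 = 4.110e+02). A minimal sketch of that bookkeeping, assuming a simple sliding window of norms and the threshold rule read off from the log (this is an illustration, not icefall's actual ScaledAdam code):

    import torch

    def clipping_stats(grad_norms, clipping_scale=2.0):
        # grad_norms: 1-D float tensor of per-batch gradient norms from a recent window.
        q = torch.quantile(grad_norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = clipping_scale * q[2]  # threshold = scale * median, as seen in the log
        percent_clipped = 100.0 * (grad_norms > threshold).float().mean()
        return q, threshold, percent_clipped

    # Synthetic window echoing the first record above (quartiles 1.509e+02 ... 4.156e+02):
    norms = torch.tensor([150.9, 183.2, 205.5, 226.3, 415.6])
    q, thr, pct = clipping_stats(norms)
    print(q.tolist(), thr.item(), pct.item())  # thr = 2.0 * 205.5 = 411.0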
limit=15.0 2023-10-09 23:05:48,019 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.96 vs. limit=10.0 2023-10-09 23:05:54,790 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=160253.33333333334, ans=0.125 2023-10-09 23:05:59,153 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=160253.33333333334, ans=0.1 2023-10-09 23:06:06,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=160300.0, ans=0.125 2023-10-09 23:06:18,874 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.32 vs. limit=15.0 2023-10-09 23:06:20,409 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=160346.66666666666, ans=0.125 2023-10-09 23:06:39,205 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=160440.0, ans=0.125 2023-10-09 23:06:52,998 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=160486.66666666666, ans=0.125 2023-10-09 23:07:04,405 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=160533.33333333334, ans=0.125 2023-10-09 23:07:16,250 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=160580.0, ans=0.07 2023-10-09 23:07:25,177 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 1.893e+02 2.182e+02 2.571e+02 3.595e+02, threshold=4.365e+02, percent-clipped=0.0 2023-10-09 23:07:29,651 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.40 vs. 
limit=15.0 2023-10-09 23:07:31,050 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=160673.33333333334, ans=0.0 2023-10-09 23:07:53,206 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=160720.0, ans=0.0 2023-10-09 23:08:01,787 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=160766.66666666666, ans=0.1 2023-10-09 23:08:07,884 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=160813.33333333334, ans=0.0 2023-10-09 23:08:09,776 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=160813.33333333334, ans=0.0 2023-10-09 23:08:10,774 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=160813.33333333334, ans=0.125 2023-10-09 23:08:49,652 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=161000.0, ans=0.125 2023-10-09 23:08:50,550 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=161000.0, ans=0.125 2023-10-09 23:08:58,736 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0 2023-10-09 23:09:11,444 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=161046.66666666666, ans=0.2 2023-10-09 23:09:11,589 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=161046.66666666666, ans=0.125 2023-10-09 23:09:26,023 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.501e+02 1.832e+02 2.093e+02 2.381e+02 3.465e+02, threshold=4.186e+02, percent-clipped=0.0 2023-10-09 23:09:46,272 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=161186.66666666666, ans=0.125 2023-10-09 23:10:25,572 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=161326.66666666666, ans=0.035 2023-10-09 23:10:28,282 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=161326.66666666666, ans=0.125 2023-10-09 23:10:43,331 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=161373.33333333334, ans=0.0 2023-10-09 23:10:48,060 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=161420.0, ans=0.0 2023-10-09 23:10:50,551 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.32 vs. 
limit=15.0 2023-10-09 23:10:54,946 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=161420.0, ans=0.0 2023-10-09 23:11:15,056 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=161513.33333333334, ans=0.125 2023-10-09 23:11:26,496 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.550e+02 1.894e+02 2.137e+02 2.627e+02 4.683e+02, threshold=4.275e+02, percent-clipped=2.0 2023-10-09 23:11:27,629 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=161560.0, ans=0.125 2023-10-09 23:11:34,575 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.05 vs. limit=15.0 2023-10-09 23:11:44,824 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.62 vs. limit=15.0 2023-10-09 23:11:49,065 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=161653.33333333334, ans=0.0 2023-10-09 23:12:05,540 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=161700.0, ans=0.125 2023-10-09 23:12:18,431 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=161793.33333333334, ans=0.125 2023-10-09 23:12:24,769 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=161793.33333333334, ans=0.125 2023-10-09 23:12:28,274 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=161840.0, ans=0.125 2023-10-09 23:12:58,185 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=161933.33333333334, ans=0.09899494936611666 2023-10-09 23:13:16,494 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.353e+02 1.873e+02 2.144e+02 2.446e+02 4.057e+02, threshold=4.289e+02, percent-clipped=0.0 2023-10-09 23:13:25,389 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.80 vs. 
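The scaling.py:199 ScheduledFloat records each report a module parameter whose current value ("ans") is a function of batch_count; in the Zipformer recipe these are typically piecewise-linear schedules over batch count. A minimal sketch of such a schedule, with illustrative breakpoints (the per-parameter breakpoints are not shown in the log, so the numbers below are assumptions):

    def scheduled_float(batch_count, points):
        # points: sorted list of (batch_count, value) breakpoints; linear in between,
        # clamped to the end values outside the covered range.
        if batch_count <= points[0][0]:
            return points[0][1]
        if batch_count >= points[-1][0]:
            return points[-1][1]
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            if x0 <= batch_count <= x1:
                return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)

    # Example: a skip rate decaying from 0.2 to 0.0 over the first 4000 batches
    # would report ans=0.0 at the batch counts (~160000) seen above.
    print(scheduled_float(161420.0, [(0.0, 0.2), (4000.0, 0.0)]))  # -> 0.0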
limit=15.0 2023-10-09 23:13:33,846 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 23:13:59,672 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=162213.33333333334, ans=0.0 2023-10-09 23:14:15,939 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=162306.66666666666, ans=0.0 2023-10-09 23:14:20,134 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=162306.66666666666, ans=0.1 2023-10-09 23:14:31,420 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=162353.33333333334, ans=0.125 2023-10-09 23:14:31,521 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=162353.33333333334, ans=0.2 2023-10-09 23:14:32,425 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=162353.33333333334, ans=0.125 2023-10-09 23:14:32,741 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.08 vs. limit=15.0 2023-10-09 23:14:35,949 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=162400.0, ans=0.125 2023-10-09 23:14:47,197 INFO [train.py:1031] (0/4) Epoch 3, batch 7500, loss[loss=0.2858, simple_loss=0.356, pruned_loss=0.1078, over 16905.00 frames. ], tot_loss[loss=0.2667, simple_loss=0.341, pruned_loss=0.09624, over 31987119.91 frames. ], batch size: 110, lr: 1.38e-02, grad_scale: 32.0 2023-10-09 23:15:02,236 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=162493.33333333334, ans=0.125 2023-10-09 23:15:03,900 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.535e+02 1.948e+02 2.180e+02 2.537e+02 3.314e+02, threshold=4.359e+02, percent-clipped=0.0 2023-10-09 23:15:08,536 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 23:15:17,492 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=162540.0, ans=0.0 2023-10-09 23:15:28,213 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=162586.66666666666, ans=0.0 2023-10-09 23:15:35,890 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=162633.33333333334, ans=0.2 2023-10-09 23:15:38,969 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=162633.33333333334, ans=0.0 2023-10-09 23:16:26,429 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=162820.0, ans=0.0 2023-10-09 23:16:32,353 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=162866.66666666666, ans=0.0 2023-10-09 23:16:43,839 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=162913.33333333334, ans=0.125 2023-10-09 23:16:50,972 INFO [scaling.py:199] (0/4) 
ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=162960.0, ans=15.0 2023-10-09 23:16:55,175 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.819e+02 2.026e+02 2.244e+02 2.971e+02, threshold=4.053e+02, percent-clipped=0.0 2023-10-09 23:16:55,602 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=162960.0, ans=0.0 2023-10-09 23:16:59,962 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=163006.66666666666, ans=0.125 2023-10-09 23:17:07,104 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=163006.66666666666, ans=0.0 2023-10-09 23:17:12,398 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=163006.66666666666, ans=0.025 2023-10-09 23:17:45,249 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=163100.0, ans=0.0 2023-10-09 23:17:50,138 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=163146.66666666666, ans=0.125 2023-10-09 23:18:02,691 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.84 vs. limit=10.0 2023-10-09 23:18:28,469 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=163286.66666666666, ans=0.0 2023-10-09 23:18:29,434 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=163286.66666666666, ans=0.1 2023-10-09 23:18:40,292 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.94 vs. limit=6.0 2023-10-09 23:18:59,521 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.831e+02 2.034e+02 2.380e+02 3.465e+02, threshold=4.068e+02, percent-clipped=0.0 2023-10-09 23:19:05,578 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=163473.33333333334, ans=0.125 2023-10-09 23:19:05,774 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=163473.33333333334, ans=0.125 2023-10-09 23:19:08,385 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=163473.33333333334, ans=0.125 2023-10-09 23:19:18,353 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=163520.0, ans=0.0 2023-10-09 23:19:34,918 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=163566.66666666666, ans=0.125 2023-10-09 23:20:12,594 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.75 vs. 
limit=10.0 2023-10-09 23:20:41,003 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=163893.33333333334, ans=0.125 2023-10-09 23:20:45,523 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=163893.33333333334, ans=0.1 2023-10-09 23:20:46,614 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=163893.33333333334, ans=0.2 2023-10-09 23:20:47,192 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.883e+02 2.113e+02 2.504e+02 3.350e+02, threshold=4.226e+02, percent-clipped=0.0 2023-10-09 23:20:56,988 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=163940.0, ans=0.0 2023-10-09 23:21:02,566 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=163940.0, ans=0.125 2023-10-09 23:21:08,497 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=163986.66666666666, ans=0.0 2023-10-09 23:21:20,024 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.18 vs. limit=22.5 2023-10-09 23:21:22,770 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff3.min_abs, batch_count=164033.33333333334, ans=0.2 2023-10-09 23:21:22,869 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=164033.33333333334, ans=0.125 2023-10-09 23:21:28,942 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.59 vs. limit=22.5 2023-10-09 23:21:41,680 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=164080.0, ans=0.125 2023-10-09 23:21:51,236 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=164126.66666666666, ans=0.125 2023-10-09 23:22:05,555 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=164173.33333333334, ans=0.125 2023-10-09 23:22:29,485 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=164313.33333333334, ans=0.0 2023-10-09 23:22:33,322 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=164313.33333333334, ans=0.125 2023-10-09 23:22:39,236 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.89 vs. 
limit=15.0 2023-10-09 23:22:45,999 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=164360.0, ans=0.125 2023-10-09 23:22:46,523 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.816e+02 2.018e+02 2.338e+02 3.283e+02, threshold=4.036e+02, percent-clipped=0.0 2023-10-09 23:23:05,563 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=164453.33333333334, ans=0.2 2023-10-09 23:23:29,657 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=164546.66666666666, ans=0.125 2023-10-09 23:23:38,818 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.18 vs. limit=15.0 2023-10-09 23:23:40,031 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=164593.33333333334, ans=0.125 2023-10-09 23:23:57,481 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=164640.0, ans=0.125 2023-10-09 23:24:09,642 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.03 vs. limit=22.5 2023-10-09 23:24:14,749 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=164733.33333333334, ans=0.2 2023-10-09 23:24:15,727 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=164733.33333333334, ans=0.0 2023-10-09 23:24:19,114 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=164733.33333333334, ans=0.125 2023-10-09 23:24:24,741 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.72 vs. limit=6.0 2023-10-09 23:24:25,172 INFO [train.py:1031] (0/4) Epoch 3, batch 8000, loss[loss=0.2667, simple_loss=0.3468, pruned_loss=0.09333, over 16938.00 frames. ], tot_loss[loss=0.265, simple_loss=0.3398, pruned_loss=0.09512, over 32183536.66 frames. ], batch size: 138, lr: 1.37e-02, grad_scale: 32.0 2023-10-09 23:24:27,967 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=164780.0, ans=0.125 2023-10-09 23:24:38,950 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.20 vs. limit=12.0 2023-10-09 23:24:40,936 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.67 vs. 
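The train.py:1031 summaries report the current batch's loss over its own frame count alongside tot_loss over a large cumulative frame count. One plausible way to maintain such a frame-weighted running average is with decayed sums of loss*frames and frames (a sketch under an assumed per-batch decay factor; the recipe's exact bookkeeping may differ):

    def update_running_loss(state, batch_loss, batch_frames, decay=0.999):
        # state: decayed sums of (loss * frames) and of frames.
        state["loss_sum"] = decay * state["loss_sum"] + batch_loss * batch_frames
        state["frames"] = decay * state["frames"] + batch_frames
        return state["loss_sum"] / state["frames"]  # the reported tot_loss

    state = {"loss_sum": 0.0, "frames": 0.0}
    # Per-batch (loss, frames) pairs taken from the summaries in this log:
    for loss, frames in [(0.2741, 16848.0), (0.2583, 16417.0), (0.2667, 16938.0)]:
        tot = update_running_loss(state, loss, frames)
    print(tot, state["frames"])  # analogous to "tot_loss[...] over N frames"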
limit=15.0 2023-10-09 23:24:42,312 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.347e+02 1.917e+02 2.164e+02 2.596e+02 4.440e+02, threshold=4.328e+02, percent-clipped=5.0 2023-10-09 23:24:55,050 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=164873.33333333334, ans=0.0 2023-10-09 23:25:10,040 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=164966.66666666666, ans=0.125 2023-10-09 23:25:23,575 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=165013.33333333334, ans=0.0 2023-10-09 23:25:25,683 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.88 vs. limit=15.0 2023-10-09 23:25:33,086 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.24 vs. limit=15.0 2023-10-09 23:25:39,732 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=165060.0, ans=0.125 2023-10-09 23:26:02,384 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=165200.0, ans=0.0 2023-10-09 23:26:06,473 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=165200.0, ans=0.0 2023-10-09 23:26:09,005 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=165200.0, ans=0.1 2023-10-09 23:26:13,713 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=165246.66666666666, ans=0.0 2023-10-09 23:26:25,592 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.77 vs. limit=12.0 2023-10-09 23:26:27,393 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=165293.33333333334, ans=0.125 2023-10-09 23:26:27,944 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.827e+02 2.038e+02 2.280e+02 3.156e+02, threshold=4.076e+02, percent-clipped=0.0 2023-10-09 23:26:37,379 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=165340.0, ans=0.2 2023-10-09 23:26:55,558 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.74 vs. 
limit=22.5 2023-10-09 23:27:34,327 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=165526.66666666666, ans=0.125 2023-10-09 23:27:34,510 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=165526.66666666666, ans=0.125 2023-10-09 23:27:39,057 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=165526.66666666666, ans=0.2 2023-10-09 23:28:18,517 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=165713.33333333334, ans=0.125 2023-10-09 23:28:37,977 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.552e+02 1.853e+02 2.158e+02 2.601e+02 3.634e+02, threshold=4.317e+02, percent-clipped=0.0 2023-10-09 23:28:39,946 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=165760.0, ans=0.0 2023-10-09 23:28:44,254 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.97 vs. limit=15.0 2023-10-09 23:28:46,420 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=165806.66666666666, ans=0.125 2023-10-09 23:28:51,883 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=165806.66666666666, ans=0.2 2023-10-09 23:28:59,072 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=165853.33333333334, ans=0.125 2023-10-09 23:29:16,262 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=165900.0, ans=0.125 2023-10-09 23:29:34,362 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=165993.33333333334, ans=0.1 2023-10-09 23:29:39,979 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=166040.0, ans=0.125 2023-10-09 23:29:46,989 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=166040.0, ans=0.1 2023-10-09 23:29:52,260 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=166086.66666666666, ans=0.125 2023-10-09 23:29:58,548 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=166086.66666666666, ans=0.0 2023-10-09 23:30:06,268 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=166133.33333333334, ans=10.0 2023-10-09 23:30:10,006 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=166133.33333333334, ans=0.1 2023-10-09 23:30:13,982 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=166133.33333333334, ans=0.1 2023-10-09 23:30:17,962 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.86 vs. 
limit=15.0 2023-10-09 23:30:25,274 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=166180.0, ans=0.0 2023-10-09 23:30:32,979 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=166226.66666666666, ans=0.125 2023-10-09 23:30:34,567 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.811e+02 2.053e+02 2.369e+02 4.268e+02, threshold=4.106e+02, percent-clipped=0.0 2023-10-09 23:31:00,181 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=166320.0, ans=0.2 2023-10-09 23:31:03,582 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=166366.66666666666, ans=0.1 2023-10-09 23:31:07,426 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=166366.66666666666, ans=0.125 2023-10-09 23:31:33,972 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=15.0 2023-10-09 23:31:38,824 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=166506.66666666666, ans=0.0 2023-10-09 23:31:43,176 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.max_abs, batch_count=166506.66666666666, ans=10.0 2023-10-09 23:31:56,647 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=166553.33333333334, ans=0.0 2023-10-09 23:31:59,114 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=166600.0, ans=0.125 2023-10-09 23:32:01,176 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.06 vs. limit=22.5 2023-10-09 23:32:03,710 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=166600.0, ans=0.1 2023-10-09 23:32:04,953 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.60 vs. 
limit=6.0 2023-10-09 23:32:08,695 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=166600.0, ans=0.125 2023-10-09 23:32:23,217 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=166693.33333333334, ans=0.1 2023-10-09 23:32:29,119 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.507e+02 1.821e+02 2.066e+02 2.325e+02 3.735e+02, threshold=4.133e+02, percent-clipped=0.0 2023-10-09 23:32:31,781 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=166693.33333333334, ans=0.125 2023-10-09 23:32:39,251 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-09 23:32:42,481 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=166740.0, ans=0.1 2023-10-09 23:32:49,592 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=166786.66666666666, ans=0.1 2023-10-09 23:33:04,274 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=166833.33333333334, ans=0.125 2023-10-09 23:33:13,062 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=166880.0, ans=22.5 2023-10-09 23:33:16,997 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=166880.0, ans=0.125 2023-10-09 23:33:18,254 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.68 vs. limit=15.0 2023-10-09 23:33:26,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=166926.66666666666, ans=0.2 2023-10-09 23:33:43,150 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=167020.0, ans=0.1 2023-10-09 23:33:56,682 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=167066.66666666666, ans=0.2 2023-10-09 23:34:09,657 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.64 vs. limit=15.0 2023-10-09 23:34:10,977 INFO [train.py:1031] (0/4) Epoch 3, batch 8500, loss[loss=0.3175, simple_loss=0.3756, pruned_loss=0.1297, over 16659.00 frames. ], tot_loss[loss=0.2643, simple_loss=0.3395, pruned_loss=0.09456, over 32339057.54 frames. ], batch size: 202, lr: 1.36e-02, grad_scale: 64.0 2023-10-09 23:34:24,800 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.01 vs. 
limit=15.0 2023-10-09 23:34:26,071 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.451e+02 1.893e+02 2.174e+02 2.445e+02 4.365e+02, threshold=4.348e+02, percent-clipped=1.0 2023-10-09 23:34:28,058 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=167160.0, ans=0.1 2023-10-09 23:34:31,695 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=167206.66666666666, ans=0.1 2023-10-09 23:34:33,709 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=167206.66666666666, ans=0.04949747468305833 2023-10-09 23:34:58,340 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=167300.0, ans=0.125 2023-10-09 23:35:07,731 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=167346.66666666666, ans=0.07 2023-10-09 23:35:07,806 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=167346.66666666666, ans=0.2 2023-10-09 23:35:21,926 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=167393.33333333334, ans=0.125 2023-10-09 23:35:23,054 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=167393.33333333334, ans=0.1 2023-10-09 23:35:32,921 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=167440.0, ans=0.2 2023-10-09 23:35:42,697 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=167486.66666666666, ans=0.1 2023-10-09 23:35:56,196 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=167533.33333333334, ans=0.1 2023-10-09 23:35:57,516 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=167533.33333333334, ans=0.0 2023-10-09 23:36:26,708 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 1.873e+02 2.050e+02 2.338e+02 3.728e+02, threshold=4.100e+02, percent-clipped=0.0 2023-10-09 23:36:34,260 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=167673.33333333334, ans=0.2 2023-10-09 23:36:42,941 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=167720.0, ans=0.1 2023-10-09 23:36:55,724 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=167766.66666666666, ans=0.2 2023-10-09 23:37:03,146 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=167766.66666666666, ans=0.1 2023-10-09 23:37:12,755 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=167813.33333333334, ans=0.025 2023-10-09 23:37:13,910 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=167813.33333333334, ans=0.125 2023-10-09 23:37:21,907 INFO [scaling.py:979] 
(0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.87 vs. limit=8.0 2023-10-09 23:37:29,330 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=167860.0, ans=0.125 2023-10-09 23:37:38,229 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=167906.66666666666, ans=0.2 2023-10-09 23:37:43,383 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=167906.66666666666, ans=0.125 2023-10-09 23:37:44,872 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=167906.66666666666, ans=0.0 2023-10-09 23:37:44,980 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=167906.66666666666, ans=0.125 2023-10-09 23:37:50,918 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=167953.33333333334, ans=0.125 2023-10-09 23:38:07,793 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.89 vs. limit=6.0 2023-10-09 23:38:10,400 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=168000.0, ans=0.1 2023-10-09 23:38:11,773 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=168046.66666666666, ans=0.125 2023-10-09 23:38:17,644 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=168046.66666666666, ans=0.1 2023-10-09 23:38:28,463 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 1.684e+02 1.926e+02 2.095e+02 3.028e+02, threshold=3.851e+02, percent-clipped=0.0 2023-10-09 23:38:46,932 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=168186.66666666666, ans=0.5 2023-10-09 23:38:48,845 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=168186.66666666666, ans=0.1 2023-10-09 23:38:51,835 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=168186.66666666666, ans=0.125 2023-10-09 23:38:51,865 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=168186.66666666666, ans=0.125 2023-10-09 23:38:53,875 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=168186.66666666666, ans=0.125 2023-10-09 23:38:55,011 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.74 vs. limit=10.0 2023-10-09 23:39:26,442 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.43 vs. 
limit=22.5 2023-10-09 23:39:33,624 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=168326.66666666666, ans=0.125 2023-10-09 23:39:38,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=168373.33333333334, ans=0.2 2023-10-09 23:39:43,781 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=168373.33333333334, ans=0.125 2023-10-09 23:39:58,008 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=168420.0, ans=0.0 2023-10-09 23:39:58,285 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.37 vs. limit=15.0 2023-10-09 23:40:05,635 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=168466.66666666666, ans=0.0 2023-10-09 23:40:22,384 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.42 vs. limit=15.0 2023-10-09 23:40:28,407 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.460e+02 1.684e+02 1.957e+02 2.216e+02 3.140e+02, threshold=3.914e+02, percent-clipped=0.0 2023-10-09 23:40:30,604 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=168560.0, ans=0.0 2023-10-09 23:40:48,721 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=168653.33333333334, ans=0.125 2023-10-09 23:40:50,284 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=168653.33333333334, ans=0.0 2023-10-09 23:40:54,972 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=168700.0, ans=0.125 2023-10-09 23:41:13,729 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.65 vs. limit=15.0 2023-10-09 23:41:26,157 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.91 vs. limit=10.0 2023-10-09 23:41:29,297 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.50 vs. limit=15.0 2023-10-09 23:41:33,759 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.44 vs. limit=6.0 2023-10-09 23:41:36,478 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=168886.66666666666, ans=0.0 2023-10-09 23:41:38,100 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=168886.66666666666, ans=0.125 2023-10-09 23:42:00,049 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=168980.0, ans=0.09899494936611666 2023-10-09 23:42:07,200 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.34 vs. 
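The scaling.py:979 Whitening records compare a per-module metric against a limit; conceptually the metric measures how far a group of channels is from a white (identity-like) covariance, with 1.0 meaning perfectly white and larger values a more lopsided spectrum. A plausible eigenvalue-based version of such a metric (an illustrative assumption, not necessarily scaling.py's exact formula):

    import torch

    def whiteness_metric(x, num_groups=1):
        # x: (N, C) activations. Per channel group, compute E[lambda^2] / (E[lambda])^2
        # over the eigenvalues of the covariance: exactly 1.0 iff the covariance is a
        # multiple of the identity, growing as a few directions dominate.
        N, C = x.shape
        g = C // num_groups
        vals = []
        for i in range(num_groups):
            xg = x[:, i * g:(i + 1) * g]
            eigs = torch.linalg.eigvalsh(xg.T @ xg / N)
            vals.append(eigs.pow(2).mean() / eigs.mean().pow(2))
        return torch.stack(vals).mean()

    x = torch.randn(1000, 192) @ torch.randn(192, 192)  # correlated channels
    print(whiteness_metric(x).item())                   # compare against e.g. limit=15.0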
limit=15.0 2023-10-09 23:42:13,011 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 23:42:17,035 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.468e+02 1.841e+02 2.061e+02 2.419e+02 3.298e+02, threshold=4.122e+02, percent-clipped=0.0 2023-10-09 23:42:20,050 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=169073.33333333334, ans=0.125 2023-10-09 23:42:25,982 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=169073.33333333334, ans=0.0 2023-10-09 23:42:29,913 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=169073.33333333334, ans=0.125 2023-10-09 23:42:38,507 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=169120.0, ans=0.125 2023-10-09 23:42:48,469 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-09 23:43:14,012 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=169306.66666666666, ans=0.125 2023-10-09 23:43:30,832 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.60 vs. limit=12.0 2023-10-09 23:43:31,639 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=169353.33333333334, ans=0.125 2023-10-09 23:43:39,375 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=169400.0, ans=0.0 2023-10-09 23:43:48,848 INFO [train.py:1031] (0/4) Epoch 3, batch 9000, loss[loss=0.2752, simple_loss=0.3547, pruned_loss=0.09791, over 16880.00 frames. ], tot_loss[loss=0.2634, simple_loss=0.3385, pruned_loss=0.09411, over 32438224.83 frames. ], batch size: 130, lr: 1.35e-02, grad_scale: 32.0 2023-10-09 23:44:06,274 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.609e+02 1.971e+02 2.282e+02 2.661e+02 3.968e+02, threshold=4.563e+02, percent-clipped=0.0 2023-10-09 23:44:09,142 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.78 vs. 
limit=15.0 2023-10-09 23:44:13,466 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=169540.0, ans=0.07 2023-10-09 23:44:22,546 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=169586.66666666666, ans=0.1 2023-10-09 23:44:44,341 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=169680.0, ans=0.0 2023-10-09 23:44:52,130 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=169726.66666666666, ans=0.125 2023-10-09 23:45:02,386 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=169773.33333333334, ans=0.125 2023-10-09 23:45:08,666 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=169773.33333333334, ans=0.125 2023-10-09 23:45:10,777 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.27 vs. limit=22.5 2023-10-09 23:45:11,579 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=169773.33333333334, ans=0.0 2023-10-09 23:45:13,552 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=169820.0, ans=0.035 2023-10-09 23:45:13,576 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=169820.0, ans=0.1 2023-10-09 23:45:23,858 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.17 vs. limit=15.0 2023-10-09 23:45:35,218 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=169913.33333333334, ans=0.125 2023-10-09 23:45:41,286 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=169913.33333333334, ans=0.04949747468305833 2023-10-09 23:45:47,165 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=169960.0, ans=0.125 2023-10-09 23:45:51,503 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.510e+02 1.959e+02 2.175e+02 2.461e+02 2.986e+02, threshold=4.350e+02, percent-clipped=0.0 2023-10-09 23:45:57,306 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=170006.66666666666, ans=0.2 2023-10-09 23:45:59,187 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=170006.66666666666, ans=0.125 2023-10-09 23:46:06,331 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.73 vs. limit=15.0 2023-10-09 23:46:27,058 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.36 vs. 
limit=15.0 2023-10-09 23:46:29,284 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-09 23:46:37,357 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-09 23:46:46,182 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-09 23:46:50,917 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=170240.0, ans=0.0 2023-10-09 23:46:52,581 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=170240.0, ans=0.125 2023-10-09 23:46:56,358 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=170240.0, ans=0.0 2023-10-09 23:47:07,331 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=170286.66666666666, ans=0.125 2023-10-09 23:47:15,944 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=170333.33333333334, ans=0.0 2023-10-09 23:47:23,480 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=170380.0, ans=0.125 2023-10-09 23:47:38,099 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.68 vs. limit=22.5 2023-10-09 23:47:38,313 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.521e+02 1.872e+02 2.156e+02 2.542e+02 3.313e+02, threshold=4.312e+02, percent-clipped=0.0 2023-10-09 23:47:41,668 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=170473.33333333334, ans=0.0 2023-10-09 23:47:45,177 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.41 vs. limit=12.0 2023-10-09 23:47:45,839 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=170473.33333333334, ans=0.0 2023-10-09 23:47:49,516 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=170473.33333333334, ans=0.04949747468305833 2023-10-09 23:47:56,726 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.97 vs. limit=6.0 2023-10-09 23:47:59,338 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=170520.0, ans=0.125 2023-10-09 23:48:20,138 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=170613.33333333334, ans=0.125 2023-10-09 23:48:28,463 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=170660.0, ans=0.2 2023-10-09 23:48:38,314 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=170706.66666666666, ans=0.0 2023-10-09 23:48:59,516 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.97 vs. 
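
The "WithLoss: name=..., loss-sum=0.000e+00" records track attention-weight modules wrapped so that an auxiliary penalty can be backpropagated alongside the main objective; a loss-sum of 0.000e+00 says the penalty contributed nothing over the logging window. One plausible reconstruction, an assumption rather than the actual scaling.py code, is an identity-forward autograd function that also routes a unit gradient into the auxiliary loss:

    import torch

    class WithLossFn(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, aux_loss):
            ctx.save_for_backward(aux_loss)
            return x                      # identity on the activations

        @staticmethod
        def backward(ctx, grad_output):
            (aux_loss,) = ctx.saved_tensors
            # a unit gradient into aux_loss makes its graph contribute to
            # the parameter update without changing the forward pass
            return grad_output, torch.ones_like(aux_loss)
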
limit=22.5 2023-10-09 23:49:02,066 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.15 vs. limit=10.0 2023-10-09 23:49:06,096 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=170800.0, ans=0.0 2023-10-09 23:49:26,059 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.518e+02 1.845e+02 2.034e+02 2.324e+02 3.280e+02, threshold=4.068e+02, percent-clipped=0.0 2023-10-09 23:49:27,837 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=170893.33333333334, ans=0.07 2023-10-09 23:49:31,394 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=170940.0, ans=0.125 2023-10-09 23:49:39,582 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=170940.0, ans=0.125 2023-10-09 23:49:45,054 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.84 vs. limit=6.0 2023-10-09 23:50:20,159 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=171126.66666666666, ans=0.0 2023-10-09 23:50:39,413 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.11 vs. limit=22.5 2023-10-09 23:51:08,998 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=171266.66666666666, ans=0.125 2023-10-09 23:51:13,058 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=171313.33333333334, ans=0.0 2023-10-09 23:51:15,726 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=171313.33333333334, ans=0.1 2023-10-09 23:51:16,687 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=171313.33333333334, ans=0.125 2023-10-09 23:51:21,314 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=171313.33333333334, ans=0.0 2023-10-09 23:51:22,382 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=171313.33333333334, ans=0.125 2023-10-09 23:51:31,315 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.425e+02 1.888e+02 2.123e+02 2.427e+02 4.332e+02, threshold=4.245e+02, percent-clipped=1.0 2023-10-09 23:51:40,798 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=171406.66666666666, ans=0.125 2023-10-09 23:52:06,714 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=171500.0, ans=0.0 2023-10-09 23:52:10,366 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=171546.66666666666, ans=0.0 2023-10-09 23:52:11,437 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=171546.66666666666, ans=0.0 2023-10-09 23:52:14,522 INFO [scaling.py:979] (0/4) Whitening: 
name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.74 vs. limit=15.0 2023-10-09 23:52:17,952 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.55 vs. limit=15.0 2023-10-09 23:52:21,737 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=171593.33333333334, ans=0.1 2023-10-09 23:52:24,711 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=171593.33333333334, ans=10.0 2023-10-09 23:52:57,747 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=171733.33333333334, ans=0.125 2023-10-09 23:53:09,441 INFO [train.py:1031] (0/4) Epoch 3, batch 9500, loss[loss=0.2561, simple_loss=0.3353, pruned_loss=0.08843, over 16465.00 frames. ], tot_loss[loss=0.2638, simple_loss=0.339, pruned_loss=0.09432, over 32505260.61 frames. ], batch size: 266, lr: 1.34e-02, grad_scale: 32.0 2023-10-09 23:53:14,832 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=171780.0, ans=0.125 2023-10-09 23:53:22,733 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.14 vs. limit=15.0 2023-10-09 23:53:28,049 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 1.802e+02 2.025e+02 2.317e+02 4.130e+02, threshold=4.049e+02, percent-clipped=0.0 2023-10-09 23:53:32,392 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.19 vs. limit=15.0 2023-10-09 23:53:35,077 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=171873.33333333334, ans=0.0 2023-10-09 23:54:03,394 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.82 vs. limit=22.5 2023-10-09 23:54:06,759 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=172013.33333333334, ans=0.125 2023-10-09 23:54:18,998 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=172060.0, ans=0.125 2023-10-09 23:54:19,255 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.08 vs. 
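
Each "Whitening: ... metric=X vs. limit=Y" record compares a per-module covariance statistic against a (possibly scheduled) limit; when the metric exceeds the limit, a corrective gradient pushes the module's output covariance back toward white. A sketch of a metric with the right behaviour, reconstructed from the log rather than copied from scaling.py, and showing only the single-group case for brevity: it is 1.0 when the channel covariance is a multiple of the identity and grows as channels become correlated or unbalanced:

    import torch

    def whitening_metric(x: torch.Tensor) -> torch.Tensor:
        x = x.reshape(-1, x.shape[-1])        # (frames, num_channels)
        cov = x.t() @ x / x.shape[0]          # channel covariance
        mean_diag = cov.diagonal().mean()
        num_channels = cov.shape[0]
        # equals 1.0 for cov = c * I; larger means "less white"
        return (cov ** 2).sum() / (mean_diag ** 2 * num_channels)
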
limit=6.0 2023-10-09 23:54:29,655 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=172106.66666666666, ans=0.125 2023-10-09 23:54:39,204 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=172153.33333333334, ans=0.125 2023-10-09 23:54:51,181 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=172200.0, ans=0.015 2023-10-09 23:55:07,761 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=172246.66666666666, ans=0.125 2023-10-09 23:55:20,107 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.486e+02 1.828e+02 2.016e+02 2.337e+02 3.369e+02, threshold=4.031e+02, percent-clipped=0.0 2023-10-09 23:55:24,490 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.98 vs. limit=12.0 2023-10-09 23:55:25,289 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=172340.0, ans=0.0 2023-10-09 23:55:29,002 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=172340.0, ans=0.2 2023-10-09 23:55:40,713 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=172386.66666666666, ans=0.125 2023-10-09 23:55:54,247 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=172433.33333333334, ans=0.04949747468305833 2023-10-09 23:56:29,915 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=172573.33333333334, ans=0.09899494936611666 2023-10-09 23:56:47,642 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.52 vs. limit=22.5 2023-10-09 23:56:48,450 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.66 vs. limit=22.5 2023-10-09 23:56:56,626 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=172713.33333333334, ans=0.0 2023-10-09 23:56:58,240 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.79 vs. 
limit=15.0 2023-10-09 23:57:02,923 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=172713.33333333334, ans=0.125 2023-10-09 23:57:13,369 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.354e+02 1.768e+02 1.962e+02 2.260e+02 3.269e+02, threshold=3.923e+02, percent-clipped=0.0 2023-10-09 23:57:13,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=172760.0, ans=0.125 2023-10-09 23:57:19,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=172806.66666666666, ans=0.1 2023-10-09 23:57:28,852 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=172853.33333333334, ans=0.0 2023-10-09 23:57:54,443 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.77 vs. limit=22.5 2023-10-09 23:58:08,658 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.92 vs. limit=6.0 2023-10-09 23:58:27,705 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=173086.66666666666, ans=0.125 2023-10-09 23:58:52,030 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=173180.0, ans=0.0 2023-10-09 23:58:57,219 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=173180.0, ans=0.125 2023-10-09 23:59:06,370 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.456e+02 1.894e+02 2.152e+02 2.500e+02 3.510e+02, threshold=4.303e+02, percent-clipped=0.0 2023-10-09 23:59:32,575 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=173366.66666666666, ans=0.07 2023-10-09 23:59:47,072 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.13 vs. limit=6.0 2023-10-10 00:00:11,057 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=173506.66666666666, ans=0.0 2023-10-10 00:00:27,611 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=173600.0, ans=0.125 2023-10-10 00:00:39,992 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=173646.66666666666, ans=0.125 2023-10-10 00:00:46,654 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=173646.66666666666, ans=0.0 2023-10-10 00:00:49,808 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.35 vs. 
limit=15.0 2023-10-10 00:00:55,182 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.496e+02 1.840e+02 1.963e+02 2.343e+02 3.564e+02, threshold=3.926e+02, percent-clipped=0.0 2023-10-10 00:00:59,991 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=173740.0, ans=0.125 2023-10-10 00:01:03,697 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=173740.0, ans=0.125 2023-10-10 00:01:09,164 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=173786.66666666666, ans=0.04949747468305833 2023-10-10 00:01:38,734 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.87 vs. limit=6.0 2023-10-10 00:01:38,858 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.37 vs. limit=15.0 2023-10-10 00:01:57,227 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-10 00:01:59,873 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=173973.33333333334, ans=0.0 2023-10-10 00:02:04,618 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=174020.0, ans=0.1 2023-10-10 00:02:12,504 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=174020.0, ans=0.125 2023-10-10 00:02:14,311 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=174066.66666666666, ans=0.125 2023-10-10 00:02:17,691 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=174066.66666666666, ans=0.125 2023-10-10 00:02:18,756 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=174066.66666666666, ans=0.2 2023-10-10 00:02:19,518 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=174066.66666666666, ans=0.05 2023-10-10 00:02:20,902 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.18 vs. limit=10.0 2023-10-10 00:02:24,891 INFO [train.py:1031] (0/4) Epoch 3, batch 10000, loss[loss=0.2417, simple_loss=0.3144, pruned_loss=0.08449, over 15887.00 frames. ], tot_loss[loss=0.2625, simple_loss=0.3377, pruned_loss=0.09361, over 32549191.54 frames. ], batch size: 43, lr: 1.34e-02, grad_scale: 32.0 2023-10-10 00:02:32,379 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.99 vs. 
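
In the batch summaries, tot_loss is reported over a fractional frame count (e.g. "over 32549191.54 frames" just above): a raw cumulative total would be an integer, so this is evidently a decayed running sum in which older batches are progressively down-weighted. A sketch of that accumulation, with the decay constant assumed rather than read from the log:

    DECAY = 0.999   # assumed smoothing constant, not taken from this log

    def update_tot_loss(loss_sum, frame_sum, batch_loss, batch_frames):
        # decay the history, then add the new batch; the reported
        # tot_loss is loss_sum / frame_sum after the update
        loss_sum = loss_sum * DECAY + batch_loss * batch_frames
        frame_sum = frame_sum * DECAY + batch_frames
        return loss_sum, frame_sum
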
limit=15.0 2023-10-10 00:02:37,097 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=174160.0, ans=0.0 2023-10-10 00:02:41,990 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 1.804e+02 1.964e+02 2.245e+02 3.034e+02, threshold=3.928e+02, percent-clipped=0.0 2023-10-10 00:02:49,850 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.02 vs. limit=15.0 2023-10-10 00:03:08,124 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=174300.0, ans=0.2 2023-10-10 00:03:25,053 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=6.72 vs. limit=12.0 2023-10-10 00:03:51,357 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=174440.0, ans=0.0 2023-10-10 00:03:55,110 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=174486.66666666666, ans=0.125 2023-10-10 00:04:07,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=174533.33333333334, ans=0.1 2023-10-10 00:04:09,269 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.28 vs. limit=12.0 2023-10-10 00:04:09,325 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.60 vs. limit=12.0 2023-10-10 00:04:13,507 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=174533.33333333334, ans=10.0 2023-10-10 00:04:30,777 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=174626.66666666666, ans=0.0 2023-10-10 00:04:31,153 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.11 vs. limit=22.5 2023-10-10 00:04:35,011 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.581e+02 1.897e+02 2.043e+02 2.355e+02 3.801e+02, threshold=4.086e+02, percent-clipped=0.0 2023-10-10 00:04:37,628 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=174626.66666666666, ans=0.2 2023-10-10 00:04:44,505 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=174673.33333333334, ans=0.125 2023-10-10 00:04:49,912 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.22 vs. 
limit=22.5 2023-10-10 00:05:05,654 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=174766.66666666666, ans=0.125 2023-10-10 00:05:12,076 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=174813.33333333334, ans=0.1 2023-10-10 00:05:39,107 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=174906.66666666666, ans=0.125 2023-10-10 00:05:45,588 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=174953.33333333334, ans=0.125 2023-10-10 00:05:47,396 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=174953.33333333334, ans=0.125 2023-10-10 00:05:51,899 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=174953.33333333334, ans=0.1 2023-10-10 00:06:09,401 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=175046.66666666666, ans=0.125 2023-10-10 00:06:13,772 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=175046.66666666666, ans=0.0 2023-10-10 00:06:14,984 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=175046.66666666666, ans=0.95 2023-10-10 00:06:18,019 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.24 vs. limit=15.0 2023-10-10 00:06:27,960 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.821e+02 2.000e+02 2.240e+02 2.944e+02, threshold=4.000e+02, percent-clipped=0.0 2023-10-10 00:06:28,437 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=175093.33333333334, ans=0.125 2023-10-10 00:06:33,039 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.85 vs. limit=15.0 2023-10-10 00:07:08,861 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=175280.0, ans=0.1 2023-10-10 00:07:26,187 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.97 vs. limit=15.0 2023-10-10 00:08:00,171 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.32 vs. limit=15.0 2023-10-10 00:08:06,945 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.32 vs. limit=6.0 2023-10-10 00:08:14,575 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=175560.0, ans=0.125 2023-10-10 00:08:16,871 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.50 vs. 
limit=15.0 2023-10-10 00:08:21,759 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.857e+02 2.268e+02 2.697e+02 4.178e+02, threshold=4.535e+02, percent-clipped=1.0 2023-10-10 00:08:45,841 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=175653.33333333334, ans=0.07 2023-10-10 00:09:14,004 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.61 vs. limit=15.0 2023-10-10 00:09:45,100 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=175886.66666666666, ans=0.0 2023-10-10 00:10:18,109 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.464e+02 1.761e+02 1.942e+02 2.207e+02 3.961e+02, threshold=3.883e+02, percent-clipped=0.0 2023-10-10 00:10:26,565 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=176073.33333333334, ans=0.0 2023-10-10 00:10:40,569 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.36 vs. limit=15.0 2023-10-10 00:10:54,910 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=176166.66666666666, ans=0.125 2023-10-10 00:11:21,889 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=176260.0, ans=0.0 2023-10-10 00:11:27,390 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=176306.66666666666, ans=0.1 2023-10-10 00:11:28,398 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=176306.66666666666, ans=0.0 2023-10-10 00:11:40,564 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=176353.33333333334, ans=0.0 2023-10-10 00:11:46,164 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=176400.0, ans=0.0 2023-10-10 00:11:54,425 INFO [train.py:1031] (0/4) Epoch 3, batch 10500, loss[loss=0.2371, simple_loss=0.3196, pruned_loss=0.07724, over 16920.00 frames. ], tot_loss[loss=0.2622, simple_loss=0.3377, pruned_loss=0.09335, over 32598082.55 frames. ], batch size: 72, lr: 1.33e-02, grad_scale: 32.0 2023-10-10 00:12:09,464 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=176493.33333333334, ans=0.125 2023-10-10 00:12:10,821 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.383e+02 1.809e+02 1.998e+02 2.329e+02 3.126e+02, threshold=3.996e+02, percent-clipped=0.0 2023-10-10 00:12:16,387 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.37 vs. 
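
The learning rate in these summaries decays slowly (1.35e-02 at batch 9000 down to 1.30e-02 by batch 12000), consistent with an Eden-style schedule, which decays polynomially in both batch count and epoch. A sketch under that assumption; the constants below are recipe-style defaults, and the exact batch/epoch inputs used for this run are not recoverable from the log alone:

    def eden_lr(base_lr: float, batch: float, epoch: float,
                lr_batches: float = 7500.0, lr_epochs: float = 1.0) -> float:
        return (base_lr
                * ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
                * ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25)
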
limit=22.5 2023-10-10 00:12:35,537 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=176633.33333333334, ans=0.125 2023-10-10 00:12:42,040 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=176633.33333333334, ans=0.015 2023-10-10 00:12:43,282 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=176633.33333333334, ans=0.0 2023-10-10 00:13:22,375 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=176773.33333333334, ans=0.125 2023-10-10 00:13:37,550 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=176820.0, ans=0.07 2023-10-10 00:14:08,128 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=176960.0, ans=0.035 2023-10-10 00:14:08,458 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.59 vs. limit=15.0 2023-10-10 00:14:09,387 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.522e+02 1.904e+02 2.144e+02 2.407e+02 3.455e+02, threshold=4.288e+02, percent-clipped=0.0 2023-10-10 00:14:55,714 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 00:15:24,748 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys.whitening_limit, batch_count=177286.66666666666, ans=6.0 2023-10-10 00:15:29,503 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=177286.66666666666, ans=0.0 2023-10-10 00:15:29,516 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=177286.66666666666, ans=0.125 2023-10-10 00:16:04,484 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.757e+02 1.987e+02 2.332e+02 3.150e+02, threshold=3.973e+02, percent-clipped=0.0 2023-10-10 00:16:05,810 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=177426.66666666666, ans=0.0 2023-10-10 00:16:21,205 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=177520.0, ans=0.0 2023-10-10 00:16:21,229 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=177520.0, ans=0.125 2023-10-10 00:16:32,077 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=177566.66666666666, ans=0.0 2023-10-10 00:16:32,991 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=177566.66666666666, ans=0.0 2023-10-10 00:16:43,738 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=177613.33333333334, ans=0.125 2023-10-10 00:16:44,073 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.69 vs. 
limit=15.0 2023-10-10 00:16:56,953 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=177660.0, ans=0.1 2023-10-10 00:16:58,240 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=177660.0, ans=0.125 2023-10-10 00:17:01,658 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=177660.0, ans=0.125 2023-10-10 00:17:05,907 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 00:17:17,784 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 00:17:52,094 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=177893.33333333334, ans=0.125 2023-10-10 00:17:54,201 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.559e+02 1.948e+02 2.104e+02 2.512e+02 3.303e+02, threshold=4.209e+02, percent-clipped=0.0 2023-10-10 00:18:00,350 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=177940.0, ans=0.125 2023-10-10 00:18:13,283 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.43 vs. limit=6.0 2023-10-10 00:18:14,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten.whitening_limit, batch_count=177986.66666666666, ans=15.0 2023-10-10 00:18:21,916 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.03 vs. limit=22.5 2023-10-10 00:18:37,932 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=178080.0, ans=0.0 2023-10-10 00:19:24,766 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=178266.66666666666, ans=0.0 2023-10-10 00:19:35,339 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=178313.33333333334, ans=0.125 2023-10-10 00:19:43,286 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.557e+02 1.870e+02 2.079e+02 2.308e+02 3.254e+02, threshold=4.157e+02, percent-clipped=0.0 2023-10-10 00:19:50,096 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=178406.66666666666, ans=0.125 2023-10-10 00:19:55,263 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=178406.66666666666, ans=0.125 2023-10-10 00:20:29,431 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=178546.66666666666, ans=0.0 2023-10-10 00:20:39,062 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.41 vs. limit=22.5 2023-10-10 00:20:39,861 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.31 vs. 
limit=12.0 2023-10-10 00:20:51,933 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.89 vs. limit=6.0 2023-10-10 00:21:04,868 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.56 vs. limit=6.0 2023-10-10 00:21:14,376 INFO [train.py:1031] (0/4) Epoch 3, batch 11000, loss[loss=0.265, simple_loss=0.3341, pruned_loss=0.09801, over 16687.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3375, pruned_loss=0.0932, over 32611012.19 frames. ], batch size: 56, lr: 1.32e-02, grad_scale: 64.0 2023-10-10 00:21:15,075 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.93 vs. limit=15.0 2023-10-10 00:21:18,226 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.88 vs. limit=12.0 2023-10-10 00:21:31,653 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.543e+02 1.973e+02 2.340e+02 2.706e+02 3.666e+02, threshold=4.680e+02, percent-clipped=0.0 2023-10-10 00:21:52,450 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=178920.0, ans=0.2 2023-10-10 00:22:00,207 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=178966.66666666666, ans=0.125 2023-10-10 00:22:01,798 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=178966.66666666666, ans=0.0 2023-10-10 00:22:40,457 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.58 vs. limit=15.0 2023-10-10 00:22:59,194 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=179200.0, ans=0.125 2023-10-10 00:23:02,331 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=179200.0, ans=0.1 2023-10-10 00:23:31,812 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.385e+02 1.742e+02 1.981e+02 2.277e+02 4.044e+02, threshold=3.962e+02, percent-clipped=0.0 2023-10-10 00:23:32,029 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=179293.33333333334, ans=0.1 2023-10-10 00:23:40,709 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=179340.0, ans=0.0 2023-10-10 00:23:49,719 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.27 vs. limit=12.0 2023-10-10 00:23:50,873 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=179386.66666666666, ans=0.125 2023-10-10 00:24:16,725 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.94 vs. 
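
The grad_scale field in the batch summaries moves between 16.0, 32.0 and 64.0 across this stretch of training, the signature of dynamic loss scaling on the fp16 path: the scale is grown while gradients stay finite and backed off when an overflow is detected. The same behaviour in stock PyTorch AMP terms, as an analogy rather than the recipe's own scaler:

    import torch

    scaler = torch.cuda.amp.GradScaler(init_scale=32.0, growth_factor=2.0,
                                       backoff_factor=0.5, growth_interval=2000)

    # skeleton loop; model, optimizer and loader are assumed to exist
    # for batch in loader:
    #     optimizer.zero_grad()
    #     with torch.cuda.amp.autocast():
    #         loss = model(batch)
    #     scaler.scale(loss).backward()
    #     scaler.step(optimizer)   # skipped if inf/nan gradients were found
    #     scaler.update()          # grows or halves the scale, as logged
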
limit=15.0 2023-10-10 00:24:49,371 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=179620.0, ans=0.1 2023-10-10 00:24:54,552 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=179620.0, ans=0.1 2023-10-10 00:25:11,733 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=179713.33333333334, ans=0.125 2023-10-10 00:25:23,070 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.97 vs. limit=15.0 2023-10-10 00:25:28,991 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.353e+02 1.768e+02 1.925e+02 2.322e+02 4.129e+02, threshold=3.850e+02, percent-clipped=1.0 2023-10-10 00:25:29,383 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 00:25:32,299 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=179806.66666666666, ans=0.0 2023-10-10 00:25:32,307 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=179806.66666666666, ans=0.125 2023-10-10 00:25:33,235 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=179806.66666666666, ans=0.125 2023-10-10 00:25:35,620 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=179806.66666666666, ans=0.1 2023-10-10 00:25:37,794 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=179806.66666666666, ans=0.125 2023-10-10 00:25:48,910 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.05 vs. limit=15.0 2023-10-10 00:25:57,736 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.95 vs. limit=15.0 2023-10-10 00:26:48,425 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=180086.66666666666, ans=0.125 2023-10-10 00:26:49,521 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=180086.66666666666, ans=0.0 2023-10-10 00:27:01,378 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_positive, batch_count=180133.33333333334, ans=0.05 2023-10-10 00:27:05,808 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=180180.0, ans=0.125 2023-10-10 00:27:17,129 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.45 vs. 
limit=12.0 2023-10-10 00:27:19,315 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=180226.66666666666, ans=0.125 2023-10-10 00:27:21,161 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=180226.66666666666, ans=0.125 2023-10-10 00:27:23,332 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=180226.66666666666, ans=0.05 2023-10-10 00:27:24,975 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.872e+02 2.178e+02 2.398e+02 3.467e+02, threshold=4.356e+02, percent-clipped=0.0 2023-10-10 00:27:37,207 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=180320.0, ans=0.125 2023-10-10 00:28:04,041 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=180413.33333333334, ans=0.2 2023-10-10 00:28:21,517 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=180460.0, ans=0.125 2023-10-10 00:28:36,753 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.58 vs. limit=10.0 2023-10-10 00:28:46,863 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=180600.0, ans=0.125 2023-10-10 00:28:59,259 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.64 vs. limit=15.0 2023-10-10 00:29:15,879 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.473e+02 1.887e+02 2.152e+02 2.473e+02 4.069e+02, threshold=4.303e+02, percent-clipped=0.0 2023-10-10 00:29:48,951 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=180833.33333333334, ans=0.125 2023-10-10 00:29:55,348 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.03 vs. limit=15.0 2023-10-10 00:30:23,318 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=180973.33333333334, ans=0.5 2023-10-10 00:30:32,972 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=181020.0, ans=0.125 2023-10-10 00:30:35,264 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=181020.0, ans=0.2 2023-10-10 00:30:48,679 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=181113.33333333334, ans=0.125 2023-10-10 00:30:49,492 INFO [train.py:1031] (0/4) Epoch 3, batch 11500, loss[loss=0.2641, simple_loss=0.3527, pruned_loss=0.08771, over 16809.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.3369, pruned_loss=0.09285, over 32634270.33 frames. 
], batch size: 188, lr: 1.31e-02, grad_scale: 16.0 2023-10-10 00:30:49,742 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=181113.33333333334, ans=0.1 2023-10-10 00:31:08,969 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.589e+02 1.859e+02 2.091e+02 2.411e+02 3.307e+02, threshold=4.181e+02, percent-clipped=0.0 2023-10-10 00:31:39,954 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=181300.0, ans=0.05 2023-10-10 00:32:07,581 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.86 vs. limit=15.0 2023-10-10 00:32:12,588 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=181440.0, ans=0.2 2023-10-10 00:32:13,726 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.38 vs. limit=22.5 2023-10-10 00:32:20,183 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=181486.66666666666, ans=0.0 2023-10-10 00:32:22,898 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.10 vs. limit=6.0 2023-10-10 00:32:29,934 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=181486.66666666666, ans=0.125 2023-10-10 00:32:36,738 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.78 vs. limit=15.0 2023-10-10 00:32:53,357 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=181580.0, ans=0.0 2023-10-10 00:32:58,331 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=181626.66666666666, ans=0.125 2023-10-10 00:33:05,310 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.501e+02 1.861e+02 2.064e+02 2.387e+02 2.869e+02, threshold=4.128e+02, percent-clipped=0.0 2023-10-10 00:33:08,739 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.85 vs. 
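
Across the loss records in this log the three numbers satisfy loss ≈ 0.5 * simple_loss + pruned_loss; for the batch-11500 example above, 0.5 * 0.3527 + 0.08771 = 0.2641. That is the usual pruned-transducer combination with a simple-loss weight of 0.5, checked below (the helper name is illustrative):

    def transducer_loss(simple_loss: float, pruned_loss: float,
                        simple_loss_scale: float = 0.5) -> float:
        # weighted combination consistent with the records above
        return simple_loss_scale * simple_loss + pruned_loss

    assert abs(transducer_loss(0.3527, 0.08771) - 0.2641) < 1e-4
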
limit=6.0 2023-10-10 00:33:09,641 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=181673.33333333334, ans=0.125 2023-10-10 00:33:15,638 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=181673.33333333334, ans=0.2 2023-10-10 00:33:25,254 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=181720.0, ans=0.125 2023-10-10 00:33:33,725 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=181766.66666666666, ans=0.0 2023-10-10 00:33:43,922 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=181813.33333333334, ans=0.125 2023-10-10 00:33:47,271 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=181813.33333333334, ans=0.2 2023-10-10 00:34:18,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=181953.33333333334, ans=0.125 2023-10-10 00:34:19,160 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=181953.33333333334, ans=0.2 2023-10-10 00:34:36,809 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=182046.66666666666, ans=0.125 2023-10-10 00:34:52,508 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.865e+02 2.145e+02 2.419e+02 3.775e+02, threshold=4.290e+02, percent-clipped=0.0 2023-10-10 00:35:05,495 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=182186.66666666666, ans=0.125 2023-10-10 00:35:31,626 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=182280.0, ans=0.1 2023-10-10 00:35:41,016 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=182280.0, ans=0.125 2023-10-10 00:35:46,554 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=182326.66666666666, ans=0.0 2023-10-10 00:35:49,982 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=182326.66666666666, ans=0.0 2023-10-10 00:36:00,608 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=182373.33333333334, ans=0.0 2023-10-10 00:36:01,901 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=182373.33333333334, ans=0.125 2023-10-10 00:36:03,628 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=182373.33333333334, ans=0.0 2023-10-10 00:36:06,863 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=182373.33333333334, ans=0.0 2023-10-10 00:36:13,336 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=182420.0, ans=0.125 2023-10-10 00:36:25,139 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=182466.66666666666, ans=0.0 2023-10-10 00:36:42,903 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=182513.33333333334, ans=0.0 2023-10-10 00:36:43,881 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=182560.0, ans=0.125 2023-10-10 00:36:43,911 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=182560.0, ans=0.2 2023-10-10 00:36:53,627 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.393e+02 1.762e+02 1.933e+02 2.168e+02 3.098e+02, threshold=3.866e+02, percent-clipped=0.0 2023-10-10 00:37:10,478 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=182653.33333333334, ans=0.2 2023-10-10 00:37:15,049 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=182653.33333333334, ans=0.0 2023-10-10 00:37:16,837 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=182653.33333333334, ans=0.07 2023-10-10 00:37:25,428 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 00:37:26,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=182700.0, ans=0.0 2023-10-10 00:37:41,809 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 00:38:10,186 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.74 vs. limit=15.0 2023-10-10 00:38:29,402 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=182980.0, ans=15.0 2023-10-10 00:38:35,942 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.16 vs. 
limit=15.0 2023-10-10 00:38:36,559 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=182980.0, ans=0.1 2023-10-10 00:38:36,563 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 00:38:48,936 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.398e+02 1.893e+02 2.048e+02 2.378e+02 3.433e+02, threshold=4.095e+02, percent-clipped=0.0 2023-10-10 00:38:53,622 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=183073.33333333334, ans=0.0 2023-10-10 00:39:03,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=183120.0, ans=0.125 2023-10-10 00:40:08,219 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=183400.0, ans=0.0 2023-10-10 00:40:10,966 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=183400.0, ans=0.0 2023-10-10 00:40:18,010 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.75 vs. limit=22.5 2023-10-10 00:40:19,631 INFO [train.py:1031] (0/4) Epoch 3, batch 12000, loss[loss=0.2532, simple_loss=0.3379, pruned_loss=0.08424, over 16828.00 frames. ], tot_loss[loss=0.2604, simple_loss=0.3365, pruned_loss=0.09215, over 32683046.04 frames. ], batch size: 175, lr: 1.30e-02, grad_scale: 32.0 2023-10-10 00:40:29,999 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=183446.66666666666, ans=0.125 2023-10-10 00:40:30,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=183446.66666666666, ans=0.0 2023-10-10 00:40:40,288 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=183493.33333333334, ans=0.0 2023-10-10 00:40:41,179 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 00:40:41,813 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.544e+02 1.944e+02 2.254e+02 2.569e+02 3.650e+02, threshold=4.508e+02, percent-clipped=0.0 2023-10-10 00:41:03,911 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=183586.66666666666, ans=0.0 2023-10-10 00:41:06,631 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.83 vs. 
limit=22.5 2023-10-10 00:41:27,312 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=183680.0, ans=0.0 2023-10-10 00:41:31,620 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=183726.66666666666, ans=0.0 2023-10-10 00:41:41,597 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=183726.66666666666, ans=0.2 2023-10-10 00:41:43,347 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=183773.33333333334, ans=0.0 2023-10-10 00:41:43,495 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=183773.33333333334, ans=0.125 2023-10-10 00:41:50,270 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=183773.33333333334, ans=0.125 2023-10-10 00:41:51,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=183773.33333333334, ans=0.125 2023-10-10 00:42:05,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=183866.66666666666, ans=0.1 2023-10-10 00:42:07,351 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=183866.66666666666, ans=0.125 2023-10-10 00:42:33,752 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.775e+02 1.941e+02 2.163e+02 3.202e+02, threshold=3.881e+02, percent-clipped=0.0 2023-10-10 00:42:35,032 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=184006.66666666666, ans=0.2 2023-10-10 00:42:42,726 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=184006.66666666666, ans=0.125 2023-10-10 00:42:51,754 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=184053.33333333334, ans=0.125 2023-10-10 00:43:13,828 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=184146.66666666666, ans=0.0 2023-10-10 00:43:25,211 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=184193.33333333334, ans=0.0 2023-10-10 00:43:29,529 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=184240.0, ans=0.125 2023-10-10 00:43:59,202 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=184380.0, ans=0.0 2023-10-10 00:44:04,752 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.54 vs. limit=15.0 2023-10-10 00:44:08,312 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.86 vs. 
limit=10.0 2023-10-10 00:44:18,578 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.476e+02 1.994e+02 2.215e+02 2.724e+02 3.801e+02, threshold=4.431e+02, percent-clipped=0.0 2023-10-10 00:44:34,962 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.91 vs. limit=15.0 2023-10-10 00:44:37,203 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=184520.0, ans=0.0 2023-10-10 00:44:43,137 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=184566.66666666666, ans=0.125 2023-10-10 00:44:46,457 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=184566.66666666666, ans=0.0 2023-10-10 00:44:50,166 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=184566.66666666666, ans=0.125 2023-10-10 00:44:51,453 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=24.57 vs. limit=22.5 2023-10-10 00:44:57,783 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.90 vs. limit=15.0 2023-10-10 00:44:59,819 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=184613.33333333334, ans=0.2 2023-10-10 00:45:01,626 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=184613.33333333334, ans=10.0 2023-10-10 00:45:06,333 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=184660.0, ans=0.0 2023-10-10 00:45:08,308 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=184660.0, ans=0.0 2023-10-10 00:45:13,305 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=184706.66666666666, ans=0.1 2023-10-10 00:45:14,459 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.46 vs. limit=15.0 2023-10-10 00:45:19,933 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=184706.66666666666, ans=0.1 2023-10-10 00:45:20,034 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=184706.66666666666, ans=0.125 2023-10-10 00:45:29,799 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.66 vs. limit=15.0 2023-10-10 00:45:43,791 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.63 vs. 
limit=22.5 2023-10-10 00:46:08,944 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.414e+02 1.796e+02 2.074e+02 2.351e+02 3.276e+02, threshold=4.149e+02, percent-clipped=0.0 2023-10-10 00:46:11,089 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=184940.0, ans=0.1 2023-10-10 00:46:16,995 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.80 vs. limit=6.0 2023-10-10 00:46:18,111 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=184940.0, ans=0.125 2023-10-10 00:46:28,187 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=184986.66666666666, ans=0.125 2023-10-10 00:46:36,804 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=185033.33333333334, ans=0.125 2023-10-10 00:46:41,182 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=185033.33333333334, ans=0.0 2023-10-10 00:46:48,223 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=185080.0, ans=0.0 2023-10-10 00:46:57,054 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=185126.66666666666, ans=15.0 2023-10-10 00:47:13,711 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=185173.33333333334, ans=0.05 2023-10-10 00:47:16,340 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=16.38 vs. limit=15.0 2023-10-10 00:47:19,304 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=185220.0, ans=10.0 2023-10-10 00:47:46,462 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=185313.33333333334, ans=0.2 2023-10-10 00:47:48,823 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=185313.33333333334, ans=0.1 2023-10-10 00:47:49,678 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 00:48:04,292 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.849e+02 2.130e+02 2.520e+02 4.323e+02, threshold=4.259e+02, percent-clipped=1.0 2023-10-10 00:48:30,001 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=185500.0, ans=0.125 2023-10-10 00:48:36,570 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.74 vs. limit=10.0 2023-10-10 00:48:49,306 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=28.07 vs. 
limit=22.5 2023-10-10 00:49:32,164 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=185733.33333333334, ans=0.0 2023-10-10 00:49:38,510 INFO [train.py:1031] (0/4) Epoch 3, batch 12500, loss[loss=0.2581, simple_loss=0.3295, pruned_loss=0.0934, over 15993.00 frames. ], tot_loss[loss=0.2596, simple_loss=0.3357, pruned_loss=0.09182, over 32684937.01 frames. ], batch size: 43, lr: 1.29e-02, grad_scale: 32.0 2023-10-10 00:49:39,056 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.92 vs. limit=22.5 2023-10-10 00:49:51,358 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-10 00:49:53,127 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=185826.66666666666, ans=0.125 2023-10-10 00:49:59,006 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 1.901e+02 2.157e+02 2.436e+02 4.050e+02, threshold=4.315e+02, percent-clipped=0.0 2023-10-10 00:50:00,028 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_na.min_abs, batch_count=185873.33333333334, ans=0.02 2023-10-10 00:50:00,294 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.09 vs. limit=15.0 2023-10-10 00:50:32,248 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=186013.33333333334, ans=0.0 2023-10-10 00:50:39,151 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=186013.33333333334, ans=0.125 2023-10-10 00:50:45,605 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.18 vs. limit=15.0 2023-10-10 00:51:10,142 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=186153.33333333334, ans=0.125 2023-10-10 00:51:11,940 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=186153.33333333334, ans=0.0 2023-10-10 00:51:22,746 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=186200.0, ans=0.0 2023-10-10 00:51:33,251 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=186246.66666666666, ans=0.05 2023-10-10 00:51:45,293 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.330e+02 1.873e+02 2.084e+02 2.346e+02 3.459e+02, threshold=4.168e+02, percent-clipped=0.0 2023-10-10 00:51:53,691 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.88 vs. 
limit=15.0 2023-10-10 00:52:01,785 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.min_positive, batch_count=186386.66666666666, ans=0.025 2023-10-10 00:52:07,474 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=186386.66666666666, ans=0.125 2023-10-10 00:52:09,043 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=186433.33333333334, ans=0.0 2023-10-10 00:52:15,488 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=186433.33333333334, ans=0.125 2023-10-10 00:52:21,342 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=186480.0, ans=0.2 2023-10-10 00:52:35,821 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.52 vs. limit=22.5 2023-10-10 00:52:44,598 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=186573.33333333334, ans=0.125 2023-10-10 00:52:49,527 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.89 vs. limit=22.5 2023-10-10 00:52:51,951 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.66 vs. limit=6.0 2023-10-10 00:53:03,088 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-40000.pt 2023-10-10 00:53:14,026 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=186666.66666666666, ans=0.2 2023-10-10 00:53:17,653 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=186713.33333333334, ans=0.1 2023-10-10 00:53:17,678 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=186713.33333333334, ans=0.04949747468305833 2023-10-10 00:53:35,968 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.774e+02 2.060e+02 2.264e+02 3.097e+02, threshold=4.119e+02, percent-clipped=0.0 2023-10-10 00:53:45,583 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=186806.66666666666, ans=0.1 2023-10-10 00:54:09,265 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.17 vs. 
limit=15.0 2023-10-10 00:54:17,230 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=186946.66666666666, ans=15.0 2023-10-10 00:54:48,940 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1.whitening_limit, batch_count=187086.66666666666, ans=10.0 2023-10-10 00:54:50,518 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=187086.66666666666, ans=0.125 2023-10-10 00:54:50,587 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=187086.66666666666, ans=0.07 2023-10-10 00:54:57,465 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=187133.33333333334, ans=0.2 2023-10-10 00:55:00,446 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=187133.33333333334, ans=0.125 2023-10-10 00:55:05,273 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=187180.0, ans=0.2 2023-10-10 00:55:23,170 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.950e+02 2.197e+02 2.480e+02 3.980e+02, threshold=4.394e+02, percent-clipped=0.0 2023-10-10 00:55:24,502 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=187273.33333333334, ans=0.0 2023-10-10 00:55:32,657 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 00:55:38,566 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=187320.0, ans=0.0 2023-10-10 00:55:42,318 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=187320.0, ans=0.0 2023-10-10 00:55:45,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=187366.66666666666, ans=0.1 2023-10-10 00:55:48,999 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.17 vs. limit=22.5 2023-10-10 00:55:50,740 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.43 vs. 
limit=15.0 2023-10-10 00:56:07,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=187460.0, ans=0.125 2023-10-10 00:56:09,345 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=187460.0, ans=0.125 2023-10-10 00:56:15,117 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=187460.0, ans=0.0 2023-10-10 00:56:15,946 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=187460.0, ans=0.035 2023-10-10 00:56:38,625 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=187553.33333333334, ans=0.125 2023-10-10 00:56:40,464 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=187553.33333333334, ans=0.125 2023-10-10 00:56:47,088 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=187600.0, ans=0.09899494936611666 2023-10-10 00:56:49,011 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=187600.0, ans=0.0 2023-10-10 00:56:58,047 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.39 vs. limit=15.0 2023-10-10 00:57:08,775 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.77 vs. limit=12.0 2023-10-10 00:57:11,860 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=187693.33333333334, ans=0.1 2023-10-10 00:57:12,493 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.906e+02 2.199e+02 2.559e+02 3.758e+02, threshold=4.397e+02, percent-clipped=0.0 2023-10-10 00:57:12,843 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=187740.0, ans=0.0 2023-10-10 00:57:23,718 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=187786.66666666666, ans=0.125 2023-10-10 00:57:23,720 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=187786.66666666666, ans=0.125 2023-10-10 00:57:25,888 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.17 vs. 
limit=15.0 2023-10-10 00:57:31,701 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=187786.66666666666, ans=0.125 2023-10-10 00:57:42,246 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=187833.33333333334, ans=0.125 2023-10-10 00:57:50,431 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=187880.0, ans=0.125 2023-10-10 00:58:10,293 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 00:58:16,896 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=188020.0, ans=0.125 2023-10-10 00:58:24,146 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=188020.0, ans=0.07 2023-10-10 00:58:33,880 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=188066.66666666666, ans=0.125 2023-10-10 00:58:38,057 INFO [train.py:1031] (0/4) Epoch 3, batch 13000, loss[loss=0.3378, simple_loss=0.3764, pruned_loss=0.1496, over 15603.00 frames. ], tot_loss[loss=0.2596, simple_loss=0.336, pruned_loss=0.0916, over 32725036.96 frames. ], batch size: 350, lr: 1.29e-02, grad_scale: 32.0 2023-10-10 00:58:45,297 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=188113.33333333334, ans=0.0 2023-10-10 00:58:57,853 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 1.813e+02 2.076e+02 2.415e+02 3.629e+02, threshold=4.152e+02, percent-clipped=0.0 2023-10-10 00:59:30,340 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.08 vs. limit=22.5 2023-10-10 00:59:34,630 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 00:59:49,489 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=188346.66666666666, ans=0.125 2023-10-10 00:59:54,574 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=188393.33333333334, ans=0.0 2023-10-10 00:59:54,856 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.64 vs. 
limit=22.5 2023-10-10 00:59:58,526 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=188393.33333333334, ans=0.125 2023-10-10 01:00:09,268 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=188440.0, ans=0.125 2023-10-10 01:00:10,493 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=188440.0, ans=0.125 2023-10-10 01:00:11,945 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=188440.0, ans=0.5 2023-10-10 01:00:20,927 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=188486.66666666666, ans=0.05 2023-10-10 01:00:58,879 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.474e+02 1.808e+02 2.001e+02 2.367e+02 2.873e+02, threshold=4.002e+02, percent-clipped=0.0 2023-10-10 01:01:16,111 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=188720.0, ans=0.125 2023-10-10 01:01:21,695 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=188766.66666666666, ans=0.125 2023-10-10 01:01:39,697 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=188813.33333333334, ans=0.125 2023-10-10 01:02:01,478 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=188906.66666666666, ans=0.1 2023-10-10 01:02:19,126 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.05 vs. limit=15.0 2023-10-10 01:02:23,646 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=189000.0, ans=0.125 2023-10-10 01:02:23,753 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=189000.0, ans=0.125 2023-10-10 01:02:52,940 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.564e+02 1.857e+02 2.173e+02 2.492e+02 3.491e+02, threshold=4.346e+02, percent-clipped=0.0 2023-10-10 01:02:53,233 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=189140.0, ans=0.0 2023-10-10 01:02:57,885 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.71 vs. 
limit=22.5 2023-10-10 01:03:15,449 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=189233.33333333334, ans=0.125 2023-10-10 01:03:21,714 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=189233.33333333334, ans=0.2 2023-10-10 01:03:24,298 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=189233.33333333334, ans=0.2 2023-10-10 01:03:26,205 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=189280.0, ans=0.125 2023-10-10 01:03:34,591 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=189280.0, ans=0.07 2023-10-10 01:03:55,653 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=189373.33333333334, ans=0.2 2023-10-10 01:04:18,179 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=189466.66666666666, ans=0.125 2023-10-10 01:04:22,003 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=189513.33333333334, ans=0.2 2023-10-10 01:04:37,686 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=189560.0, ans=0.125 2023-10-10 01:04:42,505 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.551e+02 1.925e+02 2.140e+02 2.499e+02 3.876e+02, threshold=4.281e+02, percent-clipped=0.0 2023-10-10 01:05:01,355 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=189653.33333333334, ans=0.1 2023-10-10 01:05:08,666 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=189700.0, ans=0.125 2023-10-10 01:05:11,203 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=189700.0, ans=0.125 2023-10-10 01:05:17,848 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.09 vs. 
limit=22.5 2023-10-10 01:05:29,739 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=189793.33333333334, ans=0.5 2023-10-10 01:05:32,884 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=189793.33333333334, ans=0.1 2023-10-10 01:05:44,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=189840.0, ans=0.125 2023-10-10 01:05:50,366 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=189886.66666666666, ans=0.0 2023-10-10 01:06:00,258 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=189933.33333333334, ans=0.125 2023-10-10 01:06:35,852 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.543e+02 1.900e+02 2.146e+02 2.486e+02 3.706e+02, threshold=4.292e+02, percent-clipped=0.0 2023-10-10 01:06:46,601 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=190120.0, ans=0.05 2023-10-10 01:06:57,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=190166.66666666666, ans=0.125 2023-10-10 01:07:02,305 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=190166.66666666666, ans=0.1 2023-10-10 01:07:04,112 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=190166.66666666666, ans=0.0 2023-10-10 01:07:05,220 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=190166.66666666666, ans=0.0 2023-10-10 01:07:11,666 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=190213.33333333334, ans=0.125 2023-10-10 01:07:34,870 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.01 vs. limit=8.0 2023-10-10 01:07:36,632 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=190306.66666666666, ans=0.0 2023-10-10 01:07:36,681 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=190306.66666666666, ans=0.07 2023-10-10 01:07:46,591 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=190353.33333333334, ans=0.125 2023-10-10 01:08:03,990 INFO [train.py:1031] (0/4) Epoch 3, batch 13500, loss[loss=0.2209, simple_loss=0.3122, pruned_loss=0.06482, over 16906.00 frames. ], tot_loss[loss=0.2585, simple_loss=0.335, pruned_loss=0.09104, over 32758373.94 frames. ], batch size: 104, lr: 1.28e-02, grad_scale: 32.0 2023-10-10 01:08:24,150 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.721e+02 1.948e+02 2.320e+02 4.142e+02, threshold=3.895e+02, percent-clipped=0.0 2023-10-10 01:08:32,835 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.89 vs. 
limit=15.0 2023-10-10 01:08:33,536 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=190540.0, ans=0.0 2023-10-10 01:08:48,786 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=190633.33333333334, ans=0.1 2023-10-10 01:08:49,693 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=190633.33333333334, ans=0.125 2023-10-10 01:08:54,367 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=190633.33333333334, ans=0.125 2023-10-10 01:08:59,595 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=190680.0, ans=0.02 2023-10-10 01:09:05,544 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=190680.0, ans=0.0 2023-10-10 01:09:09,900 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=190726.66666666666, ans=0.0 2023-10-10 01:09:20,002 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.70 vs. limit=22.5 2023-10-10 01:09:21,205 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=190773.33333333334, ans=0.2 2023-10-10 01:09:43,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=190866.66666666666, ans=0.125 2023-10-10 01:10:02,613 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.72 vs. 
limit=15.0 2023-10-10 01:10:06,126 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=190960.0, ans=0.0 2023-10-10 01:10:07,872 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=190960.0, ans=0.1 2023-10-10 01:10:08,887 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=190960.0, ans=0.05 2023-10-10 01:10:10,243 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.421e+02 1.842e+02 2.047e+02 2.375e+02 3.850e+02, threshold=4.094e+02, percent-clipped=0.0 2023-10-10 01:10:15,655 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=191006.66666666666, ans=0.0 2023-10-10 01:10:16,432 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=191006.66666666666, ans=0.125 2023-10-10 01:10:22,479 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=191053.33333333334, ans=0.125 2023-10-10 01:10:26,387 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=191053.33333333334, ans=0.125 2023-10-10 01:10:27,390 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=191100.0, ans=0.1 2023-10-10 01:10:30,471 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=191100.0, ans=0.1 2023-10-10 01:10:33,805 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=191100.0, ans=0.125 2023-10-10 01:10:35,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=191100.0, ans=0.0 2023-10-10 01:10:38,214 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=191146.66666666666, ans=0.125 2023-10-10 01:10:39,217 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=191146.66666666666, ans=0.125 2023-10-10 01:10:42,525 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/epoch-3.pt 2023-10-10 01:11:11,907 INFO [train.py:1031] (0/4) Epoch 4, batch 0, loss[loss=0.2273, simple_loss=0.3044, pruned_loss=0.07515, over 16837.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3044, pruned_loss=0.07515, over 16837.00 frames. ], batch size: 188, lr: 1.07e-02, grad_scale: 32.0 2023-10-10 01:11:11,908 INFO [train.py:1054] (0/4) Computing validation loss 2023-10-10 01:11:20,236 INFO [train.py:1063] (0/4) Epoch 4, validation: loss=0.2505, simple_loss=0.3358, pruned_loss=0.08267, over 1020973.00 frames. 2023-10-10 01:11:20,237 INFO [train.py:1064] (0/4) Maximum memory allocated so far is 17165MB 2023-10-10 01:11:40,039 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.14 vs. 
limit=10.0 2023-10-10 01:11:41,675 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=191263.33333333334, ans=0.2 2023-10-10 01:11:46,234 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=191263.33333333334, ans=0.1 2023-10-10 01:11:47,018 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=191263.33333333334, ans=0.125 2023-10-10 01:11:49,225 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=191263.33333333334, ans=0.2 2023-10-10 01:11:55,783 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=191310.0, ans=0.1 2023-10-10 01:12:08,476 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=191356.66666666666, ans=0.2 2023-10-10 01:12:34,526 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.441e+02 1.686e+02 1.901e+02 2.036e+02 2.676e+02, threshold=3.802e+02, percent-clipped=0.0 2023-10-10 01:12:43,207 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 01:12:50,482 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=191496.66666666666, ans=0.125 2023-10-10 01:12:57,868 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=191543.33333333334, ans=0.0 2023-10-10 01:13:04,652 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 01:13:07,662 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=191590.0, ans=0.2 2023-10-10 01:13:20,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=191636.66666666666, ans=0.07 2023-10-10 01:13:21,141 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=191636.66666666666, ans=0.0 2023-10-10 01:13:23,940 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=191636.66666666666, ans=0.0 2023-10-10 01:13:37,090 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 01:13:40,160 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=191730.0, ans=0.0 2023-10-10 01:13:46,189 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=191730.0, ans=0.1 2023-10-10 01:13:49,170 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=191776.66666666666, ans=10.0 2023-10-10 01:14:16,924 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=191870.0, ans=0.2 2023-10-10 01:14:25,508 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.367e+02 1.766e+02 1.944e+02 2.177e+02 3.808e+02, threshold=3.888e+02, percent-clipped=1.0 2023-10-10 
01:14:34,069 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=191963.33333333334, ans=0.0 2023-10-10 01:14:37,662 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=191963.33333333334, ans=0.0 2023-10-10 01:14:57,703 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 01:14:59,421 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=192056.66666666666, ans=0.125 2023-10-10 01:14:59,442 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=192056.66666666666, ans=0.125 2023-10-10 01:15:02,375 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.54 vs. limit=15.0 2023-10-10 01:15:08,542 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=192103.33333333334, ans=0.125 2023-10-10 01:15:13,281 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=192150.0, ans=0.1 2023-10-10 01:15:13,384 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=192150.0, ans=0.1 2023-10-10 01:15:39,026 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=192243.33333333334, ans=0.1 2023-10-10 01:15:48,703 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=192290.0, ans=0.1 2023-10-10 01:15:59,891 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=192290.0, ans=0.125 2023-10-10 01:16:18,860 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.763e+02 1.957e+02 2.307e+02 3.327e+02, threshold=3.913e+02, percent-clipped=0.0 2023-10-10 01:16:21,993 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.71 vs. limit=15.0 2023-10-10 01:16:24,755 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=192430.0, ans=0.125 2023-10-10 01:16:31,261 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.97 vs. limit=15.0 2023-10-10 01:16:36,948 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 01:16:40,809 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=192476.66666666666, ans=0.0 2023-10-10 01:16:58,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=192570.0, ans=0.1 2023-10-10 01:17:03,855 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.61 vs. 
limit=6.0 2023-10-10 01:17:06,390 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=192570.0, ans=0.09899494936611666 2023-10-10 01:17:08,798 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.55 vs. limit=15.0 2023-10-10 01:17:35,830 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 01:17:39,264 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=192710.0, ans=0.125 2023-10-10 01:17:50,763 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=192756.66666666666, ans=0.1 2023-10-10 01:18:00,510 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=192803.33333333334, ans=0.0 2023-10-10 01:18:08,230 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.59 vs. limit=22.5 2023-10-10 01:18:08,399 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.321e+02 1.714e+02 2.037e+02 2.314e+02 3.392e+02, threshold=4.074e+02, percent-clipped=0.0 2023-10-10 01:18:11,465 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=192850.0, ans=0.125 2023-10-10 01:18:17,909 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.02 vs. limit=10.0 2023-10-10 01:18:22,353 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=192896.66666666666, ans=0.1 2023-10-10 01:19:19,825 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=193176.66666666666, ans=0.0 2023-10-10 01:19:35,669 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=193223.33333333334, ans=0.1 2023-10-10 01:19:51,301 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.38 vs. limit=22.5 2023-10-10 01:19:59,352 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.949e+02 2.312e+02 2.674e+02 4.571e+02, threshold=4.624e+02, percent-clipped=1.0 2023-10-10 01:20:14,438 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=193410.0, ans=0.125 2023-10-10 01:20:17,698 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=193410.0, ans=0.125 2023-10-10 01:20:20,889 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=193410.0, ans=0.2 2023-10-10 01:20:38,620 INFO [train.py:1031] (0/4) Epoch 4, batch 500, loss[loss=0.2144, simple_loss=0.3027, pruned_loss=0.06309, over 16615.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3319, pruned_loss=0.08841, over 7305084.14 frames. 
], batch size: 66, lr: 1.07e-02, grad_scale: 32.0 2023-10-10 01:20:40,499 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=193503.33333333334, ans=0.125 2023-10-10 01:20:47,400 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=193503.33333333334, ans=0.0 2023-10-10 01:20:50,210 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=193550.0, ans=10.0 2023-10-10 01:21:08,004 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 01:21:11,838 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.01 vs. limit=22.5 2023-10-10 01:21:15,295 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=193643.33333333334, ans=0.0 2023-10-10 01:21:15,664 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.02 vs. limit=15.0 2023-10-10 01:21:45,497 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.64 vs. limit=15.0 2023-10-10 01:21:50,338 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.530e+02 1.814e+02 1.997e+02 2.239e+02 3.318e+02, threshold=3.995e+02, percent-clipped=0.0 2023-10-10 01:22:06,538 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=193876.66666666666, ans=0.125 2023-10-10 01:22:07,411 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=193876.66666666666, ans=0.125 2023-10-10 01:22:09,814 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=193876.66666666666, ans=0.1 2023-10-10 01:22:20,952 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=193923.33333333334, ans=0.125 2023-10-10 01:22:53,585 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=194063.33333333334, ans=0.0 2023-10-10 01:23:31,610 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.85 vs. limit=22.5 2023-10-10 01:23:37,994 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=194250.0, ans=0.1 2023-10-10 01:23:39,599 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.478e+02 1.712e+02 1.980e+02 2.236e+02 4.011e+02, threshold=3.960e+02, percent-clipped=1.0 2023-10-10 01:23:43,304 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=194250.0, ans=0.125 2023-10-10 01:24:04,432 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.07 vs. 
limit=6.0 2023-10-10 01:24:18,242 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.06 vs. limit=15.0 2023-10-10 01:24:40,253 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=194530.0, ans=0.2 2023-10-10 01:24:46,232 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.55 vs. limit=15.0 2023-10-10 01:24:51,011 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=194576.66666666666, ans=0.125 2023-10-10 01:25:06,341 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=194623.33333333334, ans=0.0 2023-10-10 01:25:32,806 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.458e+02 1.786e+02 1.920e+02 2.132e+02 2.991e+02, threshold=3.839e+02, percent-clipped=0.0 2023-10-10 01:25:43,167 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=194763.33333333334, ans=0.2 2023-10-10 01:25:48,499 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=194810.0, ans=0.0 2023-10-10 01:25:48,518 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=194810.0, ans=0.125 2023-10-10 01:26:22,788 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=194950.0, ans=0.1 2023-10-10 01:26:36,076 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=194996.66666666666, ans=0.1 2023-10-10 01:26:49,412 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=195043.33333333334, ans=0.125 2023-10-10 01:27:14,771 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=195136.66666666666, ans=0.125 2023-10-10 01:27:21,221 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=195183.33333333334, ans=0.125 2023-10-10 01:27:27,927 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.428e+02 1.797e+02 1.968e+02 2.163e+02 3.994e+02, threshold=3.937e+02, percent-clipped=1.0 2023-10-10 01:27:39,375 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=195230.0, ans=0.125 2023-10-10 01:27:39,512 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.47 vs. 
limit=15.0 2023-10-10 01:27:51,814 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=195276.66666666666, ans=0.0 2023-10-10 01:28:11,240 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=195370.0, ans=0.1 2023-10-10 01:28:22,246 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=195416.66666666666, ans=0.0 2023-10-10 01:28:30,959 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=195463.33333333334, ans=0.125 2023-10-10 01:28:32,796 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=195463.33333333334, ans=0.035 2023-10-10 01:28:58,160 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.27 vs. limit=15.0 2023-10-10 01:29:11,866 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 01:29:11,976 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=195603.33333333334, ans=0.2 2023-10-10 01:29:21,489 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.341e+02 1.657e+02 1.922e+02 2.206e+02 3.120e+02, threshold=3.843e+02, percent-clipped=0.0 2023-10-10 01:29:21,895 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=195650.0, ans=0.125 2023-10-10 01:29:22,804 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=195650.0, ans=0.125 2023-10-10 01:29:48,265 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.65 vs. limit=15.0 2023-10-10 01:29:58,423 INFO [train.py:1031] (0/4) Epoch 4, batch 1000, loss[loss=0.2011, simple_loss=0.3004, pruned_loss=0.05091, over 16809.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.3316, pruned_loss=0.08775, over 12959273.00 frames. ], batch size: 98, lr: 1.06e-02, grad_scale: 32.0 2023-10-10 01:30:23,093 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=195930.0, ans=0.125 2023-10-10 01:30:43,324 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=196023.33333333334, ans=0.125 2023-10-10 01:31:05,110 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=196116.66666666666, ans=0.0 2023-10-10 01:31:07,717 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.455e+02 1.783e+02 2.000e+02 2.205e+02 3.095e+02, threshold=4.001e+02, percent-clipped=0.0 2023-10-10 01:31:08,489 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.15 vs. limit=6.0 2023-10-10 01:31:10,836 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.05 vs. 
limit=15.0 2023-10-10 01:31:52,399 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=196303.33333333334, ans=0.2 2023-10-10 01:31:55,316 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=196350.0, ans=0.125 2023-10-10 01:32:08,656 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=196396.66666666666, ans=0.0 2023-10-10 01:32:11,168 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=196396.66666666666, ans=0.125 2023-10-10 01:32:13,848 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=196396.66666666666, ans=0.0 2023-10-10 01:32:22,429 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=196443.33333333334, ans=0.125 2023-10-10 01:32:32,424 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.54 vs. limit=6.0 2023-10-10 01:32:40,572 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.44 vs. limit=15.0 2023-10-10 01:32:43,772 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=196536.66666666666, ans=0.125 2023-10-10 01:32:55,670 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=196583.33333333334, ans=0.0 2023-10-10 01:33:02,251 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.361e+02 1.783e+02 1.989e+02 2.444e+02 3.708e+02, threshold=3.978e+02, percent-clipped=0.0 2023-10-10 01:33:02,956 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=196583.33333333334, ans=0.125 2023-10-10 01:34:01,131 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=196816.66666666666, ans=0.0 2023-10-10 01:34:09,412 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=196863.33333333334, ans=0.0 2023-10-10 01:34:39,798 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.33 vs. limit=15.0 2023-10-10 01:34:53,868 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=197050.0, ans=0.0 2023-10-10 01:35:00,493 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.440e+02 1.679e+02 1.904e+02 2.231e+02 3.118e+02, threshold=3.808e+02, percent-clipped=0.0 2023-10-10 01:35:02,471 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=197050.0, ans=0.125 2023-10-10 01:35:07,577 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.23 vs. 
2023-10-10 01:35:13,901 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=197096.66666666666, ans=0.1
2023-10-10 01:35:14,113 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.37 vs. limit=6.0
2023-10-10 01:35:31,230 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=197190.0, ans=0.125
2023-10-10 01:35:32,147 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=197190.0, ans=0.5
2023-10-10 01:35:32,180 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=197190.0, ans=0.95
2023-10-10 01:35:46,127 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=197236.66666666666, ans=0.2
2023-10-10 01:36:03,021 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=197330.0, ans=0.1
2023-10-10 01:36:06,176 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=197330.0, ans=0.1
2023-10-10 01:36:06,527 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=4.90 vs. limit=15.0
2023-10-10 01:36:09,698 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=197330.0, ans=0.125
2023-10-10 01:36:15,062 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=197376.66666666666, ans=0.0
2023-10-10 01:36:17,922 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=197376.66666666666, ans=0.125
2023-10-10 01:36:18,857 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=197376.66666666666, ans=0.1
2023-10-10 01:36:40,283 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=197470.0, ans=0.125
2023-10-10 01:36:46,357 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=197516.66666666666, ans=0.0
2023-10-10 01:36:47,550 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.59 vs. limit=15.0
2023-10-10 01:36:49,713 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.701e+02 1.863e+02 2.113e+02 3.574e+02, threshold=3.725e+02, percent-clipped=0.0
2023-10-10 01:36:51,703 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=197516.66666666666, ans=0.125
2023-10-10 01:37:36,101 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=197703.33333333334, ans=0.0
2023-10-10 01:37:38,892 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=197703.33333333334, ans=0.0
2023-10-10 01:38:00,109 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=197843.33333333334, ans=0.1
2023-10-10 01:38:03,911 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=197843.33333333334, ans=0.125
2023-10-10 01:38:14,838 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=197890.0, ans=0.0
2023-10-10 01:38:25,655 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.03 vs. limit=10.0
2023-10-10 01:38:37,307 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=197936.66666666666, ans=0.125
2023-10-10 01:38:41,554 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.80 vs. limit=6.0
2023-10-10 01:38:44,940 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.357e+02 1.709e+02 1.896e+02 2.165e+02 3.150e+02, threshold=3.793e+02, percent-clipped=0.0
2023-10-10 01:39:12,069 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.37 vs. limit=15.0
2023-10-10 01:39:27,845 INFO [train.py:1031] (0/4) Epoch 4, batch 1500, loss[loss=0.233, simple_loss=0.3213, pruned_loss=0.07232, over 16811.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3286, pruned_loss=0.08576, over 17360022.67 frames. ], batch size: 98, lr: 1.06e-02, grad_scale: 32.0
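The loss values in these [train.py:1031] summaries are internally consistent with the printed total being a weighted combination of the simple and pruned transducer losses, tot_loss = 0.5 * simple_loss + pruned_loss. The 0.5 factor is inferred from the numbers in this log, not quoted from train.py; a quick check:

```python
# Verify the apparent relation tot_loss = 0.5 * simple_loss + pruned_loss
# against the batch summaries logged above.
for loss, simple, pruned in [
    (0.2536, 0.3316, 0.08775),  # Epoch 4, batch 1000
    (0.2501, 0.3286, 0.08576),  # Epoch 4, batch 1500
]:
    assert abs(0.5 * simple + pruned - loss) < 5e-4
```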
2023-10-10 01:39:28,148 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=198170.0, ans=0.1
2023-10-10 01:39:42,677 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=198216.66666666666, ans=0.125
2023-10-10 01:39:55,350 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=198263.33333333334, ans=0.0
2023-10-10 01:39:55,421 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-10-10 01:39:56,296 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=198263.33333333334, ans=0.125
2023-10-10 01:40:23,558 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=198403.33333333334, ans=0.0
2023-10-10 01:40:40,335 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.382e+02 1.717e+02 1.867e+02 2.074e+02 2.969e+02, threshold=3.735e+02, percent-clipped=0.0
2023-10-10 01:40:43,307 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.90 vs. limit=15.0
2023-10-10 01:40:44,522 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=198450.0, ans=0.07
2023-10-10 01:41:07,027 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=198543.33333333334, ans=15.0
2023-10-10 01:41:40,841 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=198730.0, ans=0.125
2023-10-10 01:41:47,101 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.36 vs. limit=15.0
2023-10-10 01:42:14,415 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=198823.33333333334, ans=0.1
2023-10-10 01:42:16,302 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-10-10 01:42:27,517 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=198870.0, ans=0.125
2023-10-10 01:42:31,611 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=198916.66666666666, ans=0.5
2023-10-10 01:42:31,758 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=198916.66666666666, ans=0.125
2023-10-10 01:42:36,524 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.380e+02 1.771e+02 1.942e+02 2.224e+02 3.344e+02, threshold=3.883e+02, percent-clipped=0.0
2023-10-10 01:42:38,749 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=198916.66666666666, ans=0.2
2023-10-10 01:42:38,801 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=198916.66666666666, ans=0.2
2023-10-10 01:42:48,947 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. limit=6.0
2023-10-10 01:42:51,910 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=198963.33333333334, ans=0.125
2023-10-10 01:43:12,400 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=199056.66666666666, ans=0.125
2023-10-10 01:43:24,987 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=199150.0, ans=0.125
2023-10-10 01:43:41,160 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.48 vs. limit=12.0
2023-10-10 01:44:25,232 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.451e+02 1.785e+02 1.983e+02 2.316e+02 3.126e+02, threshold=3.966e+02, percent-clipped=0.0
2023-10-10 01:44:56,867 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=199523.33333333334, ans=0.125
2023-10-10 01:45:02,193 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=199523.33333333334, ans=0.125
2023-10-10 01:45:04,934 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=199523.33333333334, ans=0.125
2023-10-10 01:45:07,944 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=199570.0, ans=0.0
2023-10-10 01:45:15,927 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=199570.0, ans=0.125
2023-10-10 01:45:18,982 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=199616.66666666666, ans=0.125
2023-10-10 01:45:21,860 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten.whitening_limit, batch_count=199616.66666666666, ans=15.0
2023-10-10 01:45:25,476 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=199616.66666666666, ans=0.125
2023-10-10 01:45:51,176 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=199710.0, ans=0.09899494936611666
2023-10-10 01:46:00,322 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.60 vs. limit=10.0
2023-10-10 01:46:12,082 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=199803.33333333334, ans=0.0
2023-10-10 01:46:22,900 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.329e+02 1.687e+02 1.856e+02 2.154e+02 3.052e+02, threshold=3.711e+02, percent-clipped=0.0
2023-10-10 01:46:38,354 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=199943.33333333334, ans=0.0
2023-10-10 01:47:25,229 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=200130.0, ans=0.125
2023-10-10 01:47:56,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=200223.33333333334, ans=0.1
2023-10-10 01:48:25,064 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.379e+02 1.783e+02 1.928e+02 2.163e+02 3.381e+02, threshold=3.856e+02, percent-clipped=0.0
2023-10-10 01:48:34,442 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=200363.33333333334, ans=0.1
2023-10-10 01:48:35,575 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=200363.33333333334, ans=0.05
2023-10-10 01:49:06,887 INFO [train.py:1031] (0/4) Epoch 4, batch 2000, loss[loss=0.259, simple_loss=0.3353, pruned_loss=0.09132, over 16091.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3284, pruned_loss=0.0853, over 20787882.43 frames. ], batch size: 296, lr: 1.05e-02, grad_scale: 64.0
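The grad_scale value in the batch summaries moves in powers of two (32.0 at batch 1500, 64.0 here, later 16.0), which is how a dynamic fp16 loss scaler behaves: the scale grows after a run of finite steps and is halved when an overflow is detected. A generic PyTorch sketch of that pattern (the training script may implement its own scaler; names here are illustrative):

```python
import torch


def train_step(model, optimizer, scaler: torch.cuda.amp.GradScaler, batch):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # fp16 forward pass
        loss = model(batch)
    scaler.scale(loss).backward()     # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)            # unscales grads; skips the step on inf/nan
    scaler.update()                   # grows or halves the scale dynamically
    return scaler.get_scale()         # the quantity logged as grad_scale
```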
2023-10-10 01:49:19,396 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-10-10 01:49:34,688 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=200596.66666666666, ans=0.0
2023-10-10 01:49:40,859 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=200596.66666666666, ans=0.125
2023-10-10 01:49:49,725 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.09 vs. limit=10.0
2023-10-10 01:50:29,100 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.31 vs. limit=15.0
2023-10-10 01:50:29,473 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.460e+02 1.747e+02 1.954e+02 2.274e+02 3.166e+02, threshold=3.908e+02, percent-clipped=0.0
2023-10-10 01:50:38,554 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=200830.0, ans=0.125
2023-10-10 01:50:39,475 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=200830.0, ans=0.125
2023-10-10 01:51:01,429 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=200923.33333333334, ans=0.125
2023-10-10 01:51:17,345 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_ff3.min_abs, batch_count=200970.0, ans=0.2
2023-10-10 01:51:19,221 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=200970.0, ans=0.2
2023-10-10 01:51:27,899 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.28 vs. limit=15.0
2023-10-10 01:51:41,958 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.59 vs. limit=6.0
2023-10-10 01:51:49,969 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=201063.33333333334, ans=0.0
2023-10-10 01:52:02,852 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=201110.0, ans=0.2
2023-10-10 01:52:03,625 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=201110.0, ans=0.125
2023-10-10 01:52:23,530 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=201156.66666666666, ans=0.0
2023-10-10 01:52:33,351 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.59 vs. limit=12.0
2023-10-10 01:52:46,813 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.786e+02 2.055e+02 2.517e+02 4.090e+02, threshold=4.109e+02, percent-clipped=1.0
2023-10-10 01:53:02,200 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=201296.66666666666, ans=0.0
2023-10-10 01:53:20,364 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=201390.0, ans=0.125
2023-10-10 01:53:21,336 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.08 vs. limit=15.0
2023-10-10 01:53:23,282 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-10 01:53:31,294 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=201436.66666666666, ans=0.0
2023-10-10 01:53:46,254 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=201483.33333333334, ans=10.0
2023-10-10 01:53:55,698 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.56 vs. limit=15.0
2023-10-10 01:54:04,161 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=201576.66666666666, ans=0.0
2023-10-10 01:54:08,099 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=201576.66666666666, ans=0.125
2023-10-10 01:54:12,849 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=201623.33333333334, ans=0.0
2023-10-10 01:54:15,452 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.08 vs. limit=15.0
2023-10-10 01:54:18,793 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=201623.33333333334, ans=0.0
2023-10-10 01:54:39,313 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.812e+02 2.033e+02 2.373e+02 3.202e+02, threshold=4.066e+02, percent-clipped=0.0
2023-10-10 01:54:40,842 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.22 vs. limit=15.0
2023-10-10 01:54:43,697 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=201763.33333333334, ans=0.0
2023-10-10 01:55:10,541 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=201856.66666666666, ans=0.0
2023-10-10 01:55:20,136 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.71 vs. limit=15.0
2023-10-10 01:55:21,666 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=201903.33333333334, ans=0.125
2023-10-10 01:55:46,149 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=201996.66666666666, ans=0.1
2023-10-10 01:55:47,222 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=201996.66666666666, ans=0.125
2023-10-10 01:56:07,873 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.63 vs. limit=15.0
2023-10-10 01:56:14,998 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=202136.66666666666, ans=0.125
2023-10-10 01:56:15,033 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=202136.66666666666, ans=0.125
2023-10-10 01:56:18,574 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=202136.66666666666, ans=0.125
2023-10-10 01:56:22,098 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.82 vs. limit=15.0
2023-10-10 01:56:26,064 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=202183.33333333334, ans=0.1
2023-10-10 01:56:26,096 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=202183.33333333334, ans=0.2
2023-10-10 01:56:31,173 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.490e+02 1.800e+02 2.097e+02 2.321e+02 3.365e+02, threshold=4.194e+02, percent-clipped=0.0
2023-10-10 01:56:40,883 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=202230.0, ans=0.04949747468305833
2023-10-10 01:56:42,601 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.41 vs. limit=22.5
2023-10-10 01:56:45,469 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=202276.66666666666, ans=0.1
2023-10-10 01:56:54,876 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=202276.66666666666, ans=0.035
2023-10-10 01:56:55,834 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=202276.66666666666, ans=0.1
2023-10-10 01:57:03,256 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=202323.33333333334, ans=0.125
2023-10-10 01:57:08,553 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.70 vs. limit=15.0
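The [scaling.py:979] Whitening lines compare a measured statistic of a layer's activations ("metric") against a scheduled limit, over num_groups groups of channels; the message is only emitted for diagnosis, and the corresponding constraint becomes active when the metric exceeds the limit. A sketch of the kind of statistic such a module might report, assuming it measures how far the feature covariance is from a multiple of the identity (1.0 for perfectly "white" features); the exact formula in scaling.py may differ, so this is an illustration only:

```python
import torch


def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
    """x: (num_frames, num_channels), channels divisible by num_groups.

    Returns a value >= 1 that equals 1 iff each group's covariance is c*I.
    """
    x = x.reshape(x.shape[0], num_groups, x.shape[-1] // num_groups)
    metrics = []
    for g in range(num_groups):
        feats = x[:, g, :] - x[:, g, :].mean(dim=0)
        cov = feats.T @ feats / feats.shape[0]
        eigs = torch.linalg.eigvalsh(cov)
        # ratio of second moment to squared first moment of the eigenvalues
        metrics.append((eigs**2).mean() / eigs.mean() ** 2)
    return torch.stack(metrics).mean()
```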
2023-10-10 01:57:11,610 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=202370.0, ans=0.125
2023-10-10 01:57:17,708 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.58 vs. limit=10.0
2023-10-10 01:57:30,705 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=202463.33333333334, ans=0.2
2023-10-10 01:57:40,021 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.92 vs. limit=15.0
2023-10-10 01:57:48,447 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=202556.66666666666, ans=0.1
2023-10-10 01:57:51,045 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=202556.66666666666, ans=0.125
2023-10-10 01:57:56,273 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=202556.66666666666, ans=0.125
2023-10-10 01:58:10,963 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=202650.0, ans=0.0
2023-10-10 01:58:17,285 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.480e+02 1.821e+02 2.148e+02 2.337e+02 3.245e+02, threshold=4.295e+02, percent-clipped=0.0
2023-10-10 01:58:35,018 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=202743.33333333334, ans=0.125
2023-10-10 01:58:39,483 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=202743.33333333334, ans=0.1
2023-10-10 01:58:41,144 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.35 vs. limit=6.0
2023-10-10 01:58:41,895 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=202790.0, ans=0.0
2023-10-10 01:58:42,826 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=202790.0, ans=0.2
2023-10-10 01:58:48,858 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.58 vs. limit=15.0
2023-10-10 01:58:51,022 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=202790.0, ans=0.2
2023-10-10 01:58:52,504 INFO [train.py:1031] (0/4) Epoch 4, batch 2500, loss[loss=0.2241, simple_loss=0.282, pruned_loss=0.08307, over 12612.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3288, pruned_loss=0.08583, over 23427851.18 frames. ], batch size: 440, lr: 1.04e-02, grad_scale: 32.0
2023-10-10 01:58:52,766 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=202836.66666666666, ans=0.125
2023-10-10 01:58:56,392 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=202836.66666666666, ans=0.0
2023-10-10 01:58:57,946 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=202836.66666666666, ans=0.125
2023-10-10 01:58:59,754 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=202836.66666666666, ans=0.2
2023-10-10 01:59:03,918 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=202883.33333333334, ans=0.125
2023-10-10 01:59:21,606 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=202976.66666666666, ans=0.0
2023-10-10 01:59:59,825 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.432e+02 1.781e+02 1.997e+02 2.293e+02 3.596e+02, threshold=3.994e+02, percent-clipped=0.0
2023-10-10 02:00:02,199 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=203116.66666666666, ans=0.125
2023-10-10 02:00:20,064 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=203210.0, ans=0.1
2023-10-10 02:00:22,425 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.37 vs. limit=15.0
2023-10-10 02:00:34,392 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=203256.66666666666, ans=0.125
2023-10-10 02:00:54,782 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-10 02:00:54,826 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=203350.0, ans=0.125
2023-10-10 02:01:12,731 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=203443.33333333334, ans=0.0
2023-10-10 02:01:12,780 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=203443.33333333334, ans=0.125
2023-10-10 02:01:18,749 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.21 vs. limit=15.0
2023-10-10 02:01:37,568 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.79 vs. limit=12.0
2023-10-10 02:01:42,858 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=203536.66666666666, ans=0.0
2023-10-10 02:01:51,169 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.490e+02 1.804e+02 2.005e+02 2.297e+02 3.145e+02, threshold=4.009e+02, percent-clipped=0.0
2023-10-10 02:01:58,157 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=203630.0, ans=0.125
2023-10-10 02:02:09,249 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=203676.66666666666, ans=0.0
2023-10-10 02:02:11,763 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=203676.66666666666, ans=0.0
2023-10-10 02:02:35,979 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-10 02:02:50,041 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=203816.66666666666, ans=0.09899494936611666
2023-10-10 02:02:50,376 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.12 vs. limit=22.5
2023-10-10 02:03:34,634 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=204003.33333333334, ans=0.0
2023-10-10 02:03:39,530 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.32 vs. limit=15.0
2023-10-10 02:03:43,202 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=204050.0, ans=0.125
2023-10-10 02:03:49,062 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.393e+02 1.689e+02 1.857e+02 2.109e+02 3.521e+02, threshold=3.714e+02, percent-clipped=0.0
2023-10-10 02:04:12,097 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=204143.33333333334, ans=0.125
2023-10-10 02:04:16,144 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=204190.0, ans=0.125
2023-10-10 02:04:33,904 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=204236.66666666666, ans=0.0
2023-10-10 02:04:48,646 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=204283.33333333334, ans=0.0
2023-10-10 02:04:59,272 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.48 vs. limit=15.0
2023-10-10 02:05:24,483 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=204423.33333333334, ans=0.0
2023-10-10 02:05:44,863 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.415e+02 1.796e+02 2.076e+02 2.594e+02 3.813e+02, threshold=4.152e+02, percent-clipped=2.0
2023-10-10 02:05:56,891 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=204563.33333333334, ans=0.125
2023-10-10 02:06:01,761 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=204610.0, ans=0.125
2023-10-10 02:06:13,871 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.28 vs. limit=6.0
2023-10-10 02:06:18,941 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=5.087e-03
2023-10-10 02:06:23,695 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.67 vs. limit=6.0
2023-10-10 02:06:30,038 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.16 vs. limit=15.0
2023-10-10 02:06:37,739 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=204703.33333333334, ans=0.07
2023-10-10 02:07:23,885 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-10 02:07:25,859 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=204936.66666666666, ans=0.125
2023-10-10 02:07:33,466 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=204936.66666666666, ans=0.0
2023-10-10 02:07:42,013 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.749e+02 1.882e+02 2.081e+02 2.740e+02, threshold=3.765e+02, percent-clipped=0.0
2023-10-10 02:07:49,166 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=205030.0, ans=0.125
2023-10-10 02:08:17,487 INFO [train.py:1031] (0/4) Epoch 4, batch 3000, loss[loss=0.277, simple_loss=0.3474, pruned_loss=0.1032, over 16589.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3274, pruned_loss=0.08538, over 25492966.92 frames. ], batch size: 241, lr: 1.04e-02, grad_scale: 32.0
2023-10-10 02:08:27,977 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=205216.66666666666, ans=15.0
2023-10-10 02:09:03,915 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=205356.66666666666, ans=0.125
2023-10-10 02:09:05,121 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.37 vs. limit=6.0
2023-10-10 02:09:07,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=205356.66666666666, ans=0.2
2023-10-10 02:09:07,943 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.86 vs. limit=12.0
2023-10-10 02:09:13,898 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=205403.33333333334, ans=0.125
2023-10-10 02:09:31,465 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.440e+02 1.744e+02 1.941e+02 2.104e+02 3.008e+02, threshold=3.882e+02, percent-clipped=0.0
2023-10-10 02:09:43,519 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=205496.66666666666, ans=0.2
2023-10-10 02:09:54,900 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.78 vs. limit=15.0
2023-10-10 02:09:57,161 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=205543.33333333334, ans=0.0
2023-10-10 02:10:04,483 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=205590.0, ans=0.125
2023-10-10 02:10:10,134 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=205590.0, ans=22.5
2023-10-10 02:10:39,570 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=205730.0, ans=0.125
2023-10-10 02:10:41,527 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.10 vs. limit=22.5
2023-10-10 02:10:47,992 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=205730.0, ans=0.125
2023-10-10 02:10:58,534 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=205776.66666666666, ans=0.125
2023-10-10 02:11:03,937 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=205823.33333333334, ans=0.125
2023-10-10 02:11:21,583 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.81 vs. limit=15.0
2023-10-10 02:11:25,140 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.33 vs. limit=15.0
2023-10-10 02:11:29,483 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.322e+02 1.706e+02 1.926e+02 2.287e+02 3.827e+02, threshold=3.853e+02, percent-clipped=0.0
2023-10-10 02:11:33,705 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=205963.33333333334, ans=0.07
2023-10-10 02:11:35,581 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-10-10 02:11:59,689 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=206056.66666666666, ans=0.0
2023-10-10 02:11:59,705 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=206056.66666666666, ans=0.125
2023-10-10 02:12:05,985 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=206103.33333333334, ans=0.125
2023-10-10 02:12:30,668 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-10 02:12:43,098 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=206196.66666666666, ans=0.125
2023-10-10 02:12:55,678 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=206290.0, ans=0.125
2023-10-10 02:13:11,216 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=206336.66666666666, ans=0.0
2023-10-10 02:13:30,741 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.368e+02 1.778e+02 1.961e+02 2.323e+02 3.840e+02, threshold=3.921e+02, percent-clipped=0.0
2023-10-10 02:13:35,535 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=206430.0, ans=0.125
2023-10-10 02:13:39,819 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.27 vs. limit=12.0
2023-10-10 02:13:44,875 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=206430.0, ans=0.1
2023-10-10 02:13:47,337 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=206476.66666666666, ans=0.1
2023-10-10 02:13:52,606 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=206476.66666666666, ans=0.125
2023-10-10 02:13:56,241 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=206476.66666666666, ans=0.0
2023-10-10 02:14:40,542 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-10 02:14:41,605 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=206710.0, ans=0.1
2023-10-10 02:14:55,232 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=206756.66666666666, ans=0.125
2023-10-10 02:15:02,520 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=206803.33333333334, ans=0.125
2023-10-10 02:15:15,963 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=206850.0, ans=0.0
2023-10-10 02:15:19,907 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 1.880e+02 2.112e+02 2.409e+02 4.134e+02, threshold=4.224e+02, percent-clipped=1.0
2023-10-10 02:15:25,211 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=206896.66666666666, ans=0.125
2023-10-10 02:15:31,537 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=206896.66666666666, ans=0.0
2023-10-10 02:15:45,619 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=206943.33333333334, ans=0.125
2023-10-10 02:15:59,481 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=207036.66666666666, ans=0.125
2023-10-10 02:16:26,125 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=207130.0, ans=0.125
2023-10-10 02:16:39,161 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=207176.66666666666, ans=0.0
2023-10-10 02:16:44,975 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=207223.33333333334, ans=0.125
2023-10-10 02:16:50,649 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.95 vs. limit=22.5
2023-10-10 02:17:12,055 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.501e+02 1.805e+02 2.015e+02 2.324e+02 3.356e+02, threshold=4.029e+02, percent-clipped=0.0
2023-10-10 02:17:28,046 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=207410.0, ans=0.05
2023-10-10 02:17:43,492 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=207456.66666666666, ans=0.125
2023-10-10 02:17:47,573 INFO [train.py:1031] (0/4) Epoch 4, batch 3500, loss[loss=0.2519, simple_loss=0.3316, pruned_loss=0.08607, over 16921.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.327, pruned_loss=0.08525, over 27081693.96 frames. ], batch size: 77, lr: 1.03e-02, grad_scale: 16.0
2023-10-10 02:18:06,914 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=207550.0, ans=0.125
2023-10-10 02:18:19,693 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=207643.33333333334, ans=0.1
2023-10-10 02:18:20,511 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=207643.33333333334, ans=0.125
2023-10-10 02:18:33,577 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=207690.0, ans=0.125
2023-10-10 02:18:55,637 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=207783.33333333334, ans=0.125
2023-10-10 02:19:03,123 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.839e+02 2.013e+02 2.281e+02 3.580e+02, threshold=4.025e+02, percent-clipped=0.0
2023-10-10 02:19:12,985 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=207830.0, ans=0.125
2023-10-10 02:19:33,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=207923.33333333334, ans=0.0
2023-10-10 02:19:52,232 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=207970.0, ans=0.09899494936611666
2023-10-10 02:19:59,635 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=208016.66666666666, ans=0.125
2023-10-10 02:20:06,151 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=208016.66666666666, ans=0.125
2023-10-10 02:20:10,737 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=208063.33333333334, ans=0.1
2023-10-10 02:20:17,428 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=208063.33333333334, ans=0.125
2023-10-10 02:20:20,139 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=208063.33333333334, ans=0.1
2023-10-10 02:20:48,506 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.69 vs. limit=12.0
2023-10-10 02:20:52,336 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=208203.33333333334, ans=0.0
2023-10-10 02:21:05,402 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 1.757e+02 2.049e+02 2.459e+02 4.060e+02, threshold=4.098e+02, percent-clipped=1.0
2023-10-10 02:21:10,658 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=208296.66666666666, ans=0.125
2023-10-10 02:21:15,304 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.61 vs. limit=15.0
2023-10-10 02:21:29,212 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.26 vs. limit=15.0
2023-10-10 02:21:37,352 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=208390.0, ans=0.125
2023-10-10 02:22:01,445 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=208483.33333333334, ans=0.2
2023-10-10 02:22:06,134 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=208530.0, ans=0.0
2023-10-10 02:22:09,689 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.15 vs. limit=15.0
2023-10-10 02:22:19,740 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.27 vs. limit=10.0
2023-10-10 02:22:29,075 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=208576.66666666666, ans=0.0
2023-10-10 02:22:36,504 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.57 vs. limit=10.0
2023-10-10 02:22:41,425 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=208670.0, ans=0.1
2023-10-10 02:22:45,586 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.48 vs. limit=22.5
2023-10-10 02:22:51,047 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=208670.0, ans=0.0
2023-10-10 02:23:01,206 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.363e+02 1.679e+02 1.833e+02 2.154e+02 2.925e+02, threshold=3.667e+02, percent-clipped=0.0
2023-10-10 02:23:10,869 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=208763.33333333334, ans=0.125
2023-10-10 02:23:20,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=208810.0, ans=10.0
2023-10-10 02:23:21,824 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.07 vs. limit=15.0
2023-10-10 02:23:55,851 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=208950.0, ans=0.125
2023-10-10 02:24:04,736 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=208996.66666666666, ans=0.125
2023-10-10 02:24:12,508 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=208996.66666666666, ans=0.2
2023-10-10 02:24:29,469 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.whiten.whitening_limit, batch_count=209090.0, ans=12.0
2023-10-10 02:24:30,238 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=209090.0, ans=0.09899494936611666
2023-10-10 02:24:41,490 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=209136.66666666666, ans=0.2
2023-10-10 02:24:47,349 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.09 vs. limit=15.0
2023-10-10 02:24:52,766 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.699e+02 1.847e+02 2.039e+02 2.847e+02, threshold=3.695e+02, percent-clipped=0.0
2023-10-10 02:24:54,038 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=209183.33333333334, ans=0.125
2023-10-10 02:24:59,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=209230.0, ans=0.0
2023-10-10 02:25:16,971 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=209323.33333333334, ans=0.0
2023-10-10 02:25:54,323 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=209463.33333333334, ans=0.5
2023-10-10 02:26:02,330 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=209510.0, ans=0.2
2023-10-10 02:26:10,992 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.74 vs. limit=15.0
2023-10-10 02:26:17,296 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=209556.66666666666, ans=0.125
2023-10-10 02:26:19,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=209556.66666666666, ans=0.125
2023-10-10 02:26:24,851 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.21 vs. limit=15.0
2023-10-10 02:26:43,610 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.378e+02 1.772e+02 2.028e+02 2.261e+02 3.926e+02, threshold=4.055e+02, percent-clipped=1.0
2023-10-10 02:27:08,837 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=209790.0, ans=0.125
2023-10-10 02:27:15,066 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.98 vs. limit=22.5
2023-10-10 02:27:20,182 INFO [train.py:1031] (0/4) Epoch 4, batch 4000, loss[loss=0.2605, simple_loss=0.3406, pruned_loss=0.09023, over 16853.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3264, pruned_loss=0.08504, over 28331138.13 frames. ], batch size: 116, lr: 1.03e-02, grad_scale: 32.0
2023-10-10 02:27:23,087 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=209836.66666666666, ans=0.125
2023-10-10 02:27:23,089 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=209836.66666666666, ans=0.0
2023-10-10 02:27:35,966 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=209883.33333333334, ans=0.125
2023-10-10 02:27:36,031 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=209883.33333333334, ans=0.1
2023-10-10 02:27:45,264 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=209930.0, ans=0.0
2023-10-10 02:27:50,859 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=209930.0, ans=0.125
2023-10-10 02:27:52,608 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=209930.0, ans=0.0
2023-10-10 02:28:04,511 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0
2023-10-10 02:28:20,588 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.08 vs. limit=15.0
2023-10-10 02:28:27,868 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.17 vs. limit=22.5
2023-10-10 02:28:32,486 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=210116.66666666666, ans=0.0
2023-10-10 02:28:37,279 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.752e+02 2.040e+02 2.309e+02 3.400e+02, threshold=4.080e+02, percent-clipped=0.0
2023-10-10 02:28:41,297 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=210163.33333333334, ans=0.125
2023-10-10 02:28:41,425 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.79 vs. limit=15.0
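The per-batch summaries (e.g. "Epoch 4, batch 4000" above) follow a fixed format, so training progress can be pulled out of a saved copy of this log with a regex. A small helper, assuming the log is saved as train.log (the filename is illustrative):

```python
import re

# Matches the [train.py:1031] batch summaries as they appear in this log.
SUMMARY = re.compile(
    r"Epoch (\d+), batch (\d+), .*?"
    r"tot_loss\[loss=([\d.]+), simple_loss=([\d.]+), pruned_loss=([\d.]+).*?"
    r"lr: ([\d.e-]+), grad_scale: ([\d.]+)"
)

with open("train.log") as f:
    for m in SUMMARY.finditer(f.read()):
        epoch, batch = int(m.group(1)), int(m.group(2))
        tot_loss, lr = float(m.group(3)), float(m.group(6))
        print(f"epoch {epoch} batch {batch}: tot_loss={tot_loss} lr={lr}")
```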
2023-10-10 02:28:57,652 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=210210.0, ans=0.5
2023-10-10 02:29:18,836 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=210303.33333333334, ans=0.1
2023-10-10 02:29:19,874 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=210303.33333333334, ans=0.0
2023-10-10 02:29:26,699 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=210350.0, ans=0.125
2023-10-10 02:29:30,778 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=210350.0, ans=0.125
2023-10-10 02:29:53,330 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=210443.33333333334, ans=0.125
2023-10-10 02:30:02,560 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=210490.0, ans=0.125
2023-10-10 02:30:04,951 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=210490.0, ans=0.2
2023-10-10 02:30:05,244 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=4.73 vs. limit=12.0
2023-10-10 02:30:05,918 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=210490.0, ans=0.125
2023-10-10 02:30:10,028 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.05 vs. limit=15.0
2023-10-10 02:30:26,573 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.08 vs. limit=22.5
2023-10-10 02:30:32,584 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.866e+02 2.079e+02 2.472e+02 4.433e+02, threshold=4.159e+02, percent-clipped=1.0
2023-10-10 02:30:35,093 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.76 vs. limit=15.0
2023-10-10 02:30:36,343 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=210630.0, ans=0.07
2023-10-10 02:31:12,485 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=210723.33333333334, ans=0.0
2023-10-10 02:31:48,799 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=210863.33333333334, ans=0.2
2023-10-10 02:32:06,633 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.47 vs. limit=22.5
2023-10-10 02:32:07,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=210956.66666666666, ans=10.0
2023-10-10 02:32:09,925 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.51 vs. limit=6.0
limit=6.0 2023-10-10 02:32:10,567 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=210956.66666666666, ans=0.0 2023-10-10 02:32:19,757 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=211003.33333333334, ans=0.0 2023-10-10 02:32:30,622 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=211050.0, ans=0.0 2023-10-10 02:32:38,942 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 1.689e+02 1.894e+02 2.080e+02 3.126e+02, threshold=3.788e+02, percent-clipped=0.0 2023-10-10 02:32:39,203 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 02:32:51,613 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=211143.33333333334, ans=0.125 2023-10-10 02:32:58,721 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=211143.33333333334, ans=0.0 2023-10-10 02:33:06,980 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=211190.0, ans=0.07 2023-10-10 02:33:08,652 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=211190.0, ans=10.0 2023-10-10 02:33:46,265 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=211376.66666666666, ans=0.1 2023-10-10 02:34:06,357 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=211423.33333333334, ans=0.125 2023-10-10 02:34:10,552 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.11 vs. limit=15.0 2023-10-10 02:34:29,825 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.510e+02 1.902e+02 2.143e+02 2.545e+02 3.766e+02, threshold=4.287e+02, percent-clipped=0.0 2023-10-10 02:34:36,694 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=211563.33333333334, ans=0.125 2023-10-10 02:35:12,494 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=211703.33333333334, ans=0.125 2023-10-10 02:35:13,556 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=211703.33333333334, ans=0.1 2023-10-10 02:35:15,282 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=211750.0, ans=0.0 2023-10-10 02:35:23,339 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=211750.0, ans=0.125 2023-10-10 02:35:33,930 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=211796.66666666666, ans=0.04949747468305833 2023-10-10 02:35:59,775 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. 
limit=6.0 2023-10-10 02:36:11,771 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=211936.66666666666, ans=0.125 2023-10-10 02:36:14,086 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_positive, batch_count=211936.66666666666, ans=0.05 2023-10-10 02:36:15,046 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=211936.66666666666, ans=0.0 2023-10-10 02:36:16,778 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=211936.66666666666, ans=0.1 2023-10-10 02:36:16,996 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.44 vs. limit=22.5 2023-10-10 02:36:17,821 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=211936.66666666666, ans=0.125 2023-10-10 02:36:20,588 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=211983.33333333334, ans=0.125 2023-10-10 02:36:28,552 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.797e+02 2.105e+02 2.350e+02 3.467e+02, threshold=4.210e+02, percent-clipped=0.0 2023-10-10 02:36:33,556 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=212030.0, ans=0.0 2023-10-10 02:36:34,414 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=212030.0, ans=0.0 2023-10-10 02:37:03,898 INFO [train.py:1031] (0/4) Epoch 4, batch 4500, loss[loss=0.2622, simple_loss=0.3407, pruned_loss=0.09181, over 16886.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3263, pruned_loss=0.08454, over 29315802.20 frames. ], batch size: 130, lr: 1.02e-02, grad_scale: 32.0 2023-10-10 02:37:26,907 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=212263.33333333334, ans=0.125 2023-10-10 02:37:28,574 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=212263.33333333334, ans=0.125 2023-10-10 02:37:40,215 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=212310.0, ans=0.0 2023-10-10 02:37:42,990 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=212310.0, ans=0.2 2023-10-10 02:38:14,645 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.462e+02 1.752e+02 1.966e+02 2.322e+02 3.324e+02, threshold=3.932e+02, percent-clipped=0.0 2023-10-10 02:38:47,451 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.63 vs. 
limit=15.0 2023-10-10 02:38:48,764 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=212636.66666666666, ans=0.125 2023-10-10 02:38:55,470 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=212636.66666666666, ans=0.1 2023-10-10 02:38:56,456 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=212683.33333333334, ans=0.1 2023-10-10 02:39:07,391 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=212730.0, ans=0.0 2023-10-10 02:39:29,450 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=212823.33333333334, ans=0.0 2023-10-10 02:39:41,826 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.95 vs. limit=10.0 2023-10-10 02:39:57,666 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=212916.66666666666, ans=15.0 2023-10-10 02:39:57,958 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.527e+02 1.869e+02 2.079e+02 2.382e+02 3.819e+02, threshold=4.157e+02, percent-clipped=0.0 2023-10-10 02:40:00,578 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.78 vs. limit=15.0 2023-10-10 02:40:02,454 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=212963.33333333334, ans=0.0 2023-10-10 02:40:24,206 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=213056.66666666666, ans=0.1 2023-10-10 02:40:34,547 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=213103.33333333334, ans=0.0 2023-10-10 02:40:38,251 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=213103.33333333334, ans=0.0 2023-10-10 02:40:41,927 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.15 vs. 
limit=15.0 2023-10-10 02:40:57,484 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=213196.66666666666, ans=0.125 2023-10-10 02:41:03,577 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 02:41:44,986 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.307e+02 1.697e+02 1.996e+02 2.270e+02 3.294e+02, threshold=3.993e+02, percent-clipped=0.0 2023-10-10 02:41:45,329 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=213383.33333333334, ans=0.1 2023-10-10 02:41:55,071 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=213430.0, ans=0.125 2023-10-10 02:42:13,574 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=213523.33333333334, ans=0.2 2023-10-10 02:42:29,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=213616.66666666666, ans=0.05 2023-10-10 02:42:44,306 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=213663.33333333334, ans=0.0 2023-10-10 02:42:50,953 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=213663.33333333334, ans=0.125 2023-10-10 02:42:57,069 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=213710.0, ans=0.2 2023-10-10 02:43:20,741 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=213803.33333333334, ans=0.125 2023-10-10 02:43:26,542 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=213803.33333333334, ans=0.125 2023-10-10 02:43:37,931 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.403e+02 1.779e+02 2.094e+02 2.547e+02 3.384e+02, threshold=4.188e+02, percent-clipped=0.0 2023-10-10 02:43:40,225 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=213896.66666666666, ans=0.125 2023-10-10 02:43:45,375 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=213896.66666666666, ans=0.0 2023-10-10 02:43:52,270 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 02:43:53,135 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 02:43:53,137 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=213943.33333333334, ans=0.05 2023-10-10 02:43:53,245 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=213943.33333333334, ans=0.125 2023-10-10 02:43:53,255 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=213943.33333333334, ans=0.0 2023-10-10 02:43:56,748 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=213943.33333333334, ans=0.125 2023-10-10 02:44:06,776 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=213990.0, ans=0.125 2023-10-10 02:44:14,929 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=214036.66666666666, ans=0.2 2023-10-10 02:44:21,425 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=214036.66666666666, ans=0.125 2023-10-10 02:44:23,352 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=214036.66666666666, ans=0.0 2023-10-10 02:44:29,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=214083.33333333334, ans=0.05 2023-10-10 02:44:38,110 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=214130.0, ans=0.1 2023-10-10 02:44:41,104 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.73 vs. limit=15.0 2023-10-10 02:44:53,862 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=214176.66666666666, ans=0.0 2023-10-10 02:45:04,154 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.95 vs. limit=15.0 2023-10-10 02:45:19,566 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=214270.0, ans=0.1 2023-10-10 02:45:23,111 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.97 vs. limit=10.0 2023-10-10 02:45:29,607 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.363e+02 1.683e+02 1.835e+02 2.046e+02 3.436e+02, threshold=3.670e+02, percent-clipped=0.0 2023-10-10 02:45:41,008 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=214363.33333333334, ans=0.125 2023-10-10 02:45:50,179 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.10 vs. limit=15.0 2023-10-10 02:45:52,786 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=214456.66666666666, ans=0.0 2023-10-10 02:45:57,415 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=214456.66666666666, ans=0.0 2023-10-10 02:46:04,282 INFO [train.py:1031] (0/4) Epoch 4, batch 5000, loss[loss=0.2438, simple_loss=0.3308, pruned_loss=0.07843, over 17003.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3258, pruned_loss=0.08449, over 30082453.95 frames. 
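
[Editor's note] The per-batch train.py summaries are internally consistent with a combined pruned-transducer objective of the form loss = 0.5 * simple_loss + pruned_loss: for example, 0.5 * 0.3406 + 0.09023 = 0.2605, matching the Epoch 4, batch 4000 line. The 0.5 weight is inferred from the logged numbers themselves rather than taken from the training code. A short check using triples copied from the summaries above:

# Verify loss ~= 0.5 * simple_loss + pruned_loss on the logged values.
logged = [
    # (loss, simple_loss, pruned_loss)
    (0.2605, 0.3406, 0.09023),  # Epoch 4, batch 4000
    (0.2622, 0.3407, 0.09181),  # Epoch 4, batch 4500
    (0.2438, 0.3308, 0.07843),  # Epoch 4, batch 5000
]
for loss, simple, pruned in logged:
    recon = 0.5 * simple + pruned
    assert abs(recon - loss) < 5e-4, (loss, recon)
    print(f"{loss:.4f} ~= 0.5*{simple:.4f} + {pruned:.4f} = {recon:.4f}")
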
], batch size: 117, lr: 1.02e-02, grad_scale: 32.0 2023-10-10 02:46:16,534 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=214550.0, ans=0.04949747468305833 2023-10-10 02:46:22,297 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=214550.0, ans=0.125 2023-10-10 02:46:41,448 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=214643.33333333334, ans=0.09899494936611666 2023-10-10 02:46:55,026 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=214690.0, ans=0.0 2023-10-10 02:47:19,245 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.706e+02 1.872e+02 2.127e+02 3.239e+02, threshold=3.744e+02, percent-clipped=0.0 2023-10-10 02:47:24,503 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=214830.0, ans=0.125 2023-10-10 02:47:32,599 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.00 vs. limit=22.5 2023-10-10 02:47:53,923 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=214923.33333333334, ans=0.125 2023-10-10 02:47:55,303 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=214923.33333333334, ans=0.1 2023-10-10 02:48:20,220 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=215063.33333333334, ans=0.125 2023-10-10 02:48:35,472 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=215110.0, ans=0.0 2023-10-10 02:48:39,876 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=215110.0, ans=0.125 2023-10-10 02:49:12,037 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.275e+02 1.784e+02 2.019e+02 2.257e+02 2.952e+02, threshold=4.039e+02, percent-clipped=0.0 2023-10-10 02:49:12,478 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=215250.0, ans=0.0 2023-10-10 02:49:33,873 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=215390.0, ans=0.025 2023-10-10 02:49:49,416 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=215436.66666666666, ans=10.0 2023-10-10 02:50:09,848 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=215530.0, ans=0.125 2023-10-10 02:50:12,744 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=215530.0, ans=0.125 2023-10-10 02:50:13,627 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=215530.0, ans=0.125 2023-10-10 02:50:20,364 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=215576.66666666666, ans=0.1 2023-10-10 02:50:38,643 INFO [scaling.py:979] (0/4) Whitening: 
name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.03 vs. limit=15.0 2023-10-10 02:50:46,397 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=215670.0, ans=0.2 2023-10-10 02:51:03,573 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.551e+02 1.821e+02 2.072e+02 2.469e+02 3.778e+02, threshold=4.145e+02, percent-clipped=0.0 2023-10-10 02:51:17,192 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.14 vs. limit=10.0 2023-10-10 02:51:35,165 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=215856.66666666666, ans=0.125 2023-10-10 02:52:10,176 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 02:52:13,025 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=215996.66666666666, ans=0.125 2023-10-10 02:52:17,754 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=216043.33333333334, ans=0.0 2023-10-10 02:52:19,797 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=216043.33333333334, ans=0.1 2023-10-10 02:52:36,973 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=216090.0, ans=0.125 2023-10-10 02:52:40,654 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.40 vs. limit=15.0 2023-10-10 02:52:59,177 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.424e+02 1.706e+02 1.947e+02 2.292e+02 4.215e+02, threshold=3.894e+02, percent-clipped=1.0 2023-10-10 02:53:06,839 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.68 vs. limit=15.0 2023-10-10 02:53:27,029 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.54 vs. limit=6.0 2023-10-10 02:53:40,862 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.82 vs. limit=15.0 2023-10-10 02:53:45,267 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.27 vs. limit=15.0 2023-10-10 02:54:02,428 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=216510.0, ans=0.125 2023-10-10 02:54:02,557 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.55 vs. 
limit=15.0 2023-10-10 02:54:03,375 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=216510.0, ans=0.0 2023-10-10 02:54:06,344 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=216510.0, ans=0.05 2023-10-10 02:54:09,780 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=216510.0, ans=0.09899494936611666 2023-10-10 02:54:24,804 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=216603.33333333334, ans=0.0 2023-10-10 02:54:45,147 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.425e+02 1.724e+02 1.893e+02 2.092e+02 3.210e+02, threshold=3.786e+02, percent-clipped=0.0 2023-10-10 02:55:02,922 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.23 vs. limit=22.5 2023-10-10 02:55:12,022 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=216790.0, ans=0.0 2023-10-10 02:55:19,817 INFO [train.py:1031] (0/4) Epoch 4, batch 5500, loss[loss=0.2826, simple_loss=0.3445, pruned_loss=0.1103, over 15646.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3253, pruned_loss=0.08389, over 30712470.12 frames. ], batch size: 350, lr: 1.01e-02, grad_scale: 32.0 2023-10-10 02:55:27,527 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=216836.66666666666, ans=0.04949747468305833 2023-10-10 02:55:30,364 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=216883.33333333334, ans=0.07 2023-10-10 02:55:42,997 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=216930.0, ans=0.2 2023-10-10 02:55:48,173 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=216930.0, ans=0.125 2023-10-10 02:55:53,271 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=216976.66666666666, ans=0.2 2023-10-10 02:55:56,074 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=216976.66666666666, ans=0.0 2023-10-10 02:55:57,829 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=216976.66666666666, ans=0.04949747468305833 2023-10-10 02:56:02,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=217023.33333333334, ans=0.1 2023-10-10 02:56:02,651 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.59 vs. limit=15.0 2023-10-10 02:56:26,758 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.73 vs. 
limit=15.0 2023-10-10 02:56:33,211 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.390e+02 1.755e+02 1.980e+02 2.233e+02 3.178e+02, threshold=3.961e+02, percent-clipped=0.0 2023-10-10 02:56:41,842 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.24 vs. limit=22.5 2023-10-10 02:56:55,150 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=217256.66666666666, ans=0.1 2023-10-10 02:56:55,366 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=217256.66666666666, ans=0.125 2023-10-10 02:57:19,475 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=217350.0, ans=0.0 2023-10-10 02:57:19,623 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=217350.0, ans=0.125 2023-10-10 02:57:22,124 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=217350.0, ans=0.125 2023-10-10 02:57:29,533 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=217396.66666666666, ans=0.1 2023-10-10 02:57:37,600 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.76 vs. limit=12.0 2023-10-10 02:58:03,324 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.91 vs. limit=15.0 2023-10-10 02:58:09,869 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=217536.66666666666, ans=0.95 2023-10-10 02:58:17,723 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=217583.33333333334, ans=0.0 2023-10-10 02:58:22,717 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.939e+02 2.178e+02 2.611e+02 3.893e+02, threshold=4.357e+02, percent-clipped=0.0 2023-10-10 02:58:38,279 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=217676.66666666666, ans=0.0 2023-10-10 02:59:03,298 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.65 vs. limit=22.5 2023-10-10 02:59:10,248 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=217816.66666666666, ans=0.1 2023-10-10 02:59:12,443 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.03 vs. 
limit=22.5 2023-10-10 02:59:13,285 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=217816.66666666666, ans=0.0 2023-10-10 02:59:25,701 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=217863.33333333334, ans=0.1 2023-10-10 02:59:32,764 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 02:59:48,285 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=217956.66666666666, ans=0.1 2023-10-10 02:59:50,713 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=217956.66666666666, ans=0.1 2023-10-10 02:59:51,920 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.87 vs. limit=22.5 2023-10-10 02:59:58,263 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=218003.33333333334, ans=10.0 2023-10-10 03:00:05,077 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=218050.0, ans=0.125 2023-10-10 03:00:12,615 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.27 vs. limit=10.0 2023-10-10 03:00:14,433 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.477e+02 1.721e+02 1.922e+02 2.159e+02 3.016e+02, threshold=3.845e+02, percent-clipped=0.0 2023-10-10 03:00:14,815 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=218096.66666666666, ans=0.125 2023-10-10 03:00:15,842 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=218096.66666666666, ans=0.0 2023-10-10 03:00:18,621 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=218096.66666666666, ans=0.1 2023-10-10 03:00:29,264 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=218143.33333333334, ans=0.1 2023-10-10 03:00:51,435 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=218236.66666666666, ans=0.1 2023-10-10 03:01:00,027 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=218283.33333333334, ans=0.2 2023-10-10 03:01:06,597 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=218283.33333333334, ans=0.0 2023-10-10 03:01:06,646 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=218283.33333333334, ans=0.0 2023-10-10 03:01:12,603 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.97 vs. 
limit=22.5 2023-10-10 03:01:23,652 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=218376.66666666666, ans=0.125 2023-10-10 03:01:30,550 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=218376.66666666666, ans=0.1 2023-10-10 03:01:32,894 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=218376.66666666666, ans=0.0 2023-10-10 03:01:35,552 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=218423.33333333334, ans=0.125 2023-10-10 03:01:35,839 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.91 vs. limit=15.0 2023-10-10 03:01:38,273 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=218423.33333333334, ans=0.1 2023-10-10 03:01:39,198 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=218423.33333333334, ans=0.1 2023-10-10 03:01:45,530 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=218470.0, ans=0.0 2023-10-10 03:01:56,121 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.33 vs. limit=15.0 2023-10-10 03:01:59,655 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.84 vs. limit=10.0 2023-10-10 03:02:05,198 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.47 vs. limit=22.5 2023-10-10 03:02:05,576 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.324e+02 1.738e+02 1.942e+02 2.193e+02 3.532e+02, threshold=3.884e+02, percent-clipped=0.0 2023-10-10 03:02:09,318 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=218563.33333333334, ans=0.125 2023-10-10 03:02:51,953 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.16 vs. limit=10.0 2023-10-10 03:03:36,324 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=218890.0, ans=0.125 2023-10-10 03:03:44,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=218936.66666666666, ans=0.0 2023-10-10 03:03:57,376 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=218983.33333333334, ans=0.125 2023-10-10 03:03:59,378 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.355e+02 1.773e+02 1.967e+02 2.314e+02 3.785e+02, threshold=3.935e+02, percent-clipped=0.0 2023-10-10 03:04:13,170 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.96 vs. 
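
[Editor's note] In the optim.py clipping lines, the five numbers after "grad-norm quartiles" read as (min, 25%, median, 75%, max) of recent gradient norms, and the logged threshold equals Clipping_scale times the median (for instance 2.0 * 2.040e+02 = 4.080e+02 in the first such line of this excerpt); percent-clipped is the fraction of norms above that threshold. A sketch of the reported statistic follows; how the window of norms is collected is an assumption, and this is not the icefall optimizer code.

import torch

def clip_report(grad_norms: torch.Tensor, clipping_scale: float = 2.0):
    # Five-point summary of recent gradient norms, as in the log lines.
    q = torch.quantile(grad_norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * float(q[2])          # scale * median
    pct = 100.0 * float((grad_norms > threshold).float().mean())
    print("grad-norm quartiles", " ".join(f"{float(v):.3e}" for v in q),
          f"threshold={threshold:.3e}, percent-clipped={pct:.1f}")
    return threshold

norms = torch.tensor([146.5, 175.2, 204.0, 230.9, 340.0])
clip_report(norms)  # threshold = 2.0 * 204.0 = 4.080e+02, percent-clipped=0.0
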
limit=22.5 2023-10-10 03:04:15,094 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 03:04:19,436 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=219123.33333333334, ans=0.1 2023-10-10 03:04:32,547 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=219170.0, ans=0.125 2023-10-10 03:04:33,180 INFO [train.py:1031] (0/4) Epoch 4, batch 6000, loss[loss=0.2454, simple_loss=0.3221, pruned_loss=0.0843, over 16502.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3253, pruned_loss=0.08407, over 31147135.42 frames. ], batch size: 61, lr: 1.00e-02, grad_scale: 32.0 2023-10-10 03:04:38,883 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.36 vs. limit=6.0 2023-10-10 03:04:48,268 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=219216.66666666666, ans=0.1 2023-10-10 03:05:04,300 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=219263.33333333334, ans=0.07 2023-10-10 03:05:11,346 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=219310.0, ans=0.125 2023-10-10 03:05:34,759 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=219403.33333333334, ans=0.125 2023-10-10 03:05:50,305 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.520e+02 1.744e+02 1.931e+02 2.397e+02 3.189e+02, threshold=3.862e+02, percent-clipped=0.0 2023-10-10 03:05:52,573 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_na.min_abs, batch_count=219496.66666666666, ans=0.02 2023-10-10 03:06:19,477 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=219590.0, ans=0.125 2023-10-10 03:06:20,474 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=219590.0, ans=0.0 2023-10-10 03:06:25,573 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=219636.66666666666, ans=0.0 2023-10-10 03:06:46,735 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.00 vs. limit=15.0 2023-10-10 03:06:54,097 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.33 vs. 
limit=15.0 2023-10-10 03:07:00,641 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=219776.66666666666, ans=0.2 2023-10-10 03:07:04,813 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=219776.66666666666, ans=0.1 2023-10-10 03:07:10,884 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=219823.33333333334, ans=0.1 2023-10-10 03:07:33,315 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.77 vs. limit=15.0 2023-10-10 03:07:37,526 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=219963.33333333334, ans=0.1 2023-10-10 03:07:40,118 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.518e+02 1.759e+02 2.070e+02 2.389e+02 3.227e+02, threshold=4.139e+02, percent-clipped=0.0 2023-10-10 03:08:02,178 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.76 vs. limit=8.0 2023-10-10 03:08:02,716 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=220056.66666666666, ans=0.125 2023-10-10 03:08:17,869 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=220103.33333333334, ans=0.125 2023-10-10 03:08:26,186 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=220150.0, ans=0.1 2023-10-10 03:08:30,304 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=220150.0, ans=0.0 2023-10-10 03:08:38,128 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.37 vs. 
limit=22.5 2023-10-10 03:08:39,544 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=220196.66666666666, ans=0.04949747468305833 2023-10-10 03:08:52,929 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=220290.0, ans=0.09899494936611666 2023-10-10 03:09:12,066 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=220336.66666666666, ans=0.2 2023-10-10 03:09:20,487 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=220383.33333333334, ans=0.0 2023-10-10 03:09:27,599 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=220430.0, ans=0.125 2023-10-10 03:09:28,136 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.792e+02 1.935e+02 2.323e+02 3.493e+02, threshold=3.871e+02, percent-clipped=0.0 2023-10-10 03:09:56,984 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=220523.33333333334, ans=0.0 2023-10-10 03:09:58,933 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=220570.0, ans=0.2 2023-10-10 03:10:11,153 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-10 03:11:14,366 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=220850.0, ans=0.125 2023-10-10 03:11:26,205 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=220896.66666666666, ans=0.2 2023-10-10 03:11:29,319 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.532e+02 1.874e+02 2.188e+02 2.548e+02 4.224e+02, threshold=4.376e+02, percent-clipped=2.0 2023-10-10 03:11:31,151 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.12 vs. limit=12.0 2023-10-10 03:12:00,991 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=221036.66666666666, ans=0.1 2023-10-10 03:12:13,990 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=221083.33333333334, ans=0.125 2023-10-10 03:12:33,721 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=221130.0, ans=0.125 2023-10-10 03:12:46,412 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=221223.33333333334, ans=0.5 2023-10-10 03:13:04,343 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.48 vs. limit=22.5 2023-10-10 03:13:12,605 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=221316.66666666666, ans=0.0 2023-10-10 03:13:14,670 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.17 vs. 
limit=15.0 2023-10-10 03:13:15,294 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=221316.66666666666, ans=0.125 2023-10-10 03:13:20,868 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.284e+02 1.696e+02 1.860e+02 2.222e+02 3.148e+02, threshold=3.720e+02, percent-clipped=0.0 2023-10-10 03:13:41,987 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=221410.0, ans=0.125 2023-10-10 03:13:42,075 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=221410.0, ans=0.1 2023-10-10 03:13:45,209 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=221456.66666666666, ans=0.1 2023-10-10 03:13:46,176 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=221456.66666666666, ans=0.2 2023-10-10 03:13:59,028 INFO [train.py:1031] (0/4) Epoch 4, batch 6500, loss[loss=0.2562, simple_loss=0.34, pruned_loss=0.08618, over 16899.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3257, pruned_loss=0.0841, over 31538995.08 frames. ], batch size: 188, lr: 1.00e-02, grad_scale: 16.0 2023-10-10 03:14:17,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=221550.0, ans=0.125 2023-10-10 03:14:31,188 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=221596.66666666666, ans=15.0 2023-10-10 03:14:37,463 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=221643.33333333334, ans=0.125 2023-10-10 03:15:01,189 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=221690.0, ans=0.1 2023-10-10 03:15:05,481 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.80 vs. limit=10.0 2023-10-10 03:15:15,020 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=221736.66666666666, ans=0.07 2023-10-10 03:15:19,615 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.13 vs. limit=15.0 2023-10-10 03:15:30,884 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.532e+02 1.798e+02 2.057e+02 2.350e+02 2.984e+02, threshold=4.115e+02, percent-clipped=0.0 2023-10-10 03:15:33,855 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=221830.0, ans=0.125 2023-10-10 03:15:48,690 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=11.42 vs. 
limit=15.0 2023-10-10 03:15:51,063 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=221923.33333333334, ans=0.125 2023-10-10 03:15:53,530 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=221923.33333333334, ans=0.125 2023-10-10 03:16:11,484 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=221970.0, ans=0.07 2023-10-10 03:16:40,203 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=222110.0, ans=0.125 2023-10-10 03:16:45,685 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=222110.0, ans=0.0 2023-10-10 03:16:57,155 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=222203.33333333334, ans=0.0 2023-10-10 03:16:58,219 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=222203.33333333334, ans=0.1 2023-10-10 03:17:10,623 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=222250.0, ans=0.0 2023-10-10 03:17:20,028 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.811e+02 2.033e+02 2.390e+02 3.612e+02, threshold=4.065e+02, percent-clipped=0.0 2023-10-10 03:17:51,166 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=222436.66666666666, ans=0.0 2023-10-10 03:18:09,792 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=222530.0, ans=0.125 2023-10-10 03:18:20,463 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=222530.0, ans=0.2 2023-10-10 03:18:51,399 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=222670.0, ans=0.1 2023-10-10 03:19:11,593 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.302e+02 1.820e+02 2.036e+02 2.562e+02 3.931e+02, threshold=4.073e+02, percent-clipped=0.0 2023-10-10 03:19:23,741 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.58 vs. 
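
[Editor's note] The "Whitening: name=..., metric=X vs. limit=Y" lines compare a per-module statistic of the feature covariance against a scheduled limit. One statistic with plausible behavior (exactly 1.0 for perfectly whitened features, growing as channels become correlated or unevenly scaled, which matches the magnitudes logged here) is sketched below; the exact definition used by scaling.py is an assumption.

import torch

def whitening_metric(x: torch.Tensor) -> float:
    # x: (num_frames, num_channels) features.
    x = x - x.mean(dim=0, keepdim=True)
    c = (x.t() @ x) / x.shape[0]        # channel covariance
    d = c.shape[0]
    # Sum of squared covariance entries, normalized so c = k*I gives 1.0.
    return float((c * c).sum() / (c.diagonal().sum() ** 2 / d))

white = torch.randn(10000, 192)
print(whitening_metric(white))                          # close to 1.0
print(whitening_metric(white @ torch.randn(192, 192)))  # well above 1.0
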
limit=22.5 2023-10-10 03:19:24,976 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=222810.0, ans=0.125 2023-10-10 03:19:27,562 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=222810.0, ans=0.125 2023-10-10 03:19:29,052 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=222810.0, ans=0.1 2023-10-10 03:19:31,018 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.whiten.whitening_limit, batch_count=222810.0, ans=12.0 2023-10-10 03:19:41,203 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 03:19:50,740 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=222903.33333333334, ans=0.125 2023-10-10 03:20:26,476 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=222996.66666666666, ans=0.125 2023-10-10 03:20:27,487 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=222996.66666666666, ans=0.0 2023-10-10 03:20:35,378 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=223043.33333333334, ans=0.125 2023-10-10 03:20:36,963 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.36 vs. limit=15.0 2023-10-10 03:20:42,411 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=223090.0, ans=0.125 2023-10-10 03:20:47,113 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=223090.0, ans=0.1 2023-10-10 03:21:18,840 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.297e+02 1.715e+02 1.972e+02 2.494e+02 3.331e+02, threshold=3.944e+02, percent-clipped=0.0 2023-10-10 03:21:22,556 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=223230.0, ans=0.0 2023-10-10 03:21:29,621 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 03:21:30,766 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=223276.66666666666, ans=0.0 2023-10-10 03:21:31,615 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=223276.66666666666, ans=0.0 2023-10-10 03:21:40,592 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=223323.33333333334, ans=0.2 2023-10-10 03:21:47,590 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=223323.33333333334, ans=0.125 2023-10-10 03:22:20,405 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=223463.33333333334, ans=0.125 2023-10-10 03:22:35,067 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=223556.66666666666, ans=0.04949747468305833 2023-10-10 03:22:36,102 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.42 vs. limit=22.5 2023-10-10 03:22:37,685 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=223556.66666666666, ans=0.125 2023-10-10 03:22:37,806 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=223556.66666666666, ans=0.0 2023-10-10 03:22:37,908 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0 2023-10-10 03:22:38,697 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 03:22:40,215 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=223556.66666666666, ans=0.125 2023-10-10 03:22:46,421 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=223603.33333333334, ans=0.1 2023-10-10 03:23:06,307 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=223696.66666666666, ans=0.0 2023-10-10 03:23:06,334 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=223696.66666666666, ans=0.125 2023-10-10 03:23:07,392 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=223696.66666666666, ans=0.1 2023-10-10 03:23:08,887 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 1.723e+02 1.980e+02 2.256e+02 3.852e+02, threshold=3.959e+02, percent-clipped=0.0 2023-10-10 03:23:34,176 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=223790.0, ans=0.0 2023-10-10 03:23:36,369 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.59 vs. limit=15.0 2023-10-10 03:23:37,140 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=223836.66666666666, ans=0.0 2023-10-10 03:23:37,849 INFO [train.py:1031] (0/4) Epoch 4, batch 7000, loss[loss=0.2598, simple_loss=0.3369, pruned_loss=0.09133, over 16838.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.326, pruned_loss=0.08384, over 31817396.36 frames. ], batch size: 146, lr: 9.95e-03, grad_scale: 16.0 2023-10-10 03:24:10,755 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=223930.0, ans=0.125 2023-10-10 03:24:10,898 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=223930.0, ans=0.0 2023-10-10 03:24:16,395 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.56 vs. 
limit=15.0 2023-10-10 03:24:20,837 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-48000.pt 2023-10-10 03:24:41,273 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=224070.0, ans=0.95 2023-10-10 03:24:41,666 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.10 vs. limit=22.5 2023-10-10 03:24:45,404 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 03:24:58,544 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=224116.66666666666, ans=0.05 2023-10-10 03:24:59,551 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.85 vs. limit=10.0 2023-10-10 03:25:04,599 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.552e+02 2.036e+02 2.329e+02 2.722e+02 3.424e+02, threshold=4.659e+02, percent-clipped=0.0 2023-10-10 03:25:15,942 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.69 vs. limit=15.0 2023-10-10 03:25:28,504 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=224256.66666666666, ans=0.125 2023-10-10 03:25:32,831 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.69 vs. limit=22.5 2023-10-10 03:25:34,974 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-10 03:25:43,387 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.89 vs. 
limit=15.0 2023-10-10 03:25:57,322 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=224396.66666666666, ans=0.05 2023-10-10 03:26:01,846 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=224396.66666666666, ans=0.025 2023-10-10 03:26:29,801 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=224536.66666666666, ans=0.0 2023-10-10 03:26:42,553 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=224583.33333333334, ans=0.035 2023-10-10 03:26:49,633 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=224583.33333333334, ans=0.2 2023-10-10 03:26:50,570 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=224583.33333333334, ans=0.1 2023-10-10 03:26:55,566 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.783e+02 2.126e+02 2.446e+02 3.555e+02, threshold=4.252e+02, percent-clipped=0.0 2023-10-10 03:27:01,131 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=224630.0, ans=0.07 2023-10-10 03:27:02,125 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=224676.66666666666, ans=0.0 2023-10-10 03:27:40,224 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=224770.0, ans=0.1 2023-10-10 03:27:54,106 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=224816.66666666666, ans=0.0 2023-10-10 03:28:11,941 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=224910.0, ans=0.0 2023-10-10 03:28:20,266 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.57 vs. limit=22.5 2023-10-10 03:28:25,341 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=224956.66666666666, ans=0.125 2023-10-10 03:28:27,369 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.65 vs. limit=5.0 2023-10-10 03:28:45,204 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=225003.33333333334, ans=0.1 2023-10-10 03:28:57,994 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=225050.0, ans=0.0 2023-10-10 03:28:58,902 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=225096.66666666666, ans=0.0 2023-10-10 03:29:02,907 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.735e+02 1.944e+02 2.216e+02 3.505e+02, threshold=3.889e+02, percent-clipped=0.0 2023-10-10 03:29:13,089 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.40 vs. 
limit=15.0 2023-10-10 03:29:13,938 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=225143.33333333334, ans=0.07 2023-10-10 03:29:34,016 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=225236.66666666666, ans=0.0 2023-10-10 03:30:22,807 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=225423.33333333334, ans=0.1 2023-10-10 03:30:28,595 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=225423.33333333334, ans=0.125 2023-10-10 03:30:58,764 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.424e+02 1.725e+02 1.965e+02 2.229e+02 3.831e+02, threshold=3.931e+02, percent-clipped=0.0 2023-10-10 03:31:11,937 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.69 vs. limit=15.0 2023-10-10 03:31:18,673 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=225656.66666666666, ans=0.2 2023-10-10 03:31:28,426 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=225703.33333333334, ans=0.0 2023-10-10 03:31:55,379 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=225796.66666666666, ans=0.2 2023-10-10 03:32:06,916 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=225843.33333333334, ans=0.0 2023-10-10 03:32:13,247 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=225890.0, ans=10.0 2023-10-10 03:32:18,985 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.08 vs. limit=22.5 2023-10-10 03:32:21,941 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.59 vs. limit=22.5 2023-10-10 03:32:29,941 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=225936.66666666666, ans=0.125 2023-10-10 03:32:34,527 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=1.96 vs. limit=15.0 2023-10-10 03:32:45,984 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.563e+02 1.808e+02 1.976e+02 2.266e+02 3.262e+02, threshold=3.952e+02, percent-clipped=0.0 2023-10-10 03:32:49,748 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.14 vs. limit=12.0 2023-10-10 03:32:59,401 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=226076.66666666666, ans=0.07 2023-10-10 03:33:01,353 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.95 vs. 
limit=15.0 2023-10-10 03:33:14,203 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=226123.33333333334, ans=0.0 2023-10-10 03:33:18,278 INFO [train.py:1031] (0/4) Epoch 4, batch 7500, loss[loss=0.2272, simple_loss=0.3082, pruned_loss=0.07314, over 16565.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3256, pruned_loss=0.08364, over 32026185.18 frames. ], batch size: 66, lr: 9.90e-03, grad_scale: 32.0 2023-10-10 03:33:34,065 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=226216.66666666666, ans=0.0 2023-10-10 03:33:35,139 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=226216.66666666666, ans=0.2 2023-10-10 03:33:51,045 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=226310.0, ans=0.125 2023-10-10 03:33:52,047 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=226310.0, ans=0.0 2023-10-10 03:33:52,082 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=226310.0, ans=0.125 2023-10-10 03:34:20,641 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=226403.33333333334, ans=0.015 2023-10-10 03:34:31,583 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=226450.0, ans=0.125 2023-10-10 03:34:33,921 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=226496.66666666666, ans=0.95 2023-10-10 03:34:38,025 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.480e+02 1.862e+02 2.261e+02 2.638e+02 3.951e+02, threshold=4.522e+02, percent-clipped=0.0 2023-10-10 03:34:55,093 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=226590.0, ans=0.125 2023-10-10 03:35:08,617 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 03:35:45,470 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.98 vs. 
limit=15.0 2023-10-10 03:36:06,025 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=226823.33333333334, ans=0.125 2023-10-10 03:36:09,249 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=226823.33333333334, ans=0.0 2023-10-10 03:36:35,064 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=226963.33333333334, ans=0.2 2023-10-10 03:36:39,589 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.749e+02 1.966e+02 2.272e+02 3.303e+02, threshold=3.932e+02, percent-clipped=0.0 2023-10-10 03:36:45,037 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=227010.0, ans=0.125 2023-10-10 03:36:58,129 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=227056.66666666666, ans=0.125 2023-10-10 03:37:03,131 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=227056.66666666666, ans=0.125 2023-10-10 03:37:05,618 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=227056.66666666666, ans=0.125 2023-10-10 03:37:10,143 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=227103.33333333334, ans=0.0 2023-10-10 03:37:10,220 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=227103.33333333334, ans=0.0 2023-10-10 03:37:22,585 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=227150.0, ans=0.2 2023-10-10 03:37:28,577 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=227150.0, ans=0.2 2023-10-10 03:37:31,769 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=227196.66666666666, ans=0.125 2023-10-10 03:37:33,523 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=227196.66666666666, ans=0.1 2023-10-10 03:37:35,574 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.22 vs. limit=10.0 2023-10-10 03:37:45,403 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=227243.33333333334, ans=0.0 2023-10-10 03:38:04,338 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=227336.66666666666, ans=0.1 2023-10-10 03:38:11,607 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.28 vs. 
limit=22.5 2023-10-10 03:38:17,853 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=227383.33333333334, ans=0.125 2023-10-10 03:38:22,563 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=227383.33333333334, ans=0.125 2023-10-10 03:38:30,149 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.760e+02 1.933e+02 2.192e+02 2.980e+02, threshold=3.867e+02, percent-clipped=0.0 2023-10-10 03:38:30,595 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=227430.0, ans=0.125 2023-10-10 03:39:07,250 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=227570.0, ans=0.1 2023-10-10 03:39:10,193 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=227616.66666666666, ans=0.0 2023-10-10 03:39:26,893 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=227663.33333333334, ans=0.09899494936611666 2023-10-10 03:39:32,047 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=227663.33333333334, ans=0.125 2023-10-10 03:39:37,575 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.86 vs. limit=15.0 2023-10-10 03:40:08,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=227803.33333333334, ans=0.04949747468305833 2023-10-10 03:40:09,266 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=227803.33333333334, ans=0.0 2023-10-10 03:40:26,998 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.776e+02 1.999e+02 2.186e+02 3.042e+02, threshold=3.998e+02, percent-clipped=0.0 2023-10-10 03:40:33,746 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=227943.33333333334, ans=0.07 2023-10-10 03:41:28,498 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.82 vs. limit=15.0 2023-10-10 03:41:28,523 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.72 vs. limit=22.5 2023-10-10 03:41:41,959 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=228223.33333333334, ans=0.125 2023-10-10 03:41:49,737 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=228223.33333333334, ans=0.125 2023-10-10 03:41:52,421 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=228270.0, ans=0.125 2023-10-10 03:42:03,927 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.69 vs. 
limit=15.0 2023-10-10 03:42:06,462 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=228316.66666666666, ans=0.125 2023-10-10 03:42:12,219 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=228316.66666666666, ans=0.125 2023-10-10 03:42:20,632 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.704e+02 1.926e+02 2.230e+02 2.916e+02, threshold=3.853e+02, percent-clipped=0.0 2023-10-10 03:42:21,902 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=228363.33333333334, ans=0.125 2023-10-10 03:42:31,674 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=228410.0, ans=0.1 2023-10-10 03:42:31,877 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=228410.0, ans=0.125 2023-10-10 03:42:32,608 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=228410.0, ans=0.1 2023-10-10 03:42:51,985 INFO [train.py:1031] (0/4) Epoch 4, batch 8000, loss[loss=0.2212, simple_loss=0.3054, pruned_loss=0.0685, over 16931.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3247, pruned_loss=0.08272, over 32242209.51 frames. ], batch size: 77, lr: 9.85e-03, grad_scale: 32.0 2023-10-10 03:42:59,686 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=228503.33333333334, ans=0.0 2023-10-10 03:42:59,773 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=228503.33333333334, ans=0.125 2023-10-10 03:43:05,985 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=228550.0, ans=0.125 2023-10-10 03:43:18,350 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=228596.66666666666, ans=0.0 2023-10-10 03:43:25,118 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=228643.33333333334, ans=0.2 2023-10-10 03:43:51,681 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=228736.66666666666, ans=0.125 2023-10-10 03:43:55,844 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=228783.33333333334, ans=0.125 2023-10-10 03:43:56,073 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.34 vs. 
limit=15.0 2023-10-10 03:43:58,472 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=228783.33333333334, ans=0.125 2023-10-10 03:44:04,360 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=228783.33333333334, ans=0.1 2023-10-10 03:44:05,305 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=228783.33333333334, ans=0.1 2023-10-10 03:44:10,020 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=228830.0, ans=0.125 2023-10-10 03:44:11,505 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.799e+02 2.032e+02 2.377e+02 3.642e+02, threshold=4.065e+02, percent-clipped=0.0 2023-10-10 03:44:41,915 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=228970.0, ans=0.125 2023-10-10 03:44:43,257 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.43 vs. limit=22.5 2023-10-10 03:44:43,869 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=228970.0, ans=0.05 2023-10-10 03:45:31,476 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=229203.33333333334, ans=0.0 2023-10-10 03:46:02,868 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=229296.66666666666, ans=0.1 2023-10-10 03:46:10,744 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.456e+02 1.852e+02 2.105e+02 2.392e+02 3.838e+02, threshold=4.209e+02, percent-clipped=0.0 2023-10-10 03:46:13,774 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=229296.66666666666, ans=0.2 2023-10-10 03:46:19,910 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=229343.33333333334, ans=0.125 2023-10-10 03:46:22,426 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.86 vs. limit=10.0 2023-10-10 03:46:34,615 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=229390.0, ans=0.0 2023-10-10 03:46:34,761 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=229390.0, ans=0.0 2023-10-10 03:46:47,218 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.56 vs. 
limit=22.5 2023-10-10 03:47:05,541 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=229530.0, ans=0.1 2023-10-10 03:47:10,606 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=229530.0, ans=0.125 2023-10-10 03:47:33,487 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=229623.33333333334, ans=0.0 2023-10-10 03:48:05,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=229763.33333333334, ans=0.125 2023-10-10 03:48:06,098 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.358e+02 1.789e+02 1.998e+02 2.469e+02 3.648e+02, threshold=3.996e+02, percent-clipped=0.0 2023-10-10 03:48:10,456 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.39 vs. limit=10.0 2023-10-10 03:48:31,764 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=7.694e-03 2023-10-10 03:49:05,265 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=229996.66666666666, ans=0.125 2023-10-10 03:49:14,280 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=230043.33333333334, ans=0.2 2023-10-10 03:49:20,140 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=230090.0, ans=0.125 2023-10-10 03:49:29,414 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=230090.0, ans=0.0 2023-10-10 03:49:58,677 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.818e+02 2.057e+02 2.273e+02 3.432e+02, threshold=4.114e+02, percent-clipped=0.0 2023-10-10 03:50:21,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=230323.33333333334, ans=0.07 2023-10-10 03:50:24,744 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=230323.33333333334, ans=15.0 2023-10-10 03:50:28,550 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=230370.0, ans=0.2 2023-10-10 03:50:33,782 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=230370.0, ans=0.0 2023-10-10 03:50:50,597 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=230416.66666666666, ans=0.0 2023-10-10 03:50:59,186 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=230463.33333333334, ans=0.125 2023-10-10 03:51:11,447 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=230510.0, ans=0.0 2023-10-10 03:51:12,386 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=230510.0, ans=0.125 2023-10-10 03:51:26,601 INFO [scaling.py:979] (0/4) Whitening: 
name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.47 vs. limit=15.0 2023-10-10 03:51:44,574 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=230650.0, ans=0.125 2023-10-10 03:51:45,595 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=230650.0, ans=0.125 2023-10-10 03:51:54,041 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 1.747e+02 2.004e+02 2.385e+02 3.225e+02, threshold=4.009e+02, percent-clipped=0.0 2023-10-10 03:51:54,366 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 03:52:04,260 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=230743.33333333334, ans=0.125 2023-10-10 03:52:28,046 INFO [train.py:1031] (0/4) Epoch 4, batch 8500, loss[loss=0.2316, simple_loss=0.2878, pruned_loss=0.08767, over 12838.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3249, pruned_loss=0.08245, over 32396839.77 frames. ], batch size: 440, lr: 9.80e-03, grad_scale: 32.0 2023-10-10 03:52:29,549 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.23 vs. limit=15.0 2023-10-10 03:52:31,248 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=230836.66666666666, ans=0.125 2023-10-10 03:52:32,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=230836.66666666666, ans=0.0 2023-10-10 03:52:37,461 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=230836.66666666666, ans=0.125 2023-10-10 03:53:10,839 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=230976.66666666666, ans=0.09899494936611666 2023-10-10 03:53:14,965 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.83 vs. 
limit=22.5 2023-10-10 03:53:22,531 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=231023.33333333334, ans=0.125 2023-10-10 03:53:24,401 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=231070.0, ans=0.125 2023-10-10 03:53:54,990 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.393e+02 2.024e+02 2.239e+02 2.613e+02 4.171e+02, threshold=4.477e+02, percent-clipped=2.0 2023-10-10 03:54:02,409 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=231210.0, ans=0.0 2023-10-10 03:54:27,301 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=231303.33333333334, ans=0.125 2023-10-10 03:54:33,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=231303.33333333334, ans=0.0 2023-10-10 03:54:43,958 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=231350.0, ans=0.0 2023-10-10 03:54:54,038 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=231396.66666666666, ans=0.2 2023-10-10 03:55:00,329 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=231396.66666666666, ans=0.125 2023-10-10 03:55:04,496 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=231443.33333333334, ans=0.125 2023-10-10 03:55:08,895 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=231443.33333333334, ans=0.125 2023-10-10 03:55:13,197 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.64 vs. limit=12.0 2023-10-10 03:55:14,015 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=231490.0, ans=0.125 2023-10-10 03:55:27,848 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.06 vs. 
limit=12.0 2023-10-10 03:55:33,525 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=231536.66666666666, ans=0.2 2023-10-10 03:55:40,101 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=231583.33333333334, ans=0.0 2023-10-10 03:55:42,041 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=231583.33333333334, ans=0.125 2023-10-10 03:55:56,805 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.336e+02 1.708e+02 1.890e+02 2.186e+02 3.030e+02, threshold=3.779e+02, percent-clipped=0.0 2023-10-10 03:56:08,982 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=231676.66666666666, ans=0.0 2023-10-10 03:56:08,988 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=231676.66666666666, ans=0.125 2023-10-10 03:56:26,614 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=24.46 vs. limit=22.5 2023-10-10 03:56:56,584 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.39 vs. limit=15.0 2023-10-10 03:57:18,141 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=231956.66666666666, ans=0.125 2023-10-10 03:57:43,194 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=232050.0, ans=0.1 2023-10-10 03:57:54,454 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=232096.66666666666, ans=0.09899494936611666 2023-10-10 03:57:58,712 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.355e+02 1.651e+02 1.866e+02 2.101e+02 3.312e+02, threshold=3.732e+02, percent-clipped=0.0 2023-10-10 03:58:11,404 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=232143.33333333334, ans=0.2 2023-10-10 03:58:14,602 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=232190.0, ans=0.125 2023-10-10 03:58:25,916 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=232236.66666666666, ans=0.0 2023-10-10 03:58:37,981 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=232283.33333333334, ans=0.2 2023-10-10 03:58:45,354 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=232330.0, ans=0.0 2023-10-10 03:58:48,824 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=232330.0, ans=0.1 2023-10-10 03:59:10,422 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=232423.33333333334, ans=0.125 2023-10-10 03:59:11,329 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 03:59:18,431 INFO [scaling.py:199] (0/4) 
ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=232470.0, ans=0.0 2023-10-10 03:59:22,109 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=232470.0, ans=0.125 2023-10-10 03:59:42,793 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.765e+02 1.915e+02 2.217e+02 3.216e+02, threshold=3.830e+02, percent-clipped=0.0 2023-10-10 04:00:01,633 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=232656.66666666666, ans=0.125 2023-10-10 04:00:15,785 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.46 vs. limit=12.0 2023-10-10 04:00:38,092 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=232796.66666666666, ans=0.0 2023-10-10 04:00:39,303 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.73 vs. limit=22.5 2023-10-10 04:00:39,926 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=232796.66666666666, ans=0.125 2023-10-10 04:00:45,746 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=232843.33333333334, ans=0.125 2023-10-10 04:00:50,070 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=232843.33333333334, ans=0.025 2023-10-10 04:00:52,928 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=232890.0, ans=0.1 2023-10-10 04:00:53,812 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=232890.0, ans=0.125 2023-10-10 04:00:57,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=232890.0, ans=0.1 2023-10-10 04:01:08,683 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=232936.66666666666, ans=0.2 2023-10-10 04:01:34,038 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.471e+02 1.886e+02 2.150e+02 2.647e+02 3.755e+02, threshold=4.299e+02, percent-clipped=0.0 2023-10-10 04:01:40,755 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=233076.66666666666, ans=0.125 2023-10-10 04:01:54,286 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.min_positive, batch_count=233123.33333333334, ans=0.025 2023-10-10 04:01:58,595 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=233170.0, ans=0.2 2023-10-10 04:01:59,216 INFO [train.py:1031] (0/4) Epoch 4, batch 9000, loss[loss=0.2394, simple_loss=0.3289, pruned_loss=0.07494, over 16708.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.324, pruned_loss=0.08215, over 32481308.84 frames. 
], batch size: 81, lr: 9.75e-03, grad_scale: 16.0 2023-10-10 04:02:04,373 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=233170.0, ans=0.07 2023-10-10 04:02:19,041 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=233216.66666666666, ans=0.125 2023-10-10 04:02:21,847 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=233263.33333333334, ans=0.125 2023-10-10 04:02:21,855 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=233263.33333333334, ans=0.125 2023-10-10 04:02:42,042 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=233356.66666666666, ans=0.125 2023-10-10 04:02:45,718 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=233356.66666666666, ans=0.0 2023-10-10 04:03:02,851 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.54 vs. limit=15.0 2023-10-10 04:03:05,446 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=233450.0, ans=0.0 2023-10-10 04:03:15,128 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.09 vs. limit=15.0 2023-10-10 04:03:17,865 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.451e+02 1.810e+02 2.071e+02 2.319e+02 3.313e+02, threshold=4.143e+02, percent-clipped=0.0 2023-10-10 04:04:17,661 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=233776.66666666666, ans=0.1 2023-10-10 04:04:50,606 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=233916.66666666666, ans=0.125 2023-10-10 04:05:00,795 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=233963.33333333334, ans=0.125 2023-10-10 04:05:04,388 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.480e+02 1.852e+02 2.032e+02 2.274e+02 3.789e+02, threshold=4.063e+02, percent-clipped=0.0 2023-10-10 04:05:09,098 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten.whitening_limit, batch_count=234010.0, ans=15.0 2023-10-10 04:05:17,642 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=234056.66666666666, ans=0.1 2023-10-10 04:05:19,462 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=234056.66666666666, ans=0.1 2023-10-10 04:05:20,537 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=234056.66666666666, ans=0.0 2023-10-10 04:05:21,394 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=234056.66666666666, ans=0.125 2023-10-10 04:05:24,751 INFO [scaling.py:199] (0/4) 
ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=234056.66666666666, ans=0.0 2023-10-10 04:05:24,871 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.66 vs. limit=22.5 2023-10-10 04:05:53,967 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 04:06:00,129 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 04:06:21,528 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=234336.66666666666, ans=0.125 2023-10-10 04:06:27,869 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=234336.66666666666, ans=0.0 2023-10-10 04:06:29,815 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=234383.33333333334, ans=0.125 2023-10-10 04:06:39,909 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=234383.33333333334, ans=0.0 2023-10-10 04:06:42,161 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.73 vs. limit=15.0 2023-10-10 04:06:47,917 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.529e+02 1.850e+02 2.060e+02 2.509e+02 3.758e+02, threshold=4.120e+02, percent-clipped=0.0 2023-10-10 04:06:53,019 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=234476.66666666666, ans=0.125 2023-10-10 04:06:59,442 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=234476.66666666666, ans=0.1 2023-10-10 04:07:00,432 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=234476.66666666666, ans=0.2 2023-10-10 04:07:10,372 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=234523.33333333334, ans=0.125 2023-10-10 04:07:13,751 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=234570.0, ans=0.1 2023-10-10 04:07:14,697 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=234570.0, ans=0.0 2023-10-10 04:07:20,652 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=234570.0, ans=0.2 2023-10-10 04:07:35,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=234663.33333333334, ans=0.2 2023-10-10 04:07:36,672 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=234663.33333333334, ans=0.2 2023-10-10 04:07:39,367 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=234663.33333333334, ans=0.125 2023-10-10 04:08:02,765 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=234756.66666666666, ans=0.125 2023-10-10 
04:08:32,482 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=234850.0, ans=0.125 2023-10-10 04:08:32,788 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.49 vs. limit=12.0 2023-10-10 04:08:47,518 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.885e+02 2.132e+02 2.388e+02 3.548e+02, threshold=4.263e+02, percent-clipped=0.0 2023-10-10 04:08:53,413 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.20 vs. limit=15.0 2023-10-10 04:08:59,749 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=234943.33333333334, ans=0.125 2023-10-10 04:09:05,944 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=234990.0, ans=0.1 2023-10-10 04:09:14,049 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=235036.66666666666, ans=0.125 2023-10-10 04:09:23,422 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=235083.33333333334, ans=0.0 2023-10-10 04:09:38,141 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=235130.0, ans=0.0 2023-10-10 04:09:43,032 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.13 vs. limit=12.0 2023-10-10 04:09:52,411 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.95 vs. limit=15.0 2023-10-10 04:09:56,924 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.03 vs. 
limit=15.0 2023-10-10 04:10:08,980 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=235270.0, ans=22.5 2023-10-10 04:10:20,044 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=235316.66666666666, ans=0.5 2023-10-10 04:10:30,521 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=235363.33333333334, ans=0.125 2023-10-10 04:10:39,194 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.530e+02 1.815e+02 1.961e+02 2.330e+02 3.458e+02, threshold=3.921e+02, percent-clipped=0.0 2023-10-10 04:10:39,352 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=235363.33333333334, ans=0.125 2023-10-10 04:10:41,753 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=235363.33333333334, ans=0.125 2023-10-10 04:10:44,948 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=235410.0, ans=0.125 2023-10-10 04:10:48,656 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=235410.0, ans=0.125 2023-10-10 04:11:08,301 INFO [train.py:1031] (0/4) Epoch 4, batch 9500, loss[loss=0.2554, simple_loss=0.3435, pruned_loss=0.08361, over 16675.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3248, pruned_loss=0.08256, over 32542416.39 frames. ], batch size: 241, lr: 9.70e-03, grad_scale: 32.0 2023-10-10 04:11:15,302 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=235503.33333333334, ans=0.1 2023-10-10 04:11:21,258 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=235550.0, ans=0.0 2023-10-10 04:11:27,297 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=235550.0, ans=0.125 2023-10-10 04:11:33,677 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=235596.66666666666, ans=0.0 2023-10-10 04:12:03,781 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 04:12:15,932 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=235783.33333333334, ans=0.125 2023-10-10 04:12:26,736 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=235830.0, ans=0.0 2023-10-10 04:12:30,801 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.332e+02 1.840e+02 2.026e+02 2.333e+02 4.133e+02, threshold=4.052e+02, percent-clipped=1.0 2023-10-10 04:12:57,636 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=235970.0, ans=0.1 2023-10-10 04:13:09,563 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=236016.66666666666, ans=0.025 2023-10-10 04:13:10,391 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=236016.66666666666, ans=0.1 2023-10-10 04:13:28,930 INFO [scaling.py:199] (0/4) 
ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=236063.33333333334, ans=0.125 2023-10-10 04:14:11,147 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=236250.0, ans=0.0 2023-10-10 04:14:22,625 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.524e+02 1.773e+02 2.111e+02 2.427e+02 4.167e+02, threshold=4.221e+02, percent-clipped=1.0 2023-10-10 04:14:49,705 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 04:14:49,816 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=236436.66666666666, ans=0.0 2023-10-10 04:14:58,102 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.055e-02 2023-10-10 04:15:22,306 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=236576.66666666666, ans=0.1 2023-10-10 04:15:23,684 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.00 vs. limit=15.0 2023-10-10 04:15:24,392 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-10 04:15:25,444 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=236576.66666666666, ans=0.0 2023-10-10 04:15:37,325 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=236623.33333333334, ans=0.035 2023-10-10 04:15:55,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=236716.66666666666, ans=0.0 2023-10-10 04:16:03,080 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=236716.66666666666, ans=0.0 2023-10-10 04:16:11,651 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.379e+02 1.787e+02 1.935e+02 2.241e+02 2.810e+02, threshold=3.871e+02, percent-clipped=0.0 2023-10-10 04:16:12,597 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=236763.33333333334, ans=15.0 2023-10-10 04:16:18,786 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=236810.0, ans=0.1 2023-10-10 04:16:33,378 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=236856.66666666666, ans=0.125 2023-10-10 04:16:41,166 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.90 vs. 
limit=10.0 2023-10-10 04:16:44,475 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=236903.33333333334, ans=0.2 2023-10-10 04:16:50,601 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=236950.0, ans=0.125 2023-10-10 04:16:52,841 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=236950.0, ans=0.035 2023-10-10 04:17:00,241 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=236996.66666666666, ans=0.95 2023-10-10 04:17:16,233 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=16.45 vs. limit=15.0 2023-10-10 04:17:16,358 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.28 vs. limit=15.0 2023-10-10 04:17:17,364 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.89 vs. limit=22.5 2023-10-10 04:17:18,848 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=237043.33333333334, ans=0.125 2023-10-10 04:17:22,444 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=237090.0, ans=0.0 2023-10-10 04:17:24,756 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.72 vs. limit=15.0 2023-10-10 04:17:35,363 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=237136.66666666666, ans=0.0 2023-10-10 04:17:39,532 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.02 vs. limit=15.0 2023-10-10 04:17:49,306 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=237183.33333333334, ans=0.2 2023-10-10 04:17:57,796 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.46 vs. limit=15.0 2023-10-10 04:18:01,123 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.399e+02 1.719e+02 1.865e+02 2.083e+02 3.014e+02, threshold=3.729e+02, percent-clipped=0.0 2023-10-10 04:18:05,313 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=237276.66666666666, ans=0.1 2023-10-10 04:18:39,190 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=237416.66666666666, ans=0.05 2023-10-10 04:18:41,783 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=237416.66666666666, ans=0.125 2023-10-10 04:19:01,973 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 04:19:13,589 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.28 vs. 
limit=15.0 2023-10-10 04:19:15,967 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=237556.66666666666, ans=0.0 2023-10-10 04:19:18,322 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=237556.66666666666, ans=0.125 2023-10-10 04:19:18,504 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.36 vs. limit=10.0 2023-10-10 04:19:24,002 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 04:19:48,410 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.377e+02 1.767e+02 1.940e+02 2.292e+02 3.148e+02, threshold=3.881e+02, percent-clipped=0.0 2023-10-10 04:19:54,408 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=237743.33333333334, ans=0.2 2023-10-10 04:20:07,603 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=237790.0, ans=0.025 2023-10-10 04:20:12,425 INFO [train.py:1031] (0/4) Epoch 4, batch 10000, loss[loss=0.2348, simple_loss=0.3105, pruned_loss=0.07953, over 16528.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3237, pruned_loss=0.08192, over 32608183.33 frames. ], batch size: 266, lr: 9.66e-03, grad_scale: 32.0 2023-10-10 04:20:29,284 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=237883.33333333334, ans=0.2 2023-10-10 04:20:39,856 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.39 vs. limit=22.5 2023-10-10 04:21:01,816 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=238023.33333333334, ans=0.1 2023-10-10 04:21:11,167 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=238070.0, ans=0.0 2023-10-10 04:21:11,516 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.08 vs. limit=6.0 2023-10-10 04:21:38,938 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.748e+02 1.967e+02 2.205e+02 3.261e+02, threshold=3.934e+02, percent-clipped=0.0 2023-10-10 04:21:48,702 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=238210.0, ans=0.0 2023-10-10 04:21:49,723 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=238210.0, ans=0.125 2023-10-10 04:21:51,461 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=238210.0, ans=0.125 2023-10-10 04:21:57,748 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=238256.66666666666, ans=0.125 2023-10-10 04:22:02,185 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.47 vs. 
limit=6.0 2023-10-10 04:22:26,865 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=238350.0, ans=0.0 2023-10-10 04:22:45,396 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=238443.33333333334, ans=0.125 2023-10-10 04:22:49,042 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=238490.0, ans=0.125 2023-10-10 04:22:57,161 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.00 vs. limit=15.0 2023-10-10 04:23:25,583 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.363e+02 1.768e+02 1.963e+02 2.245e+02 3.262e+02, threshold=3.926e+02, percent-clipped=0.0 2023-10-10 04:23:35,046 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.71 vs. limit=12.0 2023-10-10 04:23:36,829 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.79 vs. limit=15.0 2023-10-10 04:23:37,364 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=238676.66666666666, ans=0.125 2023-10-10 04:23:42,964 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=238723.33333333334, ans=0.125 2023-10-10 04:24:27,997 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=238863.33333333334, ans=0.0 2023-10-10 04:24:29,621 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=238910.0, ans=0.125 2023-10-10 04:24:34,395 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.55 vs. limit=15.0 2023-10-10 04:24:36,404 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=238910.0, ans=0.125 2023-10-10 04:24:41,071 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=238956.66666666666, ans=0.09899494936611666 2023-10-10 04:24:42,927 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=238956.66666666666, ans=0.125 2023-10-10 04:24:49,883 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=239003.33333333334, ans=0.2 2023-10-10 04:25:03,248 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=239050.0, ans=0.125 2023-10-10 04:25:07,047 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=239050.0, ans=0.1 2023-10-10 04:25:18,388 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.40 vs. 
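The Whitening lines compare a per-module "metric" against a scheduled limit. One standard whitening measure consistent with the values seen here is n * trace(C @ C) / trace(C)**2 for the feature covariance C: it is 1.0 when the covariance is proportional to the identity (fully decorrelated, equal-variance channels) and approaches num_channels as the features collapse toward rank one. A sketch under that assumption, not the scaling.py implementation:

```python
import torch

def whitening_metric(x: torch.Tensor) -> torch.Tensor:
    # x: (frames, channels); center, form the covariance, compare traces.
    x = x - x.mean(dim=0)
    cov = x.t() @ x / x.shape[0]
    n = cov.shape[0]
    return n * (cov @ cov).diagonal().sum() / cov.diagonal().sum() ** 2

white = torch.randn(10000, 256)
print(whitening_metric(white))                         # ~1.0: already white
print(whitening_metric(white[:, :1].expand(-1, 256)))  # ~256: rank-one features
```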
limit=15.0 2023-10-10 04:25:22,798 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.793e+02 1.904e+02 2.266e+02 3.101e+02, threshold=3.808e+02, percent-clipped=0.0 2023-10-10 04:25:24,815 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=239096.66666666666, ans=0.05 2023-10-10 04:26:02,849 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=239283.33333333334, ans=0.1 2023-10-10 04:26:14,933 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=239330.0, ans=0.1 2023-10-10 04:26:27,462 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=239376.66666666666, ans=0.0 2023-10-10 04:26:29,397 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=239376.66666666666, ans=0.1 2023-10-10 04:27:19,124 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.291e+02 1.664e+02 1.873e+02 2.084e+02 2.779e+02, threshold=3.747e+02, percent-clipped=0.0 2023-10-10 04:27:19,507 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=239563.33333333334, ans=0.125 2023-10-10 04:27:50,580 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=239703.33333333334, ans=0.125 2023-10-10 04:27:55,795 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=239750.0, ans=0.125 2023-10-10 04:27:56,128 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=239750.0, ans=15.0 2023-10-10 04:28:19,239 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.48 vs. limit=22.5 2023-10-10 04:28:26,419 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=239843.33333333334, ans=0.125 2023-10-10 04:28:29,008 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=239843.33333333334, ans=0.0 2023-10-10 04:28:51,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=239936.66666666666, ans=0.1 2023-10-10 04:28:51,480 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.83 vs. 
limit=12.0 2023-10-10 04:28:53,025 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=239983.33333333334, ans=0.1 2023-10-10 04:28:54,939 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=239983.33333333334, ans=0.125 2023-10-10 04:29:07,038 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=240030.0, ans=0.0 2023-10-10 04:29:12,240 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.877e+02 2.043e+02 2.342e+02 3.951e+02, threshold=4.085e+02, percent-clipped=1.0 2023-10-10 04:29:14,849 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=240076.66666666666, ans=0.125 2023-10-10 04:29:35,579 INFO [train.py:1031] (0/4) Epoch 4, batch 10500, loss[loss=0.2199, simple_loss=0.3079, pruned_loss=0.0659, over 16919.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3238, pruned_loss=0.0817, over 32674966.81 frames. ], batch size: 77, lr: 9.61e-03, grad_scale: 32.0 2023-10-10 04:29:43,120 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=240170.0, ans=0.07 2023-10-10 04:30:33,325 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.46 vs. limit=15.0 2023-10-10 04:31:06,776 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.346e+02 1.791e+02 1.942e+02 2.135e+02 3.272e+02, threshold=3.884e+02, percent-clipped=0.0 2023-10-10 04:31:08,482 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=240543.33333333334, ans=0.025 2023-10-10 04:31:08,590 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=240543.33333333334, ans=0.2 2023-10-10 04:31:33,913 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=240636.66666666666, ans=0.125 2023-10-10 04:31:38,901 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=240636.66666666666, ans=0.125 2023-10-10 04:32:08,626 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=240776.66666666666, ans=0.125 2023-10-10 04:32:48,591 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.40 vs. limit=22.5 2023-10-10 04:32:52,966 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=240963.33333333334, ans=0.0 2023-10-10 04:32:59,383 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.55 vs. 
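In the train.py summary lines, `loss[...]` is the current batch (weighted by its frame count) and `tot_loss[...]` is a frame-weighted aggregate over recent batches, which is why its frame count grows slowly rather than cumulatively. One plausible shape for such a tracker; the decay constant and class name are assumptions, not icefall's MetricsTracker:

```python
# Batch losses weighted by frame counts, with a decay so the aggregate
# follows recent batches instead of growing without bound. Illustrative.
class RunningLoss:
    def __init__(self, decay: float = 0.9995):
        self.decay = decay
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, loss: float, num_frames: float) -> None:
        self.loss_sum = self.decay * self.loss_sum + loss * num_frames
        self.frames = self.decay * self.frames + num_frames

    @property
    def tot_loss(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)
```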
limit=15.0 2023-10-10 04:33:02,153 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.383e+02 1.801e+02 1.986e+02 2.288e+02 3.141e+02, threshold=3.973e+02, percent-clipped=0.0 2023-10-10 04:33:03,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=240963.33333333334, ans=0.1 2023-10-10 04:33:06,827 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=241010.0, ans=0.0 2023-10-10 04:33:16,102 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=241056.66666666666, ans=0.125 2023-10-10 04:33:26,508 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=241056.66666666666, ans=0.125 2023-10-10 04:33:46,798 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=241150.0, ans=0.04949747468305833 2023-10-10 04:33:46,895 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=241150.0, ans=0.125 2023-10-10 04:34:16,392 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.86 vs. limit=15.0 2023-10-10 04:34:17,208 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=241290.0, ans=0.04949747468305833 2023-10-10 04:34:23,520 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=241336.66666666666, ans=0.125 2023-10-10 04:34:26,759 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=16.63 vs. 
limit=15.0 2023-10-10 04:34:52,823 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=241430.0, ans=0.125 2023-10-10 04:34:56,244 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.499e+02 1.761e+02 1.994e+02 2.358e+02 4.255e+02, threshold=3.987e+02, percent-clipped=1.0 2023-10-10 04:34:56,660 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=241430.0, ans=0.09899494936611666 2023-10-10 04:35:11,270 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=241523.33333333334, ans=0.125 2023-10-10 04:35:31,314 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=241616.66666666666, ans=0.125 2023-10-10 04:35:33,117 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=241616.66666666666, ans=0.0 2023-10-10 04:36:05,602 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=241756.66666666666, ans=0.5 2023-10-10 04:36:13,071 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=241803.33333333334, ans=0.125 2023-10-10 04:36:19,401 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=241803.33333333334, ans=0.0 2023-10-10 04:36:32,579 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=241850.0, ans=0.125 2023-10-10 04:36:35,185 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=241896.66666666666, ans=0.0 2023-10-10 04:36:36,294 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=241896.66666666666, ans=0.0 2023-10-10 04:36:44,810 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.369e+02 1.867e+02 2.195e+02 2.598e+02 3.792e+02, threshold=4.389e+02, percent-clipped=0.0 2023-10-10 04:36:49,823 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=241943.33333333334, ans=0.0 2023-10-10 04:36:54,904 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=241943.33333333334, ans=0.125 2023-10-10 04:37:21,452 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=242083.33333333334, ans=0.0 2023-10-10 04:37:21,760 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.18 vs. 
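The Balancer entries (min_positive, max_positive, min_abs, max_abs, prob) constrain per-channel activation statistics: the fraction of positive activations should fall in [min_positive, max_positive] and the mean absolute value in [min_abs, max_abs], enforced through a gradient correction applied with the scheduled probability `prob`. A measurement-only sketch of the constraints being checked; the corrective gradient itself is omitted:

```python
import torch

def balancer_violations(x: torch.Tensor,
                        min_positive=0.05, max_positive=0.95,
                        min_abs=0.2, max_abs=10.0) -> dict:
    # x: (frames, channels); count channels outside each constraint.
    frac_pos = (x > 0).float().mean(dim=0)
    mean_abs = x.abs().mean(dim=0)
    return {
        "too_rarely_positive": int((frac_pos < min_positive).sum()),
        "too_often_positive":  int((frac_pos > max_positive).sum()),
        "too_small":           int((mean_abs < min_abs).sum()),
        "too_large":           int((mean_abs > max_abs).sum()),
    }
```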
limit=22.5 2023-10-10 04:37:51,059 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=242176.66666666666, ans=0.125 2023-10-10 04:38:07,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=242270.0, ans=0.1 2023-10-10 04:38:09,474 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=242270.0, ans=0.125 2023-10-10 04:38:14,701 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=242316.66666666666, ans=0.0 2023-10-10 04:38:18,322 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.30 vs. limit=22.5 2023-10-10 04:38:23,306 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=242316.66666666666, ans=0.0 2023-10-10 04:38:25,563 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.17 vs. limit=15.0 2023-10-10 04:38:26,413 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=242363.33333333334, ans=0.2 2023-10-10 04:38:34,112 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.328e+02 1.783e+02 2.042e+02 2.535e+02 4.634e+02, threshold=4.084e+02, percent-clipped=1.0 2023-10-10 04:38:42,238 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=242410.0, ans=0.1 2023-10-10 04:38:57,435 INFO [train.py:1031] (0/4) Epoch 4, batch 11000, loss[loss=0.2171, simple_loss=0.2984, pruned_loss=0.06788, over 16570.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3235, pruned_loss=0.08162, over 32677219.30 frames. ], batch size: 66, lr: 9.56e-03, grad_scale: 32.0 2023-10-10 04:38:57,754 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=242503.33333333334, ans=0.125 2023-10-10 04:38:57,978 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.70 vs. limit=10.0 2023-10-10 04:39:03,456 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=242503.33333333334, ans=0.0 2023-10-10 04:39:04,644 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.69 vs. 
limit=22.5 2023-10-10 04:39:16,598 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=242596.66666666666, ans=0.0 2023-10-10 04:39:23,609 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=242596.66666666666, ans=0.2 2023-10-10 04:39:28,619 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=242643.33333333334, ans=0.125 2023-10-10 04:39:47,173 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 04:39:49,153 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=242690.0, ans=0.125 2023-10-10 04:40:18,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=242830.0, ans=0.125 2023-10-10 04:40:24,623 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=242830.0, ans=0.07 2023-10-10 04:40:25,344 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.803e+02 2.098e+02 2.590e+02 4.045e+02, threshold=4.197e+02, percent-clipped=0.0 2023-10-10 04:40:26,109 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=242876.66666666666, ans=0.125 2023-10-10 04:40:26,190 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=242876.66666666666, ans=0.125 2023-10-10 04:40:30,083 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.05 vs. limit=10.0 2023-10-10 04:40:30,646 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=242876.66666666666, ans=0.125 2023-10-10 04:40:33,459 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=242876.66666666666, ans=0.0 2023-10-10 04:40:41,704 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=242923.33333333334, ans=0.125 2023-10-10 04:40:43,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=242923.33333333334, ans=0.025 2023-10-10 04:40:50,328 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.59 vs. limit=22.5 2023-10-10 04:41:04,696 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=243016.66666666666, ans=0.125 2023-10-10 04:41:09,981 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.49 vs. 
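The bypass.scale_min and bypass.skip_rate entries refer to residual bypass around each layer: the layer output is mixed with the layer input through a learned per-channel scale clamped below by scale_min, and during training the whole layer may also be skipped with the scheduled skip_rate. A sketch of the mixing step; names and the initial scale are illustrative, not the Zipformer source:

```python
import torch

class Bypass(torch.nn.Module):
    def __init__(self, channels: int, scale_min: float = 0.2):
        super().__init__()
        self.scale_min = scale_min
        self.scale = torch.nn.Parameter(torch.full((channels,), 0.5))

    def forward(self, x_in: torch.Tensor, x_out: torch.Tensor) -> torch.Tensor:
        s = self.scale.clamp(min=self.scale_min, max=1.0)
        return x_in + s * (x_out - x_in)   # s=0: pure bypass; s=1: pure layer
```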
limit=22.5 2023-10-10 04:41:18,083 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=243063.33333333334, ans=0.1 2023-10-10 04:41:22,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=243063.33333333334, ans=0.5 2023-10-10 04:41:29,010 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=243110.0, ans=0.2 2023-10-10 04:41:38,047 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=243110.0, ans=0.125 2023-10-10 04:42:04,986 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.80 vs. limit=12.0 2023-10-10 04:42:14,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=243250.0, ans=0.125 2023-10-10 04:42:15,666 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=243250.0, ans=0.125 2023-10-10 04:42:18,396 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=243296.66666666666, ans=0.125 2023-10-10 04:42:28,878 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.899e+02 2.243e+02 2.653e+02 3.643e+02, threshold=4.486e+02, percent-clipped=0.0 2023-10-10 04:42:29,259 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=243343.33333333334, ans=0.125 2023-10-10 04:42:38,093 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.53 vs. limit=15.0 2023-10-10 04:42:50,921 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=243436.66666666666, ans=0.0 2023-10-10 04:43:11,096 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=243530.0, ans=0.0 2023-10-10 04:43:13,355 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.62 vs. limit=10.0 2023-10-10 04:43:19,962 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=243530.0, ans=0.0 2023-10-10 04:43:25,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=243576.66666666666, ans=0.1 2023-10-10 04:43:26,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=243576.66666666666, ans=0.2 2023-10-10 04:43:43,607 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=243670.0, ans=0.09899494936611666 2023-10-10 04:43:54,122 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.26 vs. 
limit=15.0 2023-10-10 04:44:19,437 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.811e+02 2.031e+02 2.306e+02 3.256e+02, threshold=4.062e+02, percent-clipped=0.0 2023-10-10 04:44:37,506 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=243856.66666666666, ans=0.09899494936611666 2023-10-10 04:45:00,993 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=243950.0, ans=0.0 2023-10-10 04:45:04,678 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=243950.0, ans=0.0 2023-10-10 04:45:07,747 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.73 vs. limit=6.0 2023-10-10 04:45:09,751 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.56 vs. limit=15.0 2023-10-10 04:45:22,783 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=244043.33333333334, ans=0.125 2023-10-10 04:45:35,839 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.56 vs. limit=15.0 2023-10-10 04:45:47,509 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=244136.66666666666, ans=0.125 2023-10-10 04:46:02,668 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=244230.0, ans=0.1 2023-10-10 04:46:06,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=244230.0, ans=0.125 2023-10-10 04:46:11,686 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.445e+02 1.701e+02 1.908e+02 2.179e+02 3.314e+02, threshold=3.816e+02, percent-clipped=0.0 2023-10-10 04:46:15,194 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=1.80 vs. limit=15.0 2023-10-10 04:46:19,737 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=244276.66666666666, ans=0.2 2023-10-10 04:46:27,650 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=244323.33333333334, ans=0.0 2023-10-10 04:46:29,519 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=244323.33333333334, ans=0.125 2023-10-10 04:46:37,741 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=244370.0, ans=0.125 2023-10-10 04:46:37,810 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=244370.0, ans=0.0 2023-10-10 04:46:43,288 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=16.72 vs. 
limit=15.0 2023-10-10 04:46:44,055 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=244370.0, ans=0.125 2023-10-10 04:46:45,953 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.61 vs. limit=12.0 2023-10-10 04:47:13,827 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=244510.0, ans=0.2 2023-10-10 04:47:38,175 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.06 vs. limit=12.0 2023-10-10 04:47:54,972 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=244696.66666666666, ans=10.0 2023-10-10 04:47:55,283 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=244696.66666666666, ans=22.5 2023-10-10 04:47:58,869 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=244696.66666666666, ans=0.125 2023-10-10 04:48:02,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=244696.66666666666, ans=0.125 2023-10-10 04:48:05,891 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.471e+02 1.827e+02 2.026e+02 2.318e+02 3.857e+02, threshold=4.051e+02, percent-clipped=1.0 2023-10-10 04:48:21,077 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.73 vs. limit=15.0 2023-10-10 04:48:23,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=244790.0, ans=0.1 2023-10-10 04:48:28,122 INFO [train.py:1031] (0/4) Epoch 4, batch 11500, loss[loss=0.2687, simple_loss=0.3407, pruned_loss=0.09837, over 16036.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3227, pruned_loss=0.08118, over 32707733.07 frames. ], batch size: 296, lr: 9.52e-03, grad_scale: 32.0 2023-10-10 04:48:30,230 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=244836.66666666666, ans=0.0 2023-10-10 04:48:36,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=244836.66666666666, ans=0.09899494936611666 2023-10-10 04:48:44,663 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.80 vs. limit=10.0 2023-10-10 04:49:14,292 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.06 vs. 
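The grad_scale field in the batch summaries is the loss-scaling factor from fp16 mixed-precision training. A generic PyTorch AMP step showing where that value comes from; model, optimizer, loss_fn and the batch are placeholders, and this is standard torch.cuda.amp usage rather than the icefall training loop itself:

```python
import torch

scaler = torch.cuda.amp.GradScaler(init_scale=32.0)

def train_step(model, optimizer, loss_fn, features, targets):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = loss_fn(model(features), targets)
    scaler.scale(loss).backward()  # scale up so fp16 grads don't underflow
    scaler.step(optimizer)         # unscales; skips the step on inf/nan grads
    scaler.update()                # adjusts the scale, i.e. the logged value
    return loss.detach()
```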
limit=15.0 2023-10-10 04:49:20,670 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=245023.33333333334, ans=0.125 2023-10-10 04:49:37,024 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=245116.66666666666, ans=0.0 2023-10-10 04:49:40,716 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=245116.66666666666, ans=0.125 2023-10-10 04:49:48,375 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=245163.33333333334, ans=0.1 2023-10-10 04:49:59,805 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.764e+02 1.942e+02 2.126e+02 2.780e+02, threshold=3.885e+02, percent-clipped=0.0 2023-10-10 04:50:08,688 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 04:50:08,999 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.30 vs. limit=6.0 2023-10-10 04:50:10,597 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.59 vs. limit=15.0 2023-10-10 04:50:21,787 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=245256.66666666666, ans=0.1 2023-10-10 04:50:30,541 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.85 vs. limit=15.0 2023-10-10 04:50:39,356 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.83 vs. limit=15.0 2023-10-10 04:50:40,098 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=245350.0, ans=0.1 2023-10-10 04:50:41,374 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.73 vs. limit=10.0 2023-10-10 04:50:44,538 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.08 vs. limit=22.5 2023-10-10 04:51:13,367 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.98 vs. limit=22.5 2023-10-10 04:51:29,231 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=245536.66666666666, ans=0.05 2023-10-10 04:51:53,886 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.770e+02 1.988e+02 2.245e+02 3.333e+02, threshold=3.976e+02, percent-clipped=0.0 2023-10-10 04:51:57,698 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.14 vs. limit=15.0 2023-10-10 04:52:15,184 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.05 vs. 
limit=15.0 2023-10-10 04:52:36,156 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=245863.33333333334, ans=0.125 2023-10-10 04:52:38,046 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 04:52:44,108 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.72 vs. limit=15.0 2023-10-10 04:52:47,383 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=245910.0, ans=0.125 2023-10-10 04:53:08,621 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=245956.66666666666, ans=0.0 2023-10-10 04:53:46,507 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=246096.66666666666, ans=0.2 2023-10-10 04:53:50,323 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=246096.66666666666, ans=0.0 2023-10-10 04:53:53,471 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 1.861e+02 2.098e+02 2.398e+02 3.266e+02, threshold=4.196e+02, percent-clipped=0.0 2023-10-10 04:54:03,368 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.22 vs. limit=12.0 2023-10-10 04:54:06,169 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=246190.0, ans=0.0 2023-10-10 04:54:11,387 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.89 vs. limit=22.5 2023-10-10 04:54:14,352 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=246190.0, ans=0.125 2023-10-10 04:54:14,720 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.39 vs. limit=6.0 2023-10-10 04:54:17,672 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=246236.66666666666, ans=0.125 2023-10-10 04:54:34,391 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=246283.33333333334, ans=0.0 2023-10-10 04:55:04,329 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.80 vs. limit=12.0 2023-10-10 04:55:13,137 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=246423.33333333334, ans=0.125 2023-10-10 04:55:24,984 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=246470.0, ans=0.125 2023-10-10 04:55:48,529 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.403e+02 1.845e+02 2.115e+02 2.356e+02 3.431e+02, threshold=4.230e+02, percent-clipped=0.0 2023-10-10 04:56:00,819 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.86 vs. 
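The whiten_keys entries apply the whitening measure groupwise: the key channels are split into num_groups blocks and the metric is averaged over blocks. A sketch under that assumption, using the same trace-based metric as above:

```python
import torch

def grouped_whitening_metric(x: torch.Tensor, num_groups: int) -> torch.Tensor:
    frames, channels = x.shape
    xg = x.reshape(frames, num_groups, channels // num_groups)
    xg = xg - xg.mean(dim=0, keepdim=True)
    cov = torch.einsum("fgc,fgd->gcd", xg, xg) / frames    # per-group covariance
    n = cov.shape[-1]
    tr_c2 = (cov @ cov).diagonal(dim1=-2, dim2=-1).sum(-1) # trace(C @ C)
    tr_c = cov.diagonal(dim1=-2, dim2=-1).sum(-1)          # trace(C)
    return (n * tr_c2 / tr_c ** 2).mean()
```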
limit=6.0 2023-10-10 04:56:08,318 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=246656.66666666666, ans=0.125 2023-10-10 04:56:20,844 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=246703.33333333334, ans=0.0 2023-10-10 04:56:25,580 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=246750.0, ans=0.125 2023-10-10 04:56:33,517 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=246750.0, ans=0.125 2023-10-10 04:56:43,021 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=246796.66666666666, ans=0.125 2023-10-10 04:56:51,221 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=246843.33333333334, ans=0.0 2023-10-10 04:56:54,730 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=246890.0, ans=0.1 2023-10-10 04:57:09,400 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=246936.66666666666, ans=0.1 2023-10-10 04:57:42,224 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=247076.66666666666, ans=0.1 2023-10-10 04:57:42,754 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.758e+02 1.941e+02 2.126e+02 3.214e+02, threshold=3.882e+02, percent-clipped=0.0 2023-10-10 04:57:55,586 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=247123.33333333334, ans=0.0 2023-10-10 04:57:55,901 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.43 vs. limit=15.0 2023-10-10 04:58:03,453 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.07 vs. limit=15.0 2023-10-10 04:58:04,515 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=247170.0, ans=0.1 2023-10-10 04:58:05,084 INFO [train.py:1031] (0/4) Epoch 4, batch 12000, loss[loss=0.244, simple_loss=0.3311, pruned_loss=0.0785, over 16909.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3224, pruned_loss=0.08055, over 32739518.00 frames. ], batch size: 165, lr: 9.48e-03, grad_scale: 32.0 2023-10-10 04:58:15,611 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=247170.0, ans=0.0 2023-10-10 04:58:20,350 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.81 vs. 
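The out_proj.dropout_p entries show that dropout probabilities are themselves scheduled values here. Wiring a schedule into dropout is a single lookup per forward pass; `schedule` can be any object with a value(batch_count) method, such as the ScheduledFloat sketched earlier, and the trainer is assumed to advance batch_count:

```python
import torch

class ScheduledDropout(torch.nn.Module):
    def __init__(self, schedule):
        super().__init__()
        self.schedule = schedule
        self.batch_count = 0.0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        p = self.schedule.value(self.batch_count)
        return torch.nn.functional.dropout(x, p=p, training=self.training)
```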
limit=10.0 2023-10-10 04:58:22,432 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=247216.66666666666, ans=0.05 2023-10-10 04:58:22,445 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=247216.66666666666, ans=0.0 2023-10-10 04:58:37,688 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=247263.33333333334, ans=0.125 2023-10-10 04:58:44,743 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=247310.0, ans=0.1 2023-10-10 04:58:55,767 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=247356.66666666666, ans=0.1 2023-10-10 04:59:00,677 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 04:59:09,481 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.92 vs. limit=22.5 2023-10-10 04:59:18,816 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=247450.0, ans=0.125 2023-10-10 04:59:35,082 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=247496.66666666666, ans=0.1 2023-10-10 04:59:35,926 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=247543.33333333334, ans=0.125 2023-10-10 04:59:36,755 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.331e+02 1.741e+02 1.994e+02 2.295e+02 3.182e+02, threshold=3.989e+02, percent-clipped=0.0 2023-10-10 04:59:37,984 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=247543.33333333334, ans=0.125 2023-10-10 04:59:46,130 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=247590.0, ans=0.125 2023-10-10 04:59:52,408 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=247590.0, ans=0.125 2023-10-10 05:00:05,002 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=247636.66666666666, ans=0.125 2023-10-10 05:00:40,827 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=247823.33333333334, ans=0.125 2023-10-10 05:00:45,183 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=247823.33333333334, ans=0.0 2023-10-10 05:00:45,956 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=247823.33333333334, ans=0.95 2023-10-10 05:01:01,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=247916.66666666666, ans=0.0 2023-10-10 05:01:03,423 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=247916.66666666666, ans=0.2 2023-10-10 05:01:03,672 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, 
num_channels=384, metric=5.59 vs. limit=15.0 2023-10-10 05:01:17,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=247963.33333333334, ans=0.04949747468305833 2023-10-10 05:01:19,272 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=247963.33333333334, ans=0.1 2023-10-10 05:01:22,413 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=248010.0, ans=0.0 2023-10-10 05:01:22,704 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.47 vs. limit=15.0 2023-10-10 05:01:22,872 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.735e+02 1.975e+02 2.171e+02 3.178e+02, threshold=3.950e+02, percent-clipped=0.0 2023-10-10 05:01:26,207 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.04 vs. limit=22.5 2023-10-10 05:01:36,813 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.91 vs. limit=10.0 2023-10-10 05:01:39,662 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 05:01:47,436 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=248103.33333333334, ans=0.125 2023-10-10 05:01:57,224 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=248150.0, ans=0.1 2023-10-10 05:02:22,438 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=248243.33333333334, ans=0.1 2023-10-10 05:02:23,424 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=248243.33333333334, ans=0.1 2023-10-10 05:02:28,024 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=248290.0, ans=0.125 2023-10-10 05:02:34,285 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.98 vs. limit=15.0 2023-10-10 05:02:35,297 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.37 vs. limit=6.0 2023-10-10 05:02:38,747 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.65 vs. limit=8.0 2023-10-10 05:02:42,644 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.88 vs. 
limit=15.0 2023-10-10 05:03:08,542 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.479e+02 1.797e+02 1.984e+02 2.225e+02 3.698e+02, threshold=3.967e+02, percent-clipped=0.0 2023-10-10 05:03:16,959 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=248476.66666666666, ans=0.125 2023-10-10 05:03:31,791 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=248570.0, ans=0.125 2023-10-10 05:03:35,745 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=248570.0, ans=0.125 2023-10-10 05:03:40,131 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=248570.0, ans=0.0 2023-10-10 05:04:44,624 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=248850.0, ans=0.0 2023-10-10 05:04:54,979 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.02 vs. limit=10.0 2023-10-10 05:04:55,751 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.60 vs. limit=22.5 2023-10-10 05:04:58,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=248896.66666666666, ans=0.2 2023-10-10 05:05:02,958 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.769e+02 2.035e+02 2.348e+02 3.620e+02, threshold=4.071e+02, percent-clipped=0.0 2023-10-10 05:05:07,574 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=248943.33333333334, ans=0.0 2023-10-10 05:05:15,270 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=248990.0, ans=0.125 2023-10-10 05:05:27,533 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=249036.66666666666, ans=0.0 2023-10-10 05:05:34,154 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=249036.66666666666, ans=0.1 2023-10-10 05:05:59,612 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=249176.66666666666, ans=0.2 2023-10-10 05:06:13,052 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=249223.33333333334, ans=0.0 2023-10-10 05:06:16,098 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.54 vs. 
limit=22.5 2023-10-10 05:06:38,070 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=249316.66666666666, ans=0.125 2023-10-10 05:06:40,434 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=249316.66666666666, ans=0.125 2023-10-10 05:06:42,289 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=249316.66666666666, ans=0.125 2023-10-10 05:06:51,124 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=249363.33333333334, ans=0.125 2023-10-10 05:06:53,926 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.77 vs. limit=15.0 2023-10-10 05:06:54,260 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.768e+02 1.991e+02 2.226e+02 3.170e+02, threshold=3.982e+02, percent-clipped=0.0 2023-10-10 05:07:01,735 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=249410.0, ans=0.125 2023-10-10 05:07:18,497 INFO [train.py:1031] (0/4) Epoch 4, batch 12500, loss[loss=0.2601, simple_loss=0.3336, pruned_loss=0.09333, over 16605.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.322, pruned_loss=0.08066, over 32723551.66 frames. ], batch size: 56, lr: 9.43e-03, grad_scale: 32.0 2023-10-10 05:07:28,318 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=249550.0, ans=0.125 2023-10-10 05:07:32,405 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=249550.0, ans=0.125 2023-10-10 05:07:33,368 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=249550.0, ans=0.125 2023-10-10 05:07:39,803 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=249596.66666666666, ans=0.0 2023-10-10 05:07:50,342 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=249643.33333333334, ans=0.2 2023-10-10 05:07:53,994 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=249643.33333333334, ans=0.125 2023-10-10 05:08:13,107 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 05:08:43,924 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.549e+02 1.722e+02 1.899e+02 2.216e+02 4.050e+02, threshold=3.797e+02, percent-clipped=1.0 2023-10-10 05:08:45,652 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=249876.66666666666, ans=0.0 2023-10-10 05:09:02,805 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=249923.33333333334, ans=0.125 2023-10-10 05:09:07,920 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=249970.0, ans=0.0 2023-10-10 05:09:08,830 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, 
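The "WithLoss ... loss-sum=0.000e+00" lines track an auxiliary penalty attached to a module's output (here, attention weights); a zero sum means the penalty is currently inactive. A generic pattern for attaching such a side loss and collecting it after the forward pass; this is an illustrative pattern, not the scaling.py mechanism:

```python
import torch

class WithAuxLoss(torch.nn.Module):
    def __init__(self, module: torch.nn.Module, penalty_fn):
        super().__init__()
        self.module = module
        self.penalty_fn = penalty_fn
        self.loss_sum = torch.tensor(0.0)

    def forward(self, *args, **kwargs):
        out = self.module(*args, **kwargs)
        self.loss_sum = self.penalty_fn(out)  # trainer reads this, adds to loss
        return out
```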
batch_count=249970.0, ans=0.125 2023-10-10 05:09:32,858 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=250063.33333333334, ans=0.125 2023-10-10 05:09:34,993 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=250063.33333333334, ans=0.1 2023-10-10 05:09:50,603 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=250156.66666666666, ans=0.0 2023-10-10 05:09:56,953 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=250156.66666666666, ans=0.0 2023-10-10 05:09:57,154 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.96 vs. limit=15.0 2023-10-10 05:10:06,288 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.24 vs. limit=15.0 2023-10-10 05:10:11,870 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=250203.33333333334, ans=0.125 2023-10-10 05:10:16,323 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=250250.0, ans=0.0 2023-10-10 05:10:23,307 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.37 vs. limit=15.0 2023-10-10 05:10:33,930 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=250343.33333333334, ans=0.125 2023-10-10 05:10:34,554 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.401e+02 1.710e+02 2.008e+02 2.335e+02 3.389e+02, threshold=4.015e+02, percent-clipped=0.0 2023-10-10 05:10:34,814 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=250343.33333333334, ans=0.1 2023-10-10 05:11:06,615 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=250483.33333333334, ans=0.125 2023-10-10 05:11:07,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=250483.33333333334, ans=0.5 2023-10-10 05:11:07,617 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=250483.33333333334, ans=0.125 2023-10-10 05:11:08,466 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=250483.33333333334, ans=0.0 2023-10-10 05:11:09,437 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=250483.33333333334, ans=0.125 2023-10-10 05:11:11,324 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=250483.33333333334, ans=0.125 2023-10-10 05:11:16,915 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=250530.0, ans=0.09899494936611666 2023-10-10 05:11:29,638 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, 
batch_count=250576.66666666666, ans=0.2 2023-10-10 05:11:31,960 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=250576.66666666666, ans=0.1 2023-10-10 05:11:33,933 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=250576.66666666666, ans=0.1 2023-10-10 05:11:37,765 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=250623.33333333334, ans=0.125 2023-10-10 05:11:40,565 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=250623.33333333334, ans=0.125 2023-10-10 05:11:45,818 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=250623.33333333334, ans=0.0 2023-10-10 05:11:59,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=250670.0, ans=0.125 2023-10-10 05:12:01,739 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0 2023-10-10 05:12:02,236 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=250716.66666666666, ans=0.125 2023-10-10 05:12:04,026 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 05:12:21,668 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.414e+02 1.810e+02 2.009e+02 2.245e+02 3.158e+02, threshold=4.019e+02, percent-clipped=0.0 2023-10-10 05:12:27,684 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=250810.0, ans=0.0 2023-10-10 05:12:37,168 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=250856.66666666666, ans=0.1 2023-10-10 05:12:39,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=250856.66666666666, ans=0.0 2023-10-10 05:12:43,412 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=250903.33333333334, ans=0.125 2023-10-10 05:13:04,221 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=250996.66666666666, ans=0.125 2023-10-10 05:13:08,616 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.87 vs. limit=22.5 2023-10-10 05:13:16,100 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=251043.33333333334, ans=10.0 2023-10-10 05:13:36,324 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.07 vs. 
limit=15.0 2023-10-10 05:13:56,694 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=251183.33333333334, ans=0.0 2023-10-10 05:14:10,643 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.306e+02 1.750e+02 1.943e+02 2.150e+02 2.998e+02, threshold=3.885e+02, percent-clipped=0.0 2023-10-10 05:14:11,886 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=251276.66666666666, ans=0.1 2023-10-10 05:14:18,534 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=12.0 2023-10-10 05:14:22,450 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=251323.33333333334, ans=0.0 2023-10-10 05:14:24,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=251323.33333333334, ans=0.125 2023-10-10 05:14:31,849 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=251370.0, ans=0.1 2023-10-10 05:14:32,934 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.89 vs. limit=22.5 2023-10-10 05:14:35,874 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.94 vs. limit=6.0 2023-10-10 05:14:40,303 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=251370.0, ans=15.0 2023-10-10 05:14:45,879 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.61 vs. limit=15.0 2023-10-10 05:14:48,084 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=251416.66666666666, ans=0.1 2023-10-10 05:14:49,216 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.71 vs. limit=15.0 2023-10-10 05:14:53,758 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=251463.33333333334, ans=0.09899494936611666 2023-10-10 05:14:57,734 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=251463.33333333334, ans=0.1 2023-10-10 05:15:12,610 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=251510.0, ans=0.0 2023-10-10 05:15:13,787 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.74 vs. limit=15.0 2023-10-10 05:15:14,745 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.94 vs. limit=15.0 2023-10-10 05:15:15,776 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.48 vs. 
limit=15.0 2023-10-10 05:15:17,049 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=251556.66666666666, ans=0.125 2023-10-10 05:15:28,057 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.15 vs. limit=15.0 2023-10-10 05:15:35,365 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=251603.33333333334, ans=0.0 2023-10-10 05:15:39,628 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.88 vs. limit=15.0 2023-10-10 05:15:53,371 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.59 vs. limit=15.0 2023-10-10 05:15:57,913 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.438e+02 1.825e+02 2.034e+02 2.506e+02 3.898e+02, threshold=4.069e+02, percent-clipped=1.0 2023-10-10 05:16:19,817 INFO [train.py:1031] (0/4) Epoch 4, batch 13000, loss[loss=0.2394, simple_loss=0.318, pruned_loss=0.08037, over 16918.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3226, pruned_loss=0.08093, over 32736479.60 frames. ], batch size: 138, lr: 9.39e-03, grad_scale: 32.0 2023-10-10 05:16:21,123 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=251836.66666666666, ans=0.1 2023-10-10 05:16:48,468 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.94 vs. limit=22.5 2023-10-10 05:16:52,840 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=251930.0, ans=0.1 2023-10-10 05:16:59,308 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=251976.66666666666, ans=0.125 2023-10-10 05:17:06,866 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=2.99 vs. limit=12.0 2023-10-10 05:17:07,671 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=251976.66666666666, ans=0.125 2023-10-10 05:17:08,040 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.32 vs. 
limit=15.0 2023-10-10 05:17:32,690 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=252116.66666666666, ans=0.04949747468305833 2023-10-10 05:17:52,472 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=252163.33333333334, ans=0.0 2023-10-10 05:17:57,802 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.570e+02 1.781e+02 1.954e+02 2.240e+02 3.918e+02, threshold=3.909e+02, percent-clipped=0.0 2023-10-10 05:18:00,165 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=252210.0, ans=0.0 2023-10-10 05:18:21,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=252303.33333333334, ans=0.125 2023-10-10 05:18:36,103 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=252350.0, ans=0.05 2023-10-10 05:18:39,217 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.67 vs. limit=15.0 2023-10-10 05:18:48,530 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=252396.66666666666, ans=0.1 2023-10-10 05:19:12,054 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=252490.0, ans=0.0 2023-10-10 05:19:13,431 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=252490.0, ans=15.0 2023-10-10 05:19:14,750 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=252490.0, ans=0.125 2023-10-10 05:19:23,066 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=252536.66666666666, ans=0.125 2023-10-10 05:19:34,033 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=252583.33333333334, ans=0.2 2023-10-10 05:19:42,004 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=252630.0, ans=0.125 2023-10-10 05:19:47,244 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 05:19:51,594 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.405e+02 1.785e+02 2.025e+02 2.230e+02 3.366e+02, threshold=4.051e+02, percent-clipped=0.0 2023-10-10 05:20:04,449 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=252676.66666666666, ans=0.125 2023-10-10 05:20:48,751 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.88 vs. limit=6.0 2023-10-10 05:20:53,336 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.82 vs. limit=15.0 2023-10-10 05:20:56,192 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.40 vs. 
limit=15.0 2023-10-10 05:20:59,085 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.17 vs. limit=22.5 2023-10-10 05:21:01,299 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=252956.66666666666, ans=0.0 2023-10-10 05:21:12,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=253003.33333333334, ans=0.035 2023-10-10 05:21:14,109 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=253003.33333333334, ans=0.125 2023-10-10 05:21:16,282 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.88 vs. limit=22.5 2023-10-10 05:21:25,339 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=253050.0, ans=0.125 2023-10-10 05:21:26,410 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=253050.0, ans=0.125 2023-10-10 05:21:33,238 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=253096.66666666666, ans=0.125 2023-10-10 05:21:41,547 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=253096.66666666666, ans=0.5 2023-10-10 05:21:44,888 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.871e+02 2.080e+02 2.573e+02 4.064e+02, threshold=4.160e+02, percent-clipped=1.0 2023-10-10 05:21:53,223 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=253143.33333333334, ans=0.1 2023-10-10 05:21:55,418 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=253190.0, ans=15.0 2023-10-10 05:22:08,152 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=253236.66666666666, ans=0.125 2023-10-10 05:22:23,592 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=253283.33333333334, ans=0.125 2023-10-10 05:22:59,833 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=253423.33333333334, ans=0.125 2023-10-10 05:23:09,943 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.43 vs. limit=15.0 2023-10-10 05:23:16,774 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.27 vs. 
limit=15.0 2023-10-10 05:23:27,804 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=253563.33333333334, ans=0.125 2023-10-10 05:23:37,061 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.575e+02 2.057e+02 2.336e+02 2.783e+02 4.529e+02, threshold=4.671e+02, percent-clipped=2.0 2023-10-10 05:23:38,203 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=253610.0, ans=0.125 2023-10-10 05:23:54,826 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 05:24:19,610 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=253796.66666666666, ans=0.0 2023-10-10 05:24:19,659 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=253796.66666666666, ans=0.0 2023-10-10 05:24:30,809 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=253843.33333333334, ans=0.2 2023-10-10 05:24:31,692 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 05:24:39,235 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.68 vs. limit=22.5 2023-10-10 05:24:41,293 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=253890.0, ans=0.125 2023-10-10 05:24:41,360 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=253890.0, ans=0.07 2023-10-10 05:24:41,711 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.19 vs. limit=12.0 2023-10-10 05:24:51,821 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=253936.66666666666, ans=0.02 2023-10-10 05:24:56,443 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=253936.66666666666, ans=0.125 2023-10-10 05:25:22,265 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.454e+02 1.758e+02 1.911e+02 2.264e+02 2.946e+02, threshold=3.823e+02, percent-clipped=0.0 2023-10-10 05:25:36,617 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.65 vs. limit=6.0 2023-10-10 05:25:42,326 INFO [train.py:1031] (0/4) Epoch 4, batch 13500, loss[loss=0.2279, simple_loss=0.3161, pruned_loss=0.0698, over 16954.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3217, pruned_loss=0.08052, over 32731612.46 frames. ], batch size: 82, lr: 9.35e-03, grad_scale: 16.0 2023-10-10 05:25:57,586 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=254216.66666666666, ans=0.125 2023-10-10 05:26:19,902 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.43 vs. 
limit=22.5 2023-10-10 05:26:31,365 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=254356.66666666666, ans=0.0 2023-10-10 05:26:34,625 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=254356.66666666666, ans=0.1 2023-10-10 05:26:43,089 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=254403.33333333334, ans=0.0 2023-10-10 05:27:12,797 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.332e+02 1.759e+02 1.982e+02 2.253e+02 2.975e+02, threshold=3.964e+02, percent-clipped=0.0 2023-10-10 05:27:13,353 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1.whitening_limit, batch_count=254543.33333333334, ans=10.0 2023-10-10 05:27:15,133 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=254543.33333333334, ans=0.125 2023-10-10 05:27:18,863 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=254543.33333333334, ans=0.2 2023-10-10 05:27:18,921 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=254543.33333333334, ans=0.05 2023-10-10 05:27:25,290 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.53 vs. limit=15.0 2023-10-10 05:27:26,653 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=254590.0, ans=0.125 2023-10-10 05:27:31,847 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=254636.66666666666, ans=0.125 2023-10-10 05:27:42,989 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.874e-02 2023-10-10 05:27:57,855 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.13 vs. limit=22.5 2023-10-10 05:28:00,570 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.98 vs. limit=6.0 2023-10-10 05:28:07,866 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=254823.33333333334, ans=0.125 2023-10-10 05:28:23,976 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/epoch-4.pt 2023-10-10 05:28:52,472 INFO [train.py:1031] (0/4) Epoch 5, batch 0, loss[loss=0.2184, simple_loss=0.2981, pruned_loss=0.0693, over 16805.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2981, pruned_loss=0.0693, over 16805.00 frames. ], batch size: 188, lr: 8.17e-03, grad_scale: 32.0 2023-10-10 05:28:52,473 INFO [train.py:1054] (0/4) Computing validation loss 2023-10-10 05:29:00,182 INFO [train.py:1063] (0/4) Epoch 5, validation: loss=0.2397, simple_loss=0.3257, pruned_loss=0.07681, over 1020973.00 frames. 
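The recurring [optim.py:471] entries in this log report gradient-clipping statistics: five quantiles (min, 25%, median, 75%, max) of recently observed gradient norms, the clipping threshold, and the percentage of batches clipped. The logged values are consistent with threshold = Clipping_scale x median (e.g. threshold=4.015e+02 with quartiles 1.401e+02 1.710e+02 2.008e+02 2.335e+02 3.389e+02 and Clipping_scale=2.0). The following is a minimal sketch of how such a line could be produced under that assumption; it is an illustrative reconstruction, not the actual icefall optim.py code, and the helper name clip_and_log is hypothetical.

import torch

def clip_and_log(parameters, norm_history, clipping_scale=2.0, max_history=1000):
    # Clip gradients at clipping_scale * median of recently observed grad
    # norms, and print a stats line shaped like the optim.py entries above.
    # (Assumed logic; the real icefall code may differ in details such as
    # how often stats are printed and how the history is maintained.)
    params = [p for p in parameters if p.grad is not None]
    if not params:
        return
    # Total gradient norm for this batch.
    norm = torch.norm(torch.stack([p.grad.detach().norm() for p in params])).item()
    norm_history.append(norm)
    del norm_history[:-max_history]  # keep a bounded window of recent norms

    # Five quantiles: min, 25%, median, 75%, max -- matching the five
    # numbers printed after "grad-norm quartiles" in the log.
    q = torch.quantile(torch.tensor(norm_history),
                       torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * q[2].item()  # clipping_scale * median
    pct = 100.0 * sum(n > threshold for n in norm_history) / len(norm_history)
    print(f"Clipping_scale={clipping_scale}, grad-norm quartiles "
          + " ".join(f"{v:.3e}" for v in q.tolist())
          + f", threshold={threshold:.3e}, percent-clipped={pct:.1f}")

    # Rescale gradients in place if this batch's norm exceeds the threshold.
    if norm > threshold:
        for p in params:
            p.grad.mul_(threshold / norm)

# Hypothetical usage inside a training loop:
#   norm_history = []
#   loss.backward()
#   clip_and_log(model.parameters(), norm_history)
#   optimizer.step()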
2023-10-10 05:29:00,183 INFO [train.py:1064] (0/4) Maximum memory allocated so far is 17165MB 2023-10-10 05:29:21,654 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.97 vs. limit=6.0 2023-10-10 05:29:28,064 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.422e+02 1.854e+02 2.007e+02 2.304e+02 3.822e+02, threshold=4.014e+02, percent-clipped=0.0 2023-10-10 05:29:30,834 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=254986.66666666666, ans=0.025 2023-10-10 05:29:31,797 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=255033.33333333334, ans=0.125 2023-10-10 05:29:55,734 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=255126.66666666666, ans=0.0 2023-10-10 05:30:26,917 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.44 vs. limit=15.0 2023-10-10 05:30:36,858 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=255266.66666666666, ans=0.0 2023-10-10 05:30:40,976 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=255313.33333333334, ans=0.2 2023-10-10 05:30:54,299 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=255360.0, ans=0.1 2023-10-10 05:31:04,903 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=255406.66666666666, ans=0.125 2023-10-10 05:31:04,974 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=255406.66666666666, ans=0.0 2023-10-10 05:31:18,964 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=255453.33333333334, ans=0.125 2023-10-10 05:31:19,442 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.365e+02 1.663e+02 1.776e+02 2.118e+02 3.146e+02, threshold=3.553e+02, percent-clipped=0.0 2023-10-10 05:31:59,070 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=255640.0, ans=0.0 2023-10-10 05:32:08,204 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=255686.66666666666, ans=0.125 2023-10-10 05:32:20,951 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=255733.33333333334, ans=0.0 2023-10-10 05:32:24,357 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=255733.33333333334, ans=0.05 2023-10-10 05:32:24,383 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=255733.33333333334, ans=0.0 2023-10-10 05:32:29,260 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=255780.0, ans=0.125 2023-10-10 05:32:42,864 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=255826.66666666666, ans=0.1 2023-10-10 05:32:50,566 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=255873.33333333334, ans=0.1 2023-10-10 05:32:50,611 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=255873.33333333334, ans=0.0 2023-10-10 05:32:52,749 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=255873.33333333334, ans=0.09899494936611666 2023-10-10 05:32:57,530 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=255873.33333333334, ans=0.125 2023-10-10 05:32:57,654 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=255873.33333333334, ans=0.0 2023-10-10 05:33:02,279 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=255920.0, ans=0.2 2023-10-10 05:33:05,574 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.571e+02 1.871e+02 2.164e+02 2.479e+02 3.459e+02, threshold=4.327e+02, percent-clipped=0.0 2023-10-10 05:33:32,858 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.51 vs. limit=15.0 2023-10-10 05:33:54,379 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.30 vs. limit=10.0 2023-10-10 05:34:12,037 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=256200.0, ans=0.0 2023-10-10 05:34:12,901 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 05:34:23,437 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=256246.66666666666, ans=0.125 2023-10-10 05:34:26,643 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=256246.66666666666, ans=0.125 2023-10-10 05:34:41,727 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=256340.0, ans=0.2 2023-10-10 05:34:53,609 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=256386.66666666666, ans=0.125 2023-10-10 05:34:57,521 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.829e+02 2.034e+02 2.392e+02 3.401e+02, threshold=4.069e+02, percent-clipped=0.0 2023-10-10 05:35:02,328 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.80 vs. 
limit=15.0 2023-10-10 05:35:29,953 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=256526.66666666666, ans=0.0 2023-10-10 05:35:35,880 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=256573.33333333334, ans=0.1 2023-10-10 05:35:41,016 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=256573.33333333334, ans=0.125 2023-10-10 05:35:45,271 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=256620.0, ans=0.125 2023-10-10 05:35:55,089 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=256666.66666666666, ans=0.125 2023-10-10 05:35:55,226 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=256666.66666666666, ans=0.0 2023-10-10 05:36:04,529 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=256713.33333333334, ans=0.125 2023-10-10 05:36:33,973 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=256806.66666666666, ans=0.125 2023-10-10 05:36:41,837 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=256853.33333333334, ans=0.1 2023-10-10 05:36:42,557 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.468e+02 1.905e+02 2.404e+02 2.784e+02 3.937e+02, threshold=4.808e+02, percent-clipped=0.0 2023-10-10 05:36:57,177 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.64 vs. limit=6.0 2023-10-10 05:37:03,485 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=256946.66666666666, ans=0.125 2023-10-10 05:37:13,708 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=256993.33333333334, ans=0.125 2023-10-10 05:37:16,139 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=256993.33333333334, ans=0.0 2023-10-10 05:37:34,302 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=257086.66666666666, ans=0.1 2023-10-10 05:37:47,837 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.53 vs. limit=6.0 2023-10-10 05:37:53,704 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.24 vs. limit=22.5 2023-10-10 05:37:56,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=257180.0, ans=0.125 2023-10-10 05:37:59,408 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=257180.0, ans=0.125 2023-10-10 05:38:07,209 INFO [train.py:1031] (0/4) Epoch 5, batch 500, loss[loss=0.2231, simple_loss=0.297, pruned_loss=0.0746, over 15427.00 frames. 
], tot_loss[loss=0.2381, simple_loss=0.3196, pruned_loss=0.07837, over 7288672.64 frames. ], batch size: 35, lr: 8.14e-03, grad_scale: 32.0 2023-10-10 05:38:10,365 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=257226.66666666666, ans=0.125 2023-10-10 05:38:17,400 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=257273.33333333334, ans=0.0 2023-10-10 05:38:19,351 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=257273.33333333334, ans=0.125 2023-10-10 05:38:31,686 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.60 vs. limit=15.0 2023-10-10 05:38:35,909 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.710e+02 1.903e+02 2.059e+02 2.708e+02, threshold=3.807e+02, percent-clipped=0.0 2023-10-10 05:38:37,995 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=257320.0, ans=0.125 2023-10-10 05:38:38,082 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=257320.0, ans=0.0 2023-10-10 05:38:50,490 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.37 vs. limit=22.5 2023-10-10 05:38:57,609 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=257413.33333333334, ans=0.125 2023-10-10 05:39:02,213 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.75 vs. limit=10.0 2023-10-10 05:39:03,851 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=257460.0, ans=0.125 2023-10-10 05:39:10,432 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=257506.66666666666, ans=0.125 2023-10-10 05:39:10,568 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=257506.66666666666, ans=0.125 2023-10-10 05:39:12,942 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=257506.66666666666, ans=0.5 2023-10-10 05:39:17,523 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=257506.66666666666, ans=0.2 2023-10-10 05:39:37,668 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 05:39:38,566 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=257600.0, ans=0.125 2023-10-10 05:39:59,818 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.16 vs. limit=10.0 2023-10-10 05:40:11,808 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.15 vs. 
limit=15.0 2023-10-10 05:40:23,218 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.783e+02 2.012e+02 2.365e+02 2.915e+02, threshold=4.024e+02, percent-clipped=0.0 2023-10-10 05:40:23,718 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=257786.66666666666, ans=0.2 2023-10-10 05:40:50,666 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.47 vs. limit=15.0 2023-10-10 05:40:55,472 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=257926.66666666666, ans=0.125 2023-10-10 05:40:59,597 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.66 vs. limit=22.5 2023-10-10 05:41:01,303 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=257973.33333333334, ans=0.0 2023-10-10 05:41:08,510 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=257973.33333333334, ans=0.1 2023-10-10 05:41:18,570 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=258020.0, ans=0.1 2023-10-10 05:41:21,757 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=258020.0, ans=0.0 2023-10-10 05:41:23,905 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.72 vs. limit=15.0 2023-10-10 05:41:25,850 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.12 vs. limit=10.0 2023-10-10 05:41:44,297 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=258160.0, ans=0.0 2023-10-10 05:41:51,308 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=258160.0, ans=0.125 2023-10-10 05:41:56,156 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=258206.66666666666, ans=0.1 2023-10-10 05:42:12,201 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.484e+02 1.825e+02 2.056e+02 2.493e+02 3.427e+02, threshold=4.113e+02, percent-clipped=0.0 2023-10-10 05:42:50,350 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=258393.33333333334, ans=0.125 2023-10-10 05:42:51,350 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=258393.33333333334, ans=0.0 2023-10-10 05:43:11,266 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=258486.66666666666, ans=0.0 2023-10-10 05:43:32,654 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.67 vs. 
limit=5.0 2023-10-10 05:43:39,386 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=258626.66666666666, ans=0.125 2023-10-10 05:44:01,618 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=258720.0, ans=0.015 2023-10-10 05:44:06,340 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.422e+02 1.753e+02 1.933e+02 2.301e+02 3.796e+02, threshold=3.865e+02, percent-clipped=0.0 2023-10-10 05:44:22,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=258766.66666666666, ans=0.125 2023-10-10 05:44:49,427 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=258906.66666666666, ans=0.0 2023-10-10 05:44:57,627 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.40 vs. limit=15.0 2023-10-10 05:44:58,584 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.00 vs. limit=15.0 2023-10-10 05:45:01,078 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=258953.33333333334, ans=0.125 2023-10-10 05:45:37,159 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.044e-02 2023-10-10 05:45:53,413 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.11 vs. limit=15.0 2023-10-10 05:46:02,430 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.353e+02 1.729e+02 1.900e+02 2.226e+02 3.301e+02, threshold=3.801e+02, percent-clipped=0.0 2023-10-10 05:46:13,059 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=259233.33333333334, ans=0.125 2023-10-10 05:46:16,443 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.92 vs. limit=15.0 2023-10-10 05:46:24,054 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=259280.0, ans=0.125 2023-10-10 05:46:35,581 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=259326.66666666666, ans=0.125 2023-10-10 05:46:43,181 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.11 vs. limit=15.0 2023-10-10 05:46:51,607 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=259373.33333333334, ans=0.2 2023-10-10 05:46:57,192 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=259420.0, ans=0.2 2023-10-10 05:47:01,226 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.31 vs. 
limit=15.0 2023-10-10 05:47:09,529 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 05:47:24,444 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=259560.0, ans=0.125 2023-10-10 05:47:25,187 INFO [train.py:1031] (0/4) Epoch 5, batch 1000, loss[loss=0.2502, simple_loss=0.3245, pruned_loss=0.08796, over 16515.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3197, pruned_loss=0.07854, over 12937369.19 frames. ], batch size: 266, lr: 8.10e-03, grad_scale: 32.0 2023-10-10 05:47:30,786 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=259560.0, ans=0.125 2023-10-10 05:47:32,862 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=259560.0, ans=0.125 2023-10-10 05:47:48,766 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=259653.33333333334, ans=0.0 2023-10-10 05:47:52,646 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.434e+02 1.755e+02 2.072e+02 2.523e+02 4.280e+02, threshold=4.144e+02, percent-clipped=3.0 2023-10-10 05:47:59,245 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=259700.0, ans=0.125 2023-10-10 05:48:13,319 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.17 vs. limit=12.0 2023-10-10 05:49:10,963 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=260026.66666666666, ans=0.0 2023-10-10 05:49:20,871 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=260073.33333333334, ans=0.125 2023-10-10 05:49:37,882 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=260120.0, ans=0.0 2023-10-10 05:49:40,381 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.389e+02 1.785e+02 2.054e+02 2.321e+02 3.377e+02, threshold=4.109e+02, percent-clipped=0.0 2023-10-10 05:49:57,868 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.17 vs. limit=15.0 2023-10-10 05:50:06,498 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=260213.33333333334, ans=0.0 2023-10-10 05:50:06,865 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=10.09 vs. 
limit=15.0 2023-10-10 05:50:10,269 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=260260.0, ans=0.125 2023-10-10 05:50:10,270 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=260260.0, ans=0.0 2023-10-10 05:50:25,674 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=260306.66666666666, ans=0.09899494936611666 2023-10-10 05:50:30,125 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=260306.66666666666, ans=0.125 2023-10-10 05:50:35,585 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=260353.33333333334, ans=10.0 2023-10-10 05:50:52,110 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=260400.0, ans=0.125 2023-10-10 05:50:56,155 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=260400.0, ans=0.1 2023-10-10 05:51:11,501 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=260493.33333333334, ans=0.125 2023-10-10 05:51:17,682 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=260493.33333333334, ans=0.0 2023-10-10 05:51:22,197 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=260540.0, ans=0.125 2023-10-10 05:51:35,219 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=260586.66666666666, ans=0.125 2023-10-10 05:51:37,645 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.584e+02 1.856e+02 2.185e+02 3.806e+02, threshold=3.711e+02, percent-clipped=0.0 2023-10-10 05:51:49,934 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=260633.33333333334, ans=0.1 2023-10-10 05:52:06,786 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=260726.66666666666, ans=0.04949747468305833 2023-10-10 05:52:18,325 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=260773.33333333334, ans=0.0 2023-10-10 05:52:26,749 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=260820.0, ans=0.125 2023-10-10 05:52:28,221 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.72 vs. limit=8.0 2023-10-10 05:52:38,429 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=260866.66666666666, ans=0.125 2023-10-10 05:52:52,563 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.38 vs. limit=22.5 2023-10-10 05:52:53,344 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.45 vs. 
limit=12.0 2023-10-10 05:53:09,054 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=261006.66666666666, ans=0.125 2023-10-10 05:53:10,139 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=261006.66666666666, ans=0.125 2023-10-10 05:53:22,366 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.316e+02 1.722e+02 1.988e+02 2.282e+02 3.021e+02, threshold=3.976e+02, percent-clipped=0.0 2023-10-10 05:53:31,841 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=261100.0, ans=0.125 2023-10-10 05:53:52,157 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.15 vs. limit=22.5 2023-10-10 05:54:03,775 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=261240.0, ans=0.07 2023-10-10 05:54:04,654 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 05:54:08,324 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=261286.66666666666, ans=0.125 2023-10-10 05:54:14,662 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=261286.66666666666, ans=0.125 2023-10-10 05:54:18,403 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.27 vs. limit=15.0 2023-10-10 05:54:18,935 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-56000.pt 2023-10-10 05:54:26,244 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=261333.33333333334, ans=0.2 2023-10-10 05:54:32,852 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=261380.0, ans=0.125 2023-10-10 05:55:02,880 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=261473.33333333334, ans=0.125 2023-10-10 05:55:12,805 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.390e+02 1.774e+02 1.995e+02 2.346e+02 3.424e+02, threshold=3.989e+02, percent-clipped=0.0 2023-10-10 05:55:19,938 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=261566.66666666666, ans=0.2 2023-10-10 05:55:33,448 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=261613.33333333334, ans=0.0 2023-10-10 05:55:40,942 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=261660.0, ans=0.125 2023-10-10 05:55:46,497 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=261660.0, ans=0.1 2023-10-10 05:55:54,763 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.31 vs. limit=15.0 2023-10-10 05:56:00,369 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.68 vs. 
limit=15.0 2023-10-10 05:56:20,495 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-10 05:56:25,756 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=261846.66666666666, ans=0.125 2023-10-10 05:56:33,739 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 05:56:35,878 INFO [train.py:1031] (0/4) Epoch 5, batch 1500, loss[loss=0.2384, simple_loss=0.3152, pruned_loss=0.08076, over 16877.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3173, pruned_loss=0.07701, over 17357176.94 frames. ], batch size: 130, lr: 8.07e-03, grad_scale: 32.0 2023-10-10 05:56:51,468 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=261940.0, ans=0.2 2023-10-10 05:57:07,814 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.297e+02 1.699e+02 1.927e+02 2.323e+02 3.652e+02, threshold=3.854e+02, percent-clipped=0.0 2023-10-10 05:57:12,632 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=262033.33333333334, ans=0.1 2023-10-10 05:57:19,185 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=262033.33333333334, ans=0.0 2023-10-10 05:57:23,032 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=262080.0, ans=0.1 2023-10-10 05:57:49,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=262173.3333333333, ans=0.125 2023-10-10 05:57:55,383 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=262173.3333333333, ans=0.035 2023-10-10 05:58:09,650 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=262266.6666666667, ans=0.125 2023-10-10 05:58:16,165 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.75 vs. limit=12.0 2023-10-10 05:58:16,247 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.51 vs. limit=10.0 2023-10-10 05:58:18,936 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=262266.6666666667, ans=0.5 2023-10-10 05:58:22,306 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=262313.3333333333, ans=0.0 2023-10-10 05:58:22,427 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=262313.3333333333, ans=0.125 2023-10-10 05:58:39,933 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.85 vs. 
limit=15.0 2023-10-10 05:58:52,899 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=262453.3333333333, ans=0.125 2023-10-10 05:59:00,008 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.428e+02 1.842e+02 2.109e+02 2.479e+02 3.330e+02, threshold=4.218e+02, percent-clipped=0.0 2023-10-10 05:59:12,686 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=262500.0, ans=15.0 2023-10-10 05:59:37,636 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=262593.3333333333, ans=0.0 2023-10-10 05:59:38,562 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=262593.3333333333, ans=0.125 2023-10-10 05:59:42,980 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=262593.3333333333, ans=0.125 2023-10-10 05:59:46,102 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.06 vs. limit=15.0 2023-10-10 05:59:58,716 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.11 vs. limit=15.0 2023-10-10 06:00:14,347 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=262733.3333333333, ans=0.07 2023-10-10 06:00:29,375 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=262826.6666666667, ans=0.125 2023-10-10 06:00:33,317 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=262826.6666666667, ans=0.0 2023-10-10 06:00:39,601 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=262873.3333333333, ans=0.0 2023-10-10 06:00:49,737 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=262920.0, ans=0.1 2023-10-10 06:00:56,936 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.494e+02 1.713e+02 1.891e+02 2.172e+02 4.297e+02, threshold=3.782e+02, percent-clipped=0.0 2023-10-10 06:00:57,184 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=262920.0, ans=0.1 2023-10-10 06:01:12,378 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=263013.3333333333, ans=0.2 2023-10-10 06:01:31,338 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=263106.6666666667, ans=0.015 2023-10-10 06:01:33,946 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=263106.6666666667, ans=0.125 2023-10-10 06:02:24,951 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=263293.3333333333, ans=0.95 2023-10-10 06:02:24,980 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, 
batch_count=263293.3333333333, ans=0.05 2023-10-10 06:02:25,273 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.22 vs. limit=15.0 2023-10-10 06:02:25,991 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=263293.3333333333, ans=0.125 2023-10-10 06:02:27,382 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=263293.3333333333, ans=0.125 2023-10-10 06:02:33,184 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.40 vs. limit=22.5 2023-10-10 06:02:35,924 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=263340.0, ans=0.125 2023-10-10 06:02:42,863 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.48 vs. limit=15.0 2023-10-10 06:02:45,677 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=263386.6666666667, ans=0.04949747468305833 2023-10-10 06:02:50,242 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.368e+02 1.686e+02 1.903e+02 2.219e+02 3.598e+02, threshold=3.806e+02, percent-clipped=1.0 2023-10-10 06:02:59,530 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=263433.3333333333, ans=0.5 2023-10-10 06:03:04,553 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=263480.0, ans=0.05 2023-10-10 06:03:15,933 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.89 vs. limit=15.0 2023-10-10 06:03:35,969 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.50 vs. limit=22.5 2023-10-10 06:03:47,432 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.29 vs. limit=22.5 2023-10-10 06:03:54,932 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=263666.6666666667, ans=0.0 2023-10-10 06:03:58,807 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=263713.3333333333, ans=0.125 2023-10-10 06:04:01,581 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.95 vs. limit=22.5 2023-10-10 06:04:06,738 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.54 vs. limit=22.5 2023-10-10 06:04:07,586 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.21 vs. 
limit=22.5 2023-10-10 06:04:28,716 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=263806.6666666667, ans=0.5 2023-10-10 06:04:38,295 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.740e+02 1.965e+02 2.370e+02 3.403e+02, threshold=3.930e+02, percent-clipped=0.0 2023-10-10 06:04:40,631 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=263900.0, ans=0.0 2023-10-10 06:04:42,852 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=263900.0, ans=0.0 2023-10-10 06:05:11,788 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=263993.3333333333, ans=0.2 2023-10-10 06:05:46,088 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=264086.6666666667, ans=0.2 2023-10-10 06:05:59,133 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=264180.0, ans=0.09899494936611666 2023-10-10 06:06:13,161 INFO [train.py:1031] (0/4) Epoch 5, batch 2000, loss[loss=0.2306, simple_loss=0.3201, pruned_loss=0.07055, over 16601.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3177, pruned_loss=0.07698, over 20780085.12 frames. ], batch size: 219, lr: 8.03e-03, grad_scale: 32.0 2023-10-10 06:06:21,830 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=264226.6666666667, ans=0.0 2023-10-10 06:06:26,685 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.37 vs. limit=15.0 2023-10-10 06:06:50,376 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.382e+02 1.794e+02 2.060e+02 2.220e+02 2.983e+02, threshold=4.121e+02, percent-clipped=0.0 2023-10-10 06:06:57,574 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=264366.6666666667, ans=0.125 2023-10-10 06:07:17,540 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=264413.3333333333, ans=0.125 2023-10-10 06:07:17,716 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=264413.3333333333, ans=0.125 2023-10-10 06:07:18,860 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=264460.0, ans=0.1 2023-10-10 06:07:24,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=264460.0, ans=0.0 2023-10-10 06:07:42,352 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=264553.3333333333, ans=0.2 2023-10-10 06:07:43,266 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=264553.3333333333, ans=0.125 2023-10-10 06:07:45,231 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.01 vs. 
limit=15.0 2023-10-10 06:07:46,801 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=264553.3333333333, ans=0.95 2023-10-10 06:07:53,773 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=264600.0, ans=0.1 2023-10-10 06:08:25,854 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.66 vs. limit=22.5 2023-10-10 06:08:33,273 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=264693.3333333333, ans=0.125 2023-10-10 06:08:42,316 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=264740.0, ans=0.0 2023-10-10 06:09:08,127 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.456e+02 1.650e+02 1.892e+02 2.078e+02 2.911e+02, threshold=3.785e+02, percent-clipped=0.0 2023-10-10 06:09:11,761 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=264833.3333333333, ans=0.0 2023-10-10 06:09:37,137 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=264926.6666666667, ans=0.95 2023-10-10 06:09:56,018 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=264973.3333333333, ans=0.5 2023-10-10 06:09:56,890 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=264973.3333333333, ans=0.125 2023-10-10 06:10:01,541 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=265020.0, ans=0.1 2023-10-10 06:10:09,516 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-10-10 06:10:23,861 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=265113.3333333333, ans=0.1 2023-10-10 06:10:34,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=265160.0, ans=0.0 2023-10-10 06:10:36,539 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=265160.0, ans=0.1 2023-10-10 06:10:36,814 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.32 vs. limit=15.0 2023-10-10 06:10:42,789 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.68 vs. limit=22.5 2023-10-10 06:10:44,457 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.61 vs. 
limit=12.0 2023-10-10 06:10:46,926 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=265206.6666666667, ans=0.2 2023-10-10 06:10:54,605 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=265253.3333333333, ans=0.0 2023-10-10 06:10:54,941 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.05 vs. limit=10.0 2023-10-10 06:11:01,398 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.479e+02 1.757e+02 1.978e+02 2.230e+02 3.113e+02, threshold=3.956e+02, percent-clipped=0.0 2023-10-10 06:11:16,078 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=265346.6666666667, ans=0.125 2023-10-10 06:11:16,880 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=265346.6666666667, ans=0.125 2023-10-10 06:11:16,908 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=265346.6666666667, ans=0.0 2023-10-10 06:11:17,898 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=265346.6666666667, ans=0.125 2023-10-10 06:11:28,678 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.16 vs. limit=15.0 2023-10-10 06:11:33,965 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=265393.3333333333, ans=0.125 2023-10-10 06:11:35,856 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=265440.0, ans=0.0 2023-10-10 06:11:56,706 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.86 vs. limit=15.0 2023-10-10 06:11:58,247 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=265533.3333333333, ans=0.0 2023-10-10 06:12:20,337 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=265626.6666666667, ans=0.125 2023-10-10 06:12:20,404 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=265626.6666666667, ans=0.125 2023-10-10 06:12:30,185 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=265673.3333333333, ans=0.125 2023-10-10 06:12:35,850 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.09 vs. 
limit=10.0 2023-10-10 06:12:44,822 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=265720.0, ans=0.1 2023-10-10 06:12:46,416 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.268e+02 1.735e+02 1.886e+02 2.094e+02 2.828e+02, threshold=3.771e+02, percent-clipped=0.0 2023-10-10 06:13:07,374 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=265813.3333333333, ans=0.125 2023-10-10 06:13:11,905 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=265860.0, ans=0.09899494936611666 2023-10-10 06:13:16,347 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=265860.0, ans=0.2 2023-10-10 06:13:36,813 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.95 vs. limit=6.0 2023-10-10 06:13:47,937 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.05 vs. limit=15.0 2023-10-10 06:13:55,869 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.33 vs. limit=15.0 2023-10-10 06:13:56,631 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=266046.6666666667, ans=0.1 2023-10-10 06:14:29,223 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=266186.6666666667, ans=0.0 2023-10-10 06:14:33,967 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.488e+02 1.877e+02 2.196e+02 2.532e+02 3.568e+02, threshold=4.392e+02, percent-clipped=0.0 2023-10-10 06:14:36,563 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten.whitening_limit, batch_count=266233.3333333333, ans=15.0 2023-10-10 06:14:56,488 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.27 vs. limit=15.0 2023-10-10 06:14:57,285 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=266280.0, ans=0.0 2023-10-10 06:15:08,295 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.23 vs. limit=15.0 2023-10-10 06:15:13,124 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=266373.3333333333, ans=0.2 2023-10-10 06:15:20,558 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.96 vs. 
limit=6.0 2023-10-10 06:15:28,231 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=266420.0, ans=0.1 2023-10-10 06:15:45,265 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=266513.3333333333, ans=0.125 2023-10-10 06:15:48,852 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=266513.3333333333, ans=0.2 2023-10-10 06:15:57,169 INFO [train.py:1031] (0/4) Epoch 5, batch 2500, loss[loss=0.2293, simple_loss=0.3101, pruned_loss=0.07424, over 16915.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3175, pruned_loss=0.07698, over 23439663.04 frames. ], batch size: 87, lr: 8.00e-03, grad_scale: 32.0 2023-10-10 06:15:57,727 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.71 vs. limit=6.0 2023-10-10 06:16:20,024 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.05 vs. limit=15.0 2023-10-10 06:16:26,365 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.774e+02 1.929e+02 2.226e+02 3.327e+02, threshold=3.858e+02, percent-clipped=0.0 2023-10-10 06:16:29,290 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=266700.0, ans=0.0 2023-10-10 06:16:49,991 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=266793.3333333333, ans=0.125 2023-10-10 06:17:02,041 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=266840.0, ans=0.125 2023-10-10 06:17:09,077 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=266840.0, ans=0.125 2023-10-10 06:17:14,966 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=266886.6666666667, ans=0.0 2023-10-10 06:17:35,549 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=266980.0, ans=0.125 2023-10-10 06:17:36,580 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=266980.0, ans=0.125 2023-10-10 06:18:05,537 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.31 vs. 
limit=15.0 2023-10-10 06:18:18,381 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.384e+02 1.774e+02 1.934e+02 2.190e+02 3.199e+02, threshold=3.868e+02, percent-clipped=0.0 2023-10-10 06:18:20,286 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=267120.0, ans=0.125 2023-10-10 06:18:29,379 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=267166.6666666667, ans=0.125 2023-10-10 06:18:32,773 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=267213.3333333333, ans=0.07 2023-10-10 06:18:34,029 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=267213.3333333333, ans=0.0 2023-10-10 06:19:41,171 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=2.535e-03 2023-10-10 06:19:57,504 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.24 vs. limit=15.0 2023-10-10 06:20:14,377 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.425e+02 1.747e+02 1.976e+02 2.222e+02 3.594e+02, threshold=3.952e+02, percent-clipped=0.0 2023-10-10 06:20:14,770 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=267586.6666666667, ans=0.0 2023-10-10 06:20:25,196 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.30 vs. limit=15.0 2023-10-10 06:20:25,903 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=267633.3333333333, ans=0.125 2023-10-10 06:20:25,916 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=267633.3333333333, ans=0.2 2023-10-10 06:20:27,617 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=267633.3333333333, ans=0.0 2023-10-10 06:20:36,152 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=267680.0, ans=0.125 2023-10-10 06:20:53,875 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.31 vs. limit=15.0 2023-10-10 06:21:05,735 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.38 vs. 
limit=6.0 2023-10-10 06:21:35,568 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=267913.3333333333, ans=0.0 2023-10-10 06:22:00,295 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=268006.6666666667, ans=0.125 2023-10-10 06:22:16,012 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=268053.3333333333, ans=0.1 2023-10-10 06:22:17,176 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.477e+02 1.749e+02 1.930e+02 2.387e+02 3.973e+02, threshold=3.860e+02, percent-clipped=1.0 2023-10-10 06:22:46,125 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=268146.6666666667, ans=0.0 2023-10-10 06:22:50,274 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=268193.3333333333, ans=0.125 2023-10-10 06:22:53,851 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=268193.3333333333, ans=0.1 2023-10-10 06:23:15,989 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=268286.6666666667, ans=0.2 2023-10-10 06:23:44,634 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=268380.0, ans=0.2 2023-10-10 06:24:12,054 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=268473.3333333333, ans=0.2 2023-10-10 06:24:14,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=268473.3333333333, ans=0.125 2023-10-10 06:24:26,196 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.770e+02 1.953e+02 2.195e+02 3.188e+02, threshold=3.906e+02, percent-clipped=0.0 2023-10-10 06:24:43,803 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=268613.3333333333, ans=0.125 2023-10-10 06:24:58,130 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.07 vs. limit=12.0 2023-10-10 06:25:08,871 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=268706.6666666667, ans=0.125 2023-10-10 06:25:09,357 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.01 vs. limit=15.0 2023-10-10 06:25:45,720 INFO [train.py:1031] (0/4) Epoch 5, batch 3000, loss[loss=0.2704, simple_loss=0.3261, pruned_loss=0.1074, over 15628.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3168, pruned_loss=0.07708, over 25505813.01 frames. 
], batch size: 350, lr: 7.96e-03, grad_scale: 16.0 2023-10-10 06:25:53,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=268893.3333333333, ans=0.0 2023-10-10 06:26:15,797 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.711e+02 1.905e+02 2.184e+02 3.895e+02, threshold=3.809e+02, percent-clipped=0.0 2023-10-10 06:26:25,399 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=269033.3333333333, ans=0.1 2023-10-10 06:26:33,785 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=269080.0, ans=0.0 2023-10-10 06:26:50,024 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=269126.6666666667, ans=0.0 2023-10-10 06:27:03,628 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.39 vs. limit=15.0 2023-10-10 06:27:28,402 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=269313.3333333333, ans=0.125 2023-10-10 06:27:38,479 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=269360.0, ans=0.2 2023-10-10 06:27:51,117 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=269360.0, ans=0.125 2023-10-10 06:28:05,602 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=269406.6666666667, ans=0.125 2023-10-10 06:28:11,650 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.73 vs. limit=6.0 2023-10-10 06:28:11,769 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=5.10 vs. limit=12.0 2023-10-10 06:28:16,304 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.479e+02 1.847e+02 2.173e+02 2.461e+02 3.508e+02, threshold=4.345e+02, percent-clipped=0.0 2023-10-10 06:28:17,460 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=269500.0, ans=0.0 2023-10-10 06:28:23,076 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.68 vs. limit=22.5 2023-10-10 06:28:23,923 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=269500.0, ans=0.1 2023-10-10 06:28:31,790 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.03 vs. 
limit=12.0 2023-10-10 06:28:44,671 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=269593.3333333333, ans=0.035 2023-10-10 06:28:54,802 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=269640.0, ans=0.04949747468305833 2023-10-10 06:28:58,655 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=269640.0, ans=0.125 2023-10-10 06:29:10,038 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=269686.6666666667, ans=0.125 2023-10-10 06:29:25,723 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.13 vs. limit=22.5 2023-10-10 06:29:37,510 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=269826.6666666667, ans=0.125 2023-10-10 06:29:37,839 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.90 vs. limit=15.0 2023-10-10 06:29:40,431 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.09 vs. limit=15.0 2023-10-10 06:29:48,825 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=269873.3333333333, ans=0.125 2023-10-10 06:30:07,997 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.399e+02 1.732e+02 1.995e+02 2.215e+02 3.118e+02, threshold=3.991e+02, percent-clipped=0.0 2023-10-10 06:30:18,752 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=269966.6666666667, ans=0.0 2023-10-10 06:30:21,614 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=269966.6666666667, ans=0.05 2023-10-10 06:30:26,568 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=270013.3333333333, ans=0.0 2023-10-10 06:31:30,212 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=270246.6666666667, ans=0.2 2023-10-10 06:31:54,524 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.61 vs. limit=15.0 2023-10-10 06:32:14,096 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.460e+02 1.890e+02 2.285e+02 2.774e+02 4.730e+02, threshold=4.570e+02, percent-clipped=2.0 2023-10-10 06:32:33,434 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 06:32:57,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=270573.3333333333, ans=0.125 2023-10-10 06:32:59,782 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=270620.0, ans=0.0 2023-10-10 06:33:39,409 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.81 vs. 
limit=15.0 2023-10-10 06:33:46,238 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=270760.0, ans=0.1 2023-10-10 06:33:47,229 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=270760.0, ans=0.07 2023-10-10 06:33:49,811 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=270760.0, ans=0.1 2023-10-10 06:33:49,857 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=270760.0, ans=0.125 2023-10-10 06:33:55,579 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=270806.6666666667, ans=0.125 2023-10-10 06:34:12,372 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.747e+02 1.945e+02 2.257e+02 3.070e+02, threshold=3.889e+02, percent-clipped=0.0 2023-10-10 06:34:15,110 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.41 vs. limit=10.0 2023-10-10 06:34:24,301 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.44 vs. limit=10.0 2023-10-10 06:34:45,426 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=270993.3333333333, ans=0.2 2023-10-10 06:35:03,775 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=271086.6666666667, ans=0.2 2023-10-10 06:35:23,057 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=271180.0, ans=0.125 2023-10-10 06:35:24,321 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=271180.0, ans=0.125 2023-10-10 06:35:26,249 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.51 vs. limit=10.0 2023-10-10 06:35:28,996 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=271180.0, ans=0.0 2023-10-10 06:35:34,944 INFO [train.py:1031] (0/4) Epoch 5, batch 3500, loss[loss=0.2359, simple_loss=0.3188, pruned_loss=0.07651, over 16851.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3168, pruned_loss=0.07707, over 27160348.12 frames. ], batch size: 146, lr: 7.93e-03, grad_scale: 32.0 2023-10-10 06:35:35,127 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=271226.6666666667, ans=0.0 2023-10-10 06:35:40,456 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=271226.6666666667, ans=0.0 2023-10-10 06:35:46,894 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=271273.3333333333, ans=0.125 2023-10-10 06:36:02,369 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.75 vs. 
limit=22.5 2023-10-10 06:36:05,622 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.458e+02 1.690e+02 1.933e+02 2.157e+02 3.510e+02, threshold=3.865e+02, percent-clipped=0.0 2023-10-10 06:36:19,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=271413.3333333333, ans=0.125 2023-10-10 06:36:31,721 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 06:36:37,494 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=271460.0, ans=0.125 2023-10-10 06:37:12,451 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=271553.3333333333, ans=0.2 2023-10-10 06:37:17,507 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=271600.0, ans=0.0 2023-10-10 06:37:23,877 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.00 vs. limit=15.0 2023-10-10 06:37:47,057 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=271693.3333333333, ans=0.0 2023-10-10 06:37:51,203 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=271740.0, ans=0.1 2023-10-10 06:38:08,343 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=271786.6666666667, ans=0.0 2023-10-10 06:38:09,957 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 1.716e+02 1.904e+02 2.245e+02 2.766e+02, threshold=3.809e+02, percent-clipped=0.0 2023-10-10 06:38:37,491 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=271926.6666666667, ans=0.1 2023-10-10 06:38:47,139 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=271926.6666666667, ans=0.2 2023-10-10 06:38:50,199 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=271973.3333333333, ans=0.0 2023-10-10 06:39:08,160 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=272020.0, ans=0.125 2023-10-10 06:39:14,916 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.66 vs. 
limit=6.0 2023-10-10 06:39:44,055 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=272206.6666666667, ans=0.1 2023-10-10 06:40:01,521 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=272253.3333333333, ans=0.2 2023-10-10 06:40:07,818 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.354e+02 1.870e+02 2.163e+02 2.587e+02 4.217e+02, threshold=4.327e+02, percent-clipped=1.0 2023-10-10 06:40:12,617 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=272300.0, ans=0.125 2023-10-10 06:40:47,623 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.65 vs. limit=15.0 2023-10-10 06:40:54,479 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=272440.0, ans=0.125 2023-10-10 06:40:59,426 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=272486.6666666667, ans=0.1 2023-10-10 06:41:03,182 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=272486.6666666667, ans=0.1 2023-10-10 06:41:04,352 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=272486.6666666667, ans=0.1 2023-10-10 06:41:06,420 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=272486.6666666667, ans=0.1 2023-10-10 06:41:36,136 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=272626.6666666667, ans=0.2 2023-10-10 06:41:36,540 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.16 vs. limit=15.0 2023-10-10 06:41:42,552 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=272626.6666666667, ans=0.125 2023-10-10 06:41:48,459 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.86 vs. 
limit=10.0 2023-10-10 06:42:10,138 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.545e+02 1.775e+02 2.000e+02 2.230e+02 3.104e+02, threshold=4.000e+02, percent-clipped=0.0 2023-10-10 06:42:39,937 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=272860.0, ans=0.125 2023-10-10 06:42:43,316 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=272906.6666666667, ans=0.0 2023-10-10 06:42:53,451 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=272953.3333333333, ans=0.125 2023-10-10 06:43:06,175 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=273000.0, ans=0.09899494936611666 2023-10-10 06:43:07,810 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=273000.0, ans=0.0 2023-10-10 06:43:16,385 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=273046.6666666667, ans=0.125 2023-10-10 06:43:50,741 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=273186.6666666667, ans=0.125 2023-10-10 06:43:59,401 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.10 vs. limit=15.0 2023-10-10 06:43:59,571 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.467e+02 1.689e+02 1.990e+02 2.315e+02 4.038e+02, threshold=3.980e+02, percent-clipped=1.0 2023-10-10 06:44:14,764 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=273280.0, ans=0.0 2023-10-10 06:44:15,691 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=273280.0, ans=0.125 2023-10-10 06:44:18,812 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=273280.0, ans=0.1 2023-10-10 06:44:26,218 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=273326.6666666667, ans=0.0 2023-10-10 06:44:53,477 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=273420.0, ans=0.125 2023-10-10 06:45:20,904 INFO [train.py:1031] (0/4) Epoch 5, batch 4000, loss[loss=0.2381, simple_loss=0.2844, pruned_loss=0.09587, over 12631.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3162, pruned_loss=0.07688, over 28424914.70 frames. ], batch size: 440, lr: 7.89e-03, grad_scale: 32.0 2023-10-10 06:45:30,107 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.06 vs. limit=15.0 2023-10-10 06:45:43,263 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.15 vs. 
limit=15.0 2023-10-10 06:45:48,844 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=273653.3333333333, ans=0.1 2023-10-10 06:45:51,051 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=273653.3333333333, ans=0.125 2023-10-10 06:45:56,277 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 1.819e+02 2.014e+02 2.211e+02 2.956e+02, threshold=4.027e+02, percent-clipped=0.0 2023-10-10 06:46:03,392 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=273700.0, ans=0.0 2023-10-10 06:46:10,724 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.33 vs. limit=22.5 2023-10-10 06:46:18,915 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=273793.3333333333, ans=0.125 2023-10-10 06:46:27,409 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=273793.3333333333, ans=0.125 2023-10-10 06:46:34,047 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=273840.0, ans=0.2 2023-10-10 06:46:55,858 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=273933.3333333333, ans=0.2 2023-10-10 06:47:04,969 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=273980.0, ans=0.025 2023-10-10 06:47:32,877 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.15 vs. 
limit=15.0 2023-10-10 06:47:35,741 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=274073.3333333333, ans=0.1 2023-10-10 06:47:39,459 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=274120.0, ans=0.0 2023-10-10 06:47:46,071 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=274120.0, ans=0.125 2023-10-10 06:47:50,044 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.474e+02 1.864e+02 2.132e+02 2.632e+02 3.693e+02, threshold=4.263e+02, percent-clipped=0.0 2023-10-10 06:47:50,534 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=274120.0, ans=0.0 2023-10-10 06:48:14,224 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=274213.3333333333, ans=0.125 2023-10-10 06:48:19,241 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=274260.0, ans=0.0 2023-10-10 06:48:31,960 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=274306.6666666667, ans=0.125 2023-10-10 06:48:49,890 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=274353.3333333333, ans=0.125 2023-10-10 06:48:59,422 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.94 vs. limit=15.0 2023-10-10 06:49:26,835 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.09 vs. limit=10.0 2023-10-10 06:49:40,613 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=274540.0, ans=0.2 2023-10-10 06:49:52,409 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=274586.6666666667, ans=0.0 2023-10-10 06:50:00,837 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.797e+02 2.056e+02 2.431e+02 3.962e+02, threshold=4.112e+02, percent-clipped=0.0 2023-10-10 06:50:15,778 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.61 vs. limit=15.0 2023-10-10 06:50:23,516 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.99 vs. limit=8.0 2023-10-10 06:50:45,274 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.36 vs. limit=15.0 2023-10-10 06:50:47,979 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=274820.0, ans=0.125 2023-10-10 06:50:56,732 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=274866.6666666667, ans=0.1 2023-10-10 06:51:15,889 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.41 vs. 
limit=15.0 2023-10-10 06:51:17,345 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=274913.3333333333, ans=0.0 2023-10-10 06:51:23,938 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=274960.0, ans=0.125 2023-10-10 06:51:45,005 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.79 vs. limit=6.0 2023-10-10 06:51:51,696 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=275053.3333333333, ans=0.2 2023-10-10 06:51:53,458 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.325e+02 1.805e+02 2.049e+02 2.323e+02 3.623e+02, threshold=4.099e+02, percent-clipped=0.0 2023-10-10 06:51:57,622 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=275100.0, ans=0.2 2023-10-10 06:52:01,443 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0 2023-10-10 06:52:05,421 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=275100.0, ans=0.0 2023-10-10 06:52:09,144 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=275146.6666666667, ans=0.07 2023-10-10 06:52:16,037 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=275146.6666666667, ans=0.0 2023-10-10 06:52:16,051 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=275146.6666666667, ans=0.125 2023-10-10 06:52:18,152 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=275193.3333333333, ans=0.125 2023-10-10 06:52:19,989 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=275193.3333333333, ans=0.125 2023-10-10 06:52:25,970 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.85 vs. limit=6.0 2023-10-10 06:52:48,364 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=275286.6666666667, ans=0.125 2023-10-10 06:52:58,790 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=275333.3333333333, ans=0.125 2023-10-10 06:53:08,794 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=275380.0, ans=0.07 2023-10-10 06:53:27,928 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=275473.3333333333, ans=0.125 2023-10-10 06:53:29,381 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.60 vs. 
limit=15.0 2023-10-10 06:53:48,016 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.501e+02 1.858e+02 2.069e+02 2.331e+02 3.898e+02, threshold=4.139e+02, percent-clipped=0.0 2023-10-10 06:53:51,572 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.65 vs. limit=22.5 2023-10-10 06:54:18,918 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=275613.3333333333, ans=0.125 2023-10-10 06:54:19,110 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.16 vs. limit=15.0 2023-10-10 06:54:20,093 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=275660.0, ans=0.1 2023-10-10 06:55:03,968 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=275800.0, ans=0.05 2023-10-10 06:55:16,535 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=275846.6666666667, ans=0.125 2023-10-10 06:55:19,330 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=275846.6666666667, ans=0.2 2023-10-10 06:55:21,074 INFO [train.py:1031] (0/4) Epoch 5, batch 4500, loss[loss=0.2154, simple_loss=0.3113, pruned_loss=0.05978, over 16927.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3161, pruned_loss=0.07657, over 29366966.03 frames. ], batch size: 104, lr: 7.86e-03, grad_scale: 32.0 2023-10-10 06:55:43,903 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=275986.6666666667, ans=0.125 2023-10-10 06:55:44,015 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=275986.6666666667, ans=0.1 2023-10-10 06:55:53,699 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.306e+02 1.732e+02 1.874e+02 2.074e+02 3.499e+02, threshold=3.748e+02, percent-clipped=0.0 2023-10-10 06:55:54,830 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=276033.3333333333, ans=0.1 2023-10-10 06:55:56,798 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=276033.3333333333, ans=0.0 2023-10-10 06:55:57,736 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=276033.3333333333, ans=0.0 2023-10-10 06:56:16,606 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=276126.6666666667, ans=0.2 2023-10-10 06:56:41,209 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.76 vs. limit=5.0 2023-10-10 06:56:51,731 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.43 vs. 
limit=15.0 2023-10-10 06:57:06,133 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=276313.3333333333, ans=0.0 2023-10-10 06:57:08,063 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=276313.3333333333, ans=0.1 2023-10-10 06:57:16,538 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=276360.0, ans=0.0 2023-10-10 06:57:20,968 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=276406.6666666667, ans=0.2 2023-10-10 06:57:29,200 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=276406.6666666667, ans=0.0 2023-10-10 06:57:40,362 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.396e+02 1.729e+02 1.852e+02 2.035e+02 2.827e+02, threshold=3.703e+02, percent-clipped=0.0 2023-10-10 06:57:42,942 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=276500.0, ans=0.0 2023-10-10 06:57:45,134 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.93 vs. limit=15.0 2023-10-10 06:57:52,579 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=276546.6666666667, ans=0.0 2023-10-10 06:57:53,473 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=276546.6666666667, ans=0.07 2023-10-10 06:57:57,741 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.02 vs. limit=15.0 2023-10-10 06:58:12,497 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=276593.3333333333, ans=0.125 2023-10-10 06:58:19,844 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.68 vs. limit=15.0 2023-10-10 06:58:33,727 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=276686.6666666667, ans=0.0 2023-10-10 06:58:52,576 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=276780.0, ans=0.05 2023-10-10 06:58:58,563 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=276780.0, ans=0.2 2023-10-10 06:59:14,543 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=276873.3333333333, ans=0.2 2023-10-10 06:59:22,457 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.13 vs. 
limit=15.0 2023-10-10 06:59:28,438 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=276920.0, ans=0.1 2023-10-10 06:59:28,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=276920.0, ans=0.125 2023-10-10 06:59:29,216 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.775e+02 1.948e+02 2.176e+02 3.021e+02, threshold=3.896e+02, percent-clipped=0.0 2023-10-10 07:00:00,659 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=277060.0, ans=0.125 2023-10-10 07:00:11,366 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=7.605e-03 2023-10-10 07:00:26,398 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.64 vs. limit=22.5 2023-10-10 07:00:35,736 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.86 vs. limit=15.0 2023-10-10 07:00:38,439 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.25 vs. limit=15.0 2023-10-10 07:01:04,361 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=277340.0, ans=0.125 2023-10-10 07:01:12,488 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.46 vs. limit=6.0 2023-10-10 07:01:19,635 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.507e+02 1.781e+02 2.030e+02 2.372e+02 3.639e+02, threshold=4.060e+02, percent-clipped=0.0 2023-10-10 07:02:24,259 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=277666.6666666667, ans=0.125 2023-10-10 07:02:36,242 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=277760.0, ans=0.125 2023-10-10 07:02:45,809 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=277760.0, ans=0.07 2023-10-10 07:02:51,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=277806.6666666667, ans=0.1 2023-10-10 07:02:53,358 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=277806.6666666667, ans=0.125 2023-10-10 07:03:00,262 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=277806.6666666667, ans=0.05 2023-10-10 07:03:05,884 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=277853.3333333333, ans=0.125 2023-10-10 07:03:06,895 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=277853.3333333333, ans=0.1 2023-10-10 07:03:10,418 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.421e+02 1.863e+02 2.076e+02 2.543e+02 3.586e+02, threshold=4.152e+02, percent-clipped=0.0 2023-10-10 07:03:16,751 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=277900.0, ans=0.07 2023-10-10 07:03:19,114 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=277900.0, ans=0.5 2023-10-10 07:03:32,017 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.18 vs. limit=15.0 2023-10-10 07:03:39,452 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.39 vs. limit=15.0 2023-10-10 07:03:51,206 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=278040.0, ans=0.125 2023-10-10 07:03:57,925 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.94 vs. limit=15.0 2023-10-10 07:04:00,192 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=278040.0, ans=0.125 2023-10-10 07:04:07,996 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=278086.6666666667, ans=0.0 2023-10-10 07:04:10,505 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=278086.6666666667, ans=0.0 2023-10-10 07:04:35,614 INFO [train.py:1031] (0/4) Epoch 5, batch 5000, loss[loss=0.2728, simple_loss=0.3298, pruned_loss=0.108, over 15608.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3158, pruned_loss=0.0766, over 30117013.76 frames. ], batch size: 350, lr: 7.83e-03, grad_scale: 64.0 2023-10-10 07:05:07,560 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.790e+02 1.994e+02 2.169e+02 2.825e+02, threshold=3.987e+02, percent-clipped=0.0 2023-10-10 07:05:12,321 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=278366.6666666667, ans=0.1 2023-10-10 07:05:16,309 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.88 vs. limit=15.0 2023-10-10 07:05:34,940 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.43 vs. limit=22.5 2023-10-10 07:05:41,887 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.03 vs. 
limit=10.0 2023-10-10 07:06:18,095 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=278646.6666666667, ans=0.1 2023-10-10 07:06:21,817 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=278646.6666666667, ans=0.125 2023-10-10 07:06:25,412 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=278646.6666666667, ans=0.2 2023-10-10 07:06:57,811 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=278786.6666666667, ans=0.125 2023-10-10 07:07:00,919 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.502e+02 1.803e+02 2.071e+02 2.479e+02 3.574e+02, threshold=4.143e+02, percent-clipped=0.0 2023-10-10 07:07:15,340 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.61 vs. limit=15.0 2023-10-10 07:07:19,686 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.55 vs. limit=15.0 2023-10-10 07:07:32,396 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=278926.6666666667, ans=0.2 2023-10-10 07:07:33,248 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=278926.6666666667, ans=0.125 2023-10-10 07:07:34,170 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=278973.3333333333, ans=0.0 2023-10-10 07:07:42,374 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=278973.3333333333, ans=0.0 2023-10-10 07:07:59,640 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=279066.6666666667, ans=0.0 2023-10-10 07:08:03,874 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=279066.6666666667, ans=0.125 2023-10-10 07:08:09,601 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=279113.3333333333, ans=0.125 2023-10-10 07:08:22,137 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=279160.0, ans=0.125 2023-10-10 07:08:28,335 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 07:08:37,348 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=279206.6666666667, ans=0.125 2023-10-10 07:08:41,372 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=279253.3333333333, ans=0.0 2023-10-10 07:08:46,593 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=279253.3333333333, ans=10.0 2023-10-10 07:08:48,055 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.360e+02 1.727e+02 1.897e+02 2.147e+02 3.329e+02, threshold=3.795e+02, percent-clipped=0.0 2023-10-10 
07:09:06,179 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=279346.6666666667, ans=0.0 2023-10-10 07:09:42,836 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.45 vs. limit=22.5 2023-10-10 07:09:47,248 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 07:09:50,732 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=279533.3333333333, ans=0.125 2023-10-10 07:10:03,868 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=279580.0, ans=0.125 2023-10-10 07:10:14,039 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=279626.6666666667, ans=0.1 2023-10-10 07:10:15,015 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=279626.6666666667, ans=0.1 2023-10-10 07:10:16,745 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.61 vs. limit=6.0 2023-10-10 07:10:24,620 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.61 vs. limit=6.0 2023-10-10 07:10:27,131 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=279673.3333333333, ans=0.1 2023-10-10 07:10:39,209 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=279720.0, ans=0.05 2023-10-10 07:10:45,381 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.398e+02 1.694e+02 1.853e+02 2.096e+02 2.977e+02, threshold=3.706e+02, percent-clipped=0.0 2023-10-10 07:10:49,340 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.11 vs. limit=15.0 2023-10-10 07:10:53,863 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=279766.6666666667, ans=0.2 2023-10-10 07:10:54,892 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=279766.6666666667, ans=0.2 2023-10-10 07:11:04,196 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=279813.3333333333, ans=0.1 2023-10-10 07:11:11,549 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=279860.0, ans=0.0 2023-10-10 07:11:32,363 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=279953.3333333333, ans=0.125 2023-10-10 07:11:36,438 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.09 vs. 
limit=15.0 2023-10-10 07:11:40,286 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 07:11:44,351 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=280000.0, ans=0.2 2023-10-10 07:11:50,681 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=280000.0, ans=0.125 2023-10-10 07:12:02,209 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=280046.6666666667, ans=0.05 2023-10-10 07:12:14,052 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.48 vs. limit=15.0 2023-10-10 07:12:16,689 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=280140.0, ans=0.125 2023-10-10 07:12:22,838 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=280140.0, ans=0.125 2023-10-10 07:12:27,930 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=280186.6666666667, ans=0.125 2023-10-10 07:12:36,798 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.405e+02 1.702e+02 1.883e+02 2.175e+02 4.677e+02, threshold=3.765e+02, percent-clipped=2.0 2023-10-10 07:13:07,600 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=280326.6666666667, ans=0.125 2023-10-10 07:13:09,443 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=280373.3333333333, ans=0.1 2023-10-10 07:13:27,799 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.86 vs. limit=22.5 2023-10-10 07:13:35,292 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=280466.6666666667, ans=0.125 2023-10-10 07:13:37,357 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.77 vs. limit=15.0 2023-10-10 07:13:51,515 INFO [train.py:1031] (0/4) Epoch 5, batch 5500, loss[loss=0.1985, simple_loss=0.2936, pruned_loss=0.05163, over 16847.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3154, pruned_loss=0.07625, over 30712989.57 frames. ], batch size: 98, lr: 7.80e-03, grad_scale: 32.0 2023-10-10 07:13:57,906 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=280560.0, ans=0.1 2023-10-10 07:13:57,933 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=280560.0, ans=0.125 2023-10-10 07:14:07,662 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.38 vs. 
limit=15.0 2023-10-10 07:14:21,538 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.771e+02 1.983e+02 2.415e+02 3.797e+02, threshold=3.966e+02, percent-clipped=1.0 2023-10-10 07:14:46,019 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.28 vs. limit=22.5 2023-10-10 07:14:59,529 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=280840.0, ans=0.125 2023-10-10 07:15:11,070 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=280886.6666666667, ans=0.1 2023-10-10 07:15:34,932 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=280980.0, ans=0.125 2023-10-10 07:15:44,561 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.18 vs. limit=15.0 2023-10-10 07:16:08,486 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.357e+02 1.755e+02 1.958e+02 2.183e+02 3.078e+02, threshold=3.917e+02, percent-clipped=0.0 2023-10-10 07:16:19,298 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.39 vs. limit=10.0 2023-10-10 07:16:24,213 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=281213.3333333333, ans=0.125 2023-10-10 07:16:33,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=281260.0, ans=0.2 2023-10-10 07:16:40,834 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=281260.0, ans=0.1 2023-10-10 07:16:42,516 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=281306.6666666667, ans=0.125 2023-10-10 07:16:47,343 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=281306.6666666667, ans=0.1 2023-10-10 07:16:55,934 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=281353.3333333333, ans=0.125 2023-10-10 07:16:58,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=281353.3333333333, ans=0.1 2023-10-10 07:17:22,687 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 07:17:23,710 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=281446.6666666667, ans=0.0 2023-10-10 07:17:24,146 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.15 vs. limit=12.0 2023-10-10 07:17:29,968 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.20 vs. limit=15.0 2023-10-10 07:17:30,192 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.45 vs. 
limit=15.0 2023-10-10 07:17:35,620 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.35 vs. limit=22.5 2023-10-10 07:17:37,945 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.96 vs. limit=15.0 2023-10-10 07:17:53,472 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=281586.6666666667, ans=0.0 2023-10-10 07:18:01,085 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.404e+02 1.735e+02 1.947e+02 2.338e+02 3.500e+02, threshold=3.895e+02, percent-clipped=0.0 2023-10-10 07:18:09,637 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.84 vs. limit=15.0 2023-10-10 07:18:13,882 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=281680.0, ans=0.0 2023-10-10 07:18:20,521 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=281680.0, ans=0.125 2023-10-10 07:18:29,477 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=281726.6666666667, ans=0.0 2023-10-10 07:18:31,147 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=281726.6666666667, ans=0.1 2023-10-10 07:18:44,020 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=281773.3333333333, ans=0.125 2023-10-10 07:19:18,519 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.48 vs. limit=22.5 2023-10-10 07:19:19,755 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=281913.3333333333, ans=0.2 2023-10-10 07:19:25,986 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=281960.0, ans=0.2 2023-10-10 07:19:55,752 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.844e+02 2.046e+02 2.399e+02 3.521e+02, threshold=4.092e+02, percent-clipped=0.0 2023-10-10 07:20:06,055 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=1.97 vs. limit=15.0 2023-10-10 07:20:09,519 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=282146.6666666667, ans=0.1 2023-10-10 07:20:09,855 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.05 vs. limit=22.5 2023-10-10 07:20:27,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=282193.3333333333, ans=0.0 2023-10-10 07:20:45,960 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.48 vs. 
limit=15.0 2023-10-10 07:20:49,142 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=282286.6666666667, ans=0.2 2023-10-10 07:20:51,195 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.55 vs. limit=22.5 2023-10-10 07:21:06,455 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=282380.0, ans=0.0 2023-10-10 07:21:13,635 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=282380.0, ans=0.1 2023-10-10 07:21:20,303 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=282426.6666666667, ans=0.125 2023-10-10 07:21:40,206 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=282520.0, ans=0.1 2023-10-10 07:21:49,772 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.741e+02 1.906e+02 2.171e+02 3.169e+02, threshold=3.812e+02, percent-clipped=0.0 2023-10-10 07:21:50,420 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=282566.6666666667, ans=0.95 2023-10-10 07:21:55,994 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=282566.6666666667, ans=0.2 2023-10-10 07:21:56,021 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=282566.6666666667, ans=0.07 2023-10-10 07:21:58,059 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=282566.6666666667, ans=0.1 2023-10-10 07:22:39,310 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 07:22:40,741 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.01 vs. limit=22.5 2023-10-10 07:22:40,778 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.18 vs. limit=15.0 2023-10-10 07:22:44,573 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=282800.0, ans=0.125 2023-10-10 07:22:59,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=282846.6666666667, ans=0.125 2023-10-10 07:23:06,548 INFO [train.py:1031] (0/4) Epoch 5, batch 6000, loss[loss=0.2304, simple_loss=0.3105, pruned_loss=0.07511, over 16888.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3153, pruned_loss=0.07618, over 31153009.29 frames. 
], batch size: 146, lr: 7.77e-03, grad_scale: 32.0 2023-10-10 07:23:19,481 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=282940.0, ans=0.125 2023-10-10 07:23:28,269 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=282986.6666666667, ans=0.1 2023-10-10 07:23:30,297 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.76 vs. limit=10.0 2023-10-10 07:23:37,692 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=282986.6666666667, ans=0.125 2023-10-10 07:23:39,740 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.421e+02 1.785e+02 1.973e+02 2.403e+02 3.873e+02, threshold=3.946e+02, percent-clipped=1.0 2023-10-10 07:23:48,273 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=283033.3333333333, ans=0.1 2023-10-10 07:23:49,548 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=283080.0, ans=0.125 2023-10-10 07:23:56,180 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=283080.0, ans=0.0 2023-10-10 07:24:03,249 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=283126.6666666667, ans=0.1 2023-10-10 07:24:19,583 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=283173.3333333333, ans=0.04949747468305833 2023-10-10 07:24:38,367 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=283266.6666666667, ans=0.125 2023-10-10 07:24:40,296 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=283266.6666666667, ans=0.1 2023-10-10 07:25:00,371 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=283360.0, ans=0.125 2023-10-10 07:25:28,991 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.372e+02 1.722e+02 1.896e+02 2.244e+02 3.122e+02, threshold=3.792e+02, percent-clipped=0.0 2023-10-10 07:25:30,989 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.93 vs. limit=15.0 2023-10-10 07:25:32,971 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=283500.0, ans=0.125 2023-10-10 07:25:36,043 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=283500.0, ans=0.125 2023-10-10 07:25:43,774 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.96 vs. 
limit=22.5 2023-10-10 07:25:47,349 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=283546.6666666667, ans=0.125 2023-10-10 07:25:49,495 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=283546.6666666667, ans=0.125 2023-10-10 07:25:57,859 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=283593.3333333333, ans=0.125 2023-10-10 07:25:58,081 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.34 vs. limit=15.0 2023-10-10 07:26:02,690 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.13 vs. limit=15.0 2023-10-10 07:26:07,245 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=283640.0, ans=0.125 2023-10-10 07:26:14,538 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=283686.6666666667, ans=0.125 2023-10-10 07:26:17,948 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=283686.6666666667, ans=0.2 2023-10-10 07:26:54,999 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.40 vs. limit=15.0 2023-10-10 07:27:03,979 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.38 vs. limit=12.0 2023-10-10 07:27:04,824 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 07:27:19,250 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.812e+02 2.050e+02 2.352e+02 3.885e+02, threshold=4.099e+02, percent-clipped=1.0 2023-10-10 07:27:23,970 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.90 vs. limit=15.0 2023-10-10 07:28:18,133 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=284200.0, ans=0.0 2023-10-10 07:28:19,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=284200.0, ans=0.0 2023-10-10 07:28:20,034 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 07:28:21,983 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.69 vs. limit=6.0 2023-10-10 07:28:39,891 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=284293.3333333333, ans=0.0 2023-10-10 07:28:46,778 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.30 vs. 
limit=15.0 2023-10-10 07:28:51,641 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=284340.0, ans=0.125 2023-10-10 07:28:52,667 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=284340.0, ans=0.125 2023-10-10 07:29:00,687 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=284386.6666666667, ans=0.125 2023-10-10 07:29:02,998 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=284386.6666666667, ans=0.1 2023-10-10 07:29:07,187 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.515e+02 1.849e+02 2.039e+02 2.474e+02 4.305e+02, threshold=4.078e+02, percent-clipped=1.0 2023-10-10 07:29:11,468 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=284433.3333333333, ans=0.125 2023-10-10 07:29:16,893 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.49 vs. limit=22.5 2023-10-10 07:29:19,211 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=284480.0, ans=0.125 2023-10-10 07:29:20,241 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=284480.0, ans=0.125 2023-10-10 07:29:36,650 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=284526.6666666667, ans=0.125 2023-10-10 07:29:42,795 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=284573.3333333333, ans=0.125 2023-10-10 07:29:46,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=284573.3333333333, ans=0.0 2023-10-10 07:30:05,913 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=284620.0, ans=0.125 2023-10-10 07:30:35,830 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=284760.0, ans=0.07 2023-10-10 07:31:05,037 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.284e+02 1.687e+02 1.799e+02 1.991e+02 2.851e+02, threshold=3.597e+02, percent-clipped=0.0 2023-10-10 07:31:05,285 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=284900.0, ans=0.125 2023-10-10 07:31:13,744 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=284946.6666666667, ans=0.2 2023-10-10 07:31:17,120 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=284946.6666666667, ans=0.09899494936611666 2023-10-10 07:31:23,689 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=284946.6666666667, ans=0.035 2023-10-10 07:31:46,233 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=285086.6666666667, ans=0.1 2023-10-10 07:32:01,671 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=285133.3333333333, ans=0.125 2023-10-10 07:32:07,709 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.17 vs. limit=15.0 2023-10-10 07:32:08,734 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.85 vs. limit=10.0 2023-10-10 07:32:22,272 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=285226.6666666667, ans=0.0 2023-10-10 07:32:22,917 INFO [train.py:1031] (0/4) Epoch 5, batch 6500, loss[loss=0.238, simple_loss=0.327, pruned_loss=0.07447, over 16963.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3157, pruned_loss=0.0764, over 31501546.12 frames. ], batch size: 93, lr: 7.73e-03, grad_scale: 32.0 2023-10-10 07:32:30,591 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=285226.6666666667, ans=0.0 2023-10-10 07:32:31,787 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=285226.6666666667, ans=0.0 2023-10-10 07:32:35,844 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.11 vs. limit=10.0 2023-10-10 07:32:37,108 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.91 vs. limit=15.0 2023-10-10 07:32:50,887 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=285320.0, ans=0.0 2023-10-10 07:32:52,748 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=285320.0, ans=0.035 2023-10-10 07:33:01,448 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.472e+02 1.830e+02 2.204e+02 2.593e+02 4.019e+02, threshold=4.408e+02, percent-clipped=5.0 2023-10-10 07:33:12,994 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=285366.6666666667, ans=0.1 2023-10-10 07:33:33,897 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.53 vs. limit=15.0 2023-10-10 07:34:24,011 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=285693.3333333333, ans=0.0 2023-10-10 07:34:28,784 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.84 vs. limit=12.0 2023-10-10 07:34:53,977 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.349e+02 1.660e+02 1.921e+02 2.185e+02 4.191e+02, threshold=3.841e+02, percent-clipped=0.0 2023-10-10 07:35:01,922 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.58 vs. 
limit=15.0 2023-10-10 07:35:20,634 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=285926.6666666667, ans=0.1 2023-10-10 07:35:26,487 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=285973.3333333333, ans=0.125 2023-10-10 07:35:41,686 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=286020.0, ans=0.05 2023-10-10 07:35:58,297 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=286113.3333333333, ans=0.125 2023-10-10 07:36:12,806 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=286160.0, ans=0.125 2023-10-10 07:36:24,142 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=286206.6666666667, ans=0.125 2023-10-10 07:36:33,451 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=286253.3333333333, ans=0.0 2023-10-10 07:36:39,781 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.724e+02 1.910e+02 2.174e+02 3.138e+02, threshold=3.820e+02, percent-clipped=0.0 2023-10-10 07:36:48,538 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.17 vs. limit=15.0 2023-10-10 07:37:06,188 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=286393.3333333333, ans=0.0 2023-10-10 07:37:07,734 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=286393.3333333333, ans=0.2 2023-10-10 07:37:16,763 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.84 vs. limit=15.0 2023-10-10 07:37:36,080 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=286533.3333333333, ans=0.05 2023-10-10 07:37:36,810 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=286533.3333333333, ans=0.1 2023-10-10 07:37:53,467 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.84 vs. 
limit=22.5 2023-10-10 07:38:02,900 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=286626.6666666667, ans=0.125 2023-10-10 07:38:04,571 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=286626.6666666667, ans=0.5 2023-10-10 07:38:13,684 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=286626.6666666667, ans=0.0 2023-10-10 07:38:46,420 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.294e+02 1.794e+02 2.069e+02 2.480e+02 4.665e+02, threshold=4.138e+02, percent-clipped=2.0 2023-10-10 07:38:51,517 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=286766.6666666667, ans=0.125 2023-10-10 07:39:00,221 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.78 vs. limit=15.0 2023-10-10 07:39:03,508 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=286813.3333333333, ans=0.125 2023-10-10 07:39:24,436 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=286906.6666666667, ans=0.1 2023-10-10 07:39:34,974 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=286953.3333333333, ans=0.1 2023-10-10 07:39:44,490 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=287000.0, ans=0.125 2023-10-10 07:40:01,268 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.64 vs. limit=15.0 2023-10-10 07:40:02,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=287046.6666666667, ans=0.0 2023-10-10 07:40:14,893 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=287093.3333333333, ans=0.125 2023-10-10 07:40:30,306 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.38 vs. limit=6.0 2023-10-10 07:40:37,647 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.397e+02 1.752e+02 2.177e+02 2.583e+02 3.713e+02, threshold=4.354e+02, percent-clipped=0.0 2023-10-10 07:40:43,227 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=287233.3333333333, ans=0.0 2023-10-10 07:40:45,669 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=287233.3333333333, ans=0.125 2023-10-10 07:40:48,099 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=287280.0, ans=0.125 2023-10-10 07:40:50,275 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.33 vs. limit=10.0 2023-10-10 07:40:51,199 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.41 vs. 
limit=15.0 2023-10-10 07:41:10,257 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.51 vs. limit=22.5 2023-10-10 07:41:11,849 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=287373.3333333333, ans=0.125 2023-10-10 07:41:17,651 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 07:41:20,196 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=287420.0, ans=0.0 2023-10-10 07:41:30,656 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.02 vs. limit=15.0 2023-10-10 07:41:43,570 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.21 vs. limit=15.0 2023-10-10 07:41:51,257 INFO [train.py:1031] (0/4) Epoch 5, batch 7000, loss[loss=0.233, simple_loss=0.3175, pruned_loss=0.07423, over 16791.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3159, pruned_loss=0.07621, over 31789171.05 frames. ], batch size: 175, lr: 7.70e-03, grad_scale: 32.0 2023-10-10 07:41:58,650 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.73 vs. limit=15.0 2023-10-10 07:42:06,578 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.30 vs. limit=15.0 2023-10-10 07:42:21,578 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=287653.3333333333, ans=0.125 2023-10-10 07:42:23,253 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=287653.3333333333, ans=0.1 2023-10-10 07:42:25,082 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=287653.3333333333, ans=0.125 2023-10-10 07:42:28,439 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.539e+02 1.830e+02 2.100e+02 2.328e+02 3.721e+02, threshold=4.200e+02, percent-clipped=0.0 2023-10-10 07:42:44,021 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=287746.6666666667, ans=0.125 2023-10-10 07:42:59,241 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.15 vs. 
limit=12.0 2023-10-10 07:43:08,599 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=287840.0, ans=0.125 2023-10-10 07:43:37,469 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=287980.0, ans=0.1 2023-10-10 07:43:39,280 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=287980.0, ans=0.125 2023-10-10 07:43:41,464 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=288026.6666666667, ans=0.0 2023-10-10 07:43:54,979 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.00 vs. limit=15.0 2023-10-10 07:43:57,068 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=288073.3333333333, ans=0.2 2023-10-10 07:44:15,478 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.448e+02 1.756e+02 2.098e+02 2.489e+02 3.991e+02, threshold=4.195e+02, percent-clipped=0.0 2023-10-10 07:44:18,796 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.37 vs. limit=15.0 2023-10-10 07:44:26,517 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=288213.3333333333, ans=0.1 2023-10-10 07:44:44,403 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=288260.0, ans=0.125 2023-10-10 07:44:50,055 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.41 vs. limit=15.0 2023-10-10 07:45:30,693 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=288493.3333333333, ans=0.1 2023-10-10 07:45:37,251 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=288493.3333333333, ans=0.125 2023-10-10 07:46:11,762 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=288586.6666666667, ans=0.2 2023-10-10 07:46:15,439 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.834e+02 2.018e+02 2.506e+02 4.228e+02, threshold=4.035e+02, percent-clipped=1.0 2023-10-10 07:46:51,026 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 07:47:04,678 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=288773.3333333333, ans=0.0 2023-10-10 07:47:17,531 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.52 vs. limit=15.0 2023-10-10 07:47:18,458 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.92 vs. 
limit=15.0 2023-10-10 07:47:36,495 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=288913.3333333333, ans=0.125 2023-10-10 07:47:37,619 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=288960.0, ans=0.125 2023-10-10 07:47:56,955 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=289006.6666666667, ans=0.125 2023-10-10 07:48:02,151 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.97 vs. limit=10.0 2023-10-10 07:48:13,564 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=289053.3333333333, ans=0.125 2023-10-10 07:48:15,171 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.379e+02 1.697e+02 1.903e+02 2.208e+02 3.102e+02, threshold=3.806e+02, percent-clipped=0.0 2023-10-10 07:48:18,192 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=289100.0, ans=0.0 2023-10-10 07:48:25,730 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=289100.0, ans=0.0 2023-10-10 07:48:43,968 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=289193.3333333333, ans=0.125 2023-10-10 07:48:44,859 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=289193.3333333333, ans=0.125 2023-10-10 07:48:50,607 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.89 vs. limit=22.5 2023-10-10 07:49:04,630 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.06 vs. limit=22.5 2023-10-10 07:49:13,570 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=289333.3333333333, ans=0.125 2023-10-10 07:49:24,287 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 07:49:33,893 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=289426.6666666667, ans=0.125 2023-10-10 07:49:37,749 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.58 vs. limit=12.0 2023-10-10 07:49:40,624 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=289426.6666666667, ans=0.125 2023-10-10 07:49:51,971 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.57 vs. 
limit=10.0 2023-10-10 07:50:00,161 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=289520.0, ans=0.1 2023-10-10 07:50:00,241 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=289520.0, ans=0.2 2023-10-10 07:50:04,520 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.367e+02 1.737e+02 1.948e+02 2.143e+02 3.215e+02, threshold=3.897e+02, percent-clipped=0.0 2023-10-10 07:50:08,761 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=289566.6666666667, ans=0.125 2023-10-10 07:50:26,161 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=289660.0, ans=0.125 2023-10-10 07:50:38,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=289706.6666666667, ans=0.1 2023-10-10 07:50:44,578 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=289753.3333333333, ans=0.125 2023-10-10 07:50:50,396 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=289753.3333333333, ans=0.0 2023-10-10 07:50:54,643 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=289800.0, ans=0.95 2023-10-10 07:51:16,495 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=289846.6666666667, ans=0.0 2023-10-10 07:51:18,482 INFO [train.py:1031] (0/4) Epoch 5, batch 7500, loss[loss=0.2479, simple_loss=0.323, pruned_loss=0.08642, over 16598.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3157, pruned_loss=0.07623, over 31994533.57 frames. ], batch size: 66, lr: 7.67e-03, grad_scale: 32.0 2023-10-10 07:51:50,332 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.384e+02 1.794e+02 1.999e+02 2.279e+02 3.344e+02, threshold=3.997e+02, percent-clipped=0.0 2023-10-10 07:51:57,270 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=290033.3333333333, ans=0.125 2023-10-10 07:51:59,387 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.97 vs. limit=15.0 2023-10-10 07:52:06,062 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.47 vs. limit=10.0 2023-10-10 07:52:12,731 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=290126.6666666667, ans=0.125 2023-10-10 07:52:14,509 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.74 vs. 
limit=10.0 2023-10-10 07:52:27,222 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=290173.3333333333, ans=0.125 2023-10-10 07:52:31,664 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=290173.3333333333, ans=0.125 2023-10-10 07:53:28,748 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=290406.6666666667, ans=0.125 2023-10-10 07:53:32,018 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=290453.3333333333, ans=0.125 2023-10-10 07:53:39,206 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=290453.3333333333, ans=0.125 2023-10-10 07:53:40,998 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.76 vs. limit=12.0 2023-10-10 07:53:45,781 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 07:53:47,594 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.74 vs. limit=22.5 2023-10-10 07:53:48,232 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.270e+02 1.710e+02 1.882e+02 2.087e+02 2.889e+02, threshold=3.764e+02, percent-clipped=0.0 2023-10-10 07:53:55,829 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=290500.0, ans=0.2 2023-10-10 07:54:02,614 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=290546.6666666667, ans=0.1 2023-10-10 07:54:06,025 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=290546.6666666667, ans=0.125 2023-10-10 07:54:17,796 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=290593.3333333333, ans=0.0 2023-10-10 07:54:26,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=290640.0, ans=0.125 2023-10-10 07:54:27,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=290640.0, ans=0.0 2023-10-10 07:54:32,349 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.91 vs. limit=15.0 2023-10-10 07:55:18,449 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.16 vs. 
limit=22.5 2023-10-10 07:55:42,396 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.742e+02 1.964e+02 2.310e+02 3.096e+02, threshold=3.928e+02, percent-clipped=0.0 2023-10-10 07:55:49,052 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=290966.6666666667, ans=0.1 2023-10-10 07:56:02,342 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=291060.0, ans=0.1 2023-10-10 07:56:15,416 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 07:56:23,187 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=291153.3333333333, ans=0.125 2023-10-10 07:56:25,088 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=291153.3333333333, ans=0.125 2023-10-10 07:56:32,503 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.51 vs. limit=22.5 2023-10-10 07:56:35,716 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=291200.0, ans=0.05 2023-10-10 07:57:27,251 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=291386.6666666667, ans=0.0 2023-10-10 07:57:31,560 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.758e+02 1.955e+02 2.122e+02 3.376e+02, threshold=3.911e+02, percent-clipped=0.0 2023-10-10 07:57:55,457 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 07:58:16,570 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 07:58:22,531 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=291620.0, ans=0.0 2023-10-10 07:58:30,558 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=291666.6666666667, ans=10.0 2023-10-10 07:58:50,387 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.46 vs. limit=15.0 2023-10-10 07:58:59,682 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=291760.0, ans=0.0 2023-10-10 07:59:01,176 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.93 vs. limit=22.5 2023-10-10 07:59:09,372 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.35 vs. 
limit=15.0 2023-10-10 07:59:24,884 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.359e+02 1.735e+02 1.905e+02 2.268e+02 3.127e+02, threshold=3.809e+02, percent-clipped=0.0 2023-10-10 07:59:34,528 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=291946.6666666667, ans=0.0 2023-10-10 08:00:20,271 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=292086.6666666667, ans=0.125 2023-10-10 08:00:35,451 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=292180.0, ans=0.09899494936611666 2023-10-10 08:00:35,492 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=292180.0, ans=0.1 2023-10-10 08:00:41,502 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=292180.0, ans=0.09899494936611666 2023-10-10 08:00:43,831 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.43 vs. limit=22.5 2023-10-10 08:00:45,265 INFO [train.py:1031] (0/4) Epoch 5, batch 8000, loss[loss=0.2137, simple_loss=0.3012, pruned_loss=0.06313, over 16716.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3145, pruned_loss=0.07529, over 32165197.43 frames. ], batch size: 61, lr: 7.64e-03, grad_scale: 64.0 2023-10-10 08:00:45,609 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=292226.6666666667, ans=0.07 2023-10-10 08:01:10,484 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.49 vs. 
limit=15.0 2023-10-10 08:01:13,637 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=292320.0, ans=0.0 2023-10-10 08:01:15,085 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=292320.0, ans=0.0 2023-10-10 08:01:16,677 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.332e+02 1.612e+02 1.755e+02 2.017e+02 2.749e+02, threshold=3.510e+02, percent-clipped=0.0 2023-10-10 08:01:16,990 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=292366.6666666667, ans=0.2 2023-10-10 08:01:17,924 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=292366.6666666667, ans=0.2 2023-10-10 08:01:44,242 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=292460.0, ans=0.125 2023-10-10 08:02:17,600 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=292600.0, ans=0.07 2023-10-10 08:02:17,707 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=292600.0, ans=0.1 2023-10-10 08:02:27,186 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=292646.6666666667, ans=0.0 2023-10-10 08:02:57,441 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=292786.6666666667, ans=0.125 2023-10-10 08:02:59,441 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=292786.6666666667, ans=0.0 2023-10-10 08:03:03,088 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.335e+02 1.804e+02 2.084e+02 2.499e+02 4.118e+02, threshold=4.169e+02, percent-clipped=4.0 2023-10-10 08:03:10,634 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=292833.3333333333, ans=0.2 2023-10-10 08:03:15,588 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=292880.0, ans=0.125 2023-10-10 08:03:20,738 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=292880.0, ans=0.0 2023-10-10 08:03:23,078 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=292880.0, ans=0.125 2023-10-10 08:03:45,414 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=292926.6666666667, ans=0.0 2023-10-10 08:04:07,531 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=293020.0, ans=0.0 2023-10-10 08:04:19,101 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 08:04:48,721 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=293160.0, ans=0.2 2023-10-10 08:04:50,303 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.21 vs. 
limit=5.0 2023-10-10 08:04:51,301 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.68 vs. limit=15.0 2023-10-10 08:05:04,519 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=293253.3333333333, ans=0.035 2023-10-10 08:05:07,675 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=293253.3333333333, ans=0.125 2023-10-10 08:05:10,384 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.54 vs. limit=22.5 2023-10-10 08:05:14,992 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.368e+02 1.807e+02 2.104e+02 2.432e+02 3.171e+02, threshold=4.208e+02, percent-clipped=0.0 2023-10-10 08:05:26,612 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.67 vs. limit=15.0 2023-10-10 08:05:30,813 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=293346.6666666667, ans=0.0 2023-10-10 08:05:59,976 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.60 vs. limit=15.0 2023-10-10 08:06:02,850 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=293486.6666666667, ans=0.0 2023-10-10 08:06:03,328 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.57 vs. limit=15.0 2023-10-10 08:06:07,970 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=293533.3333333333, ans=0.0 2023-10-10 08:06:10,825 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.12 vs. 
limit=15.0 2023-10-10 08:06:28,541 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=293580.0, ans=10.0 2023-10-10 08:06:38,396 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=293626.6666666667, ans=0.125 2023-10-10 08:06:38,424 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=293626.6666666667, ans=0.05 2023-10-10 08:07:02,709 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=293720.0, ans=0.1 2023-10-10 08:07:06,960 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.725e+02 1.905e+02 2.247e+02 3.014e+02, threshold=3.810e+02, percent-clipped=0.0 2023-10-10 08:07:09,801 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=293766.6666666667, ans=0.0 2023-10-10 08:07:12,381 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=293766.6666666667, ans=0.0 2023-10-10 08:07:16,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=293813.3333333333, ans=0.1 2023-10-10 08:07:18,713 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=293813.3333333333, ans=0.0 2023-10-10 08:07:23,433 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. limit=6.0 2023-10-10 08:07:26,012 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=293860.0, ans=0.0 2023-10-10 08:07:27,105 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=293860.0, ans=0.0 2023-10-10 08:07:27,921 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=293860.0, ans=0.1 2023-10-10 08:07:32,481 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=293860.0, ans=0.125 2023-10-10 08:07:37,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=293906.6666666667, ans=0.0 2023-10-10 08:07:37,975 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.50 vs. limit=12.0 2023-10-10 08:07:43,751 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 08:07:44,288 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.31 vs. 
limit=8.0 2023-10-10 08:07:50,732 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=293953.3333333333, ans=0.0 2023-10-10 08:07:55,372 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=293953.3333333333, ans=0.1 2023-10-10 08:08:03,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=294000.0, ans=0.125 2023-10-10 08:08:26,194 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.55 vs. limit=15.0 2023-10-10 08:08:34,677 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=294140.0, ans=0.1 2023-10-10 08:08:37,495 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=294140.0, ans=0.0 2023-10-10 08:08:45,716 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=294186.6666666667, ans=0.0 2023-10-10 08:08:51,736 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=294186.6666666667, ans=0.125 2023-10-10 08:08:52,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=294186.6666666667, ans=0.1 2023-10-10 08:09:01,139 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.464e+02 1.759e+02 1.939e+02 2.509e+02 3.597e+02, threshold=3.877e+02, percent-clipped=0.0 2023-10-10 08:09:02,393 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=294233.3333333333, ans=0.0 2023-10-10 08:09:28,875 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=294326.6666666667, ans=0.2 2023-10-10 08:09:48,752 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=294420.0, ans=0.125 2023-10-10 08:09:50,592 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=294420.0, ans=0.0 2023-10-10 08:10:01,232 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=294466.6666666667, ans=0.025 2023-10-10 08:10:14,129 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=294513.3333333333, ans=0.0 2023-10-10 08:10:15,983 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 08:10:19,523 INFO [train.py:1031] (0/4) Epoch 5, batch 8500, loss[loss=0.2311, simple_loss=0.3213, pruned_loss=0.07044, over 16835.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3147, pruned_loss=0.07508, over 32324643.51 frames. 
], batch size: 146, lr: 7.61e-03, grad_scale: 32.0 2023-10-10 08:10:32,566 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=294606.6666666667, ans=0.07 2023-10-10 08:10:43,726 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=294653.3333333333, ans=0.0 2023-10-10 08:10:47,566 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=294653.3333333333, ans=0.0 2023-10-10 08:10:51,369 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 08:10:55,035 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.854e+02 2.059e+02 2.413e+02 3.671e+02, threshold=4.118e+02, percent-clipped=0.0 2023-10-10 08:10:58,986 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=294700.0, ans=0.0 2023-10-10 08:11:11,178 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=294746.6666666667, ans=0.0 2023-10-10 08:11:16,126 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=294793.3333333333, ans=0.0 2023-10-10 08:11:23,061 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=294793.3333333333, ans=0.125 2023-10-10 08:11:26,267 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.01 vs. limit=15.0 2023-10-10 08:11:31,540 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=294840.0, ans=0.1 2023-10-10 08:11:53,010 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=294933.3333333333, ans=0.125 2023-10-10 08:12:29,571 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=295073.3333333333, ans=0.0 2023-10-10 08:12:36,312 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=295073.3333333333, ans=0.125 2023-10-10 08:12:51,836 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.285e+02 1.730e+02 2.024e+02 2.456e+02 4.460e+02, threshold=4.048e+02, percent-clipped=2.0 2023-10-10 08:13:18,384 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=295260.0, ans=0.1 2023-10-10 08:13:26,714 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=295306.6666666667, ans=0.0 2023-10-10 08:13:46,773 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=295353.3333333333, ans=0.125 2023-10-10 08:13:55,945 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=295400.0, ans=0.0 2023-10-10 08:13:59,769 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=295400.0, ans=0.1 2023-10-10 08:13:59,849 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=295400.0, ans=0.0 2023-10-10 08:14:03,532 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.90 vs. limit=15.0 2023-10-10 08:14:25,214 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=295493.3333333333, ans=0.125 2023-10-10 08:14:32,056 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=295540.0, ans=0.125 2023-10-10 08:14:37,503 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=295586.6666666667, ans=0.0 2023-10-10 08:14:37,525 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=295586.6666666667, ans=0.125 2023-10-10 08:14:51,325 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=295633.3333333333, ans=0.125 2023-10-10 08:14:53,675 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.377e+02 1.694e+02 1.867e+02 2.096e+02 3.661e+02, threshold=3.734e+02, percent-clipped=0.0 2023-10-10 08:15:09,405 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=295680.0, ans=0.125 2023-10-10 08:15:10,296 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 08:15:22,383 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.52 vs. limit=10.0 2023-10-10 08:15:31,845 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.77 vs. limit=15.0 2023-10-10 08:15:35,080 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.80 vs. limit=22.5 2023-10-10 08:15:43,290 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=295820.0, ans=0.125 2023-10-10 08:15:47,581 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.92 vs. 
limit=6.0 2023-10-10 08:16:12,892 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=295913.3333333333, ans=0.125 2023-10-10 08:16:27,477 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=296006.6666666667, ans=0.0 2023-10-10 08:16:39,800 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=296053.3333333333, ans=0.125 2023-10-10 08:16:42,409 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=296053.3333333333, ans=0.2 2023-10-10 08:16:47,930 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=296100.0, ans=0.125 2023-10-10 08:16:48,591 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.181e+02 1.705e+02 1.934e+02 2.556e+02 3.710e+02, threshold=3.868e+02, percent-clipped=0.0 2023-10-10 08:16:59,874 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 08:17:02,973 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=296146.6666666667, ans=0.125 2023-10-10 08:17:33,811 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.51 vs. limit=10.0 2023-10-10 08:17:36,477 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.66 vs. limit=6.0 2023-10-10 08:17:40,925 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.39 vs. limit=10.0 2023-10-10 08:18:01,004 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 08:18:29,402 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=296520.0, ans=0.0 2023-10-10 08:18:37,265 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.745e+02 1.930e+02 2.121e+02 3.550e+02, threshold=3.859e+02, percent-clipped=0.0 2023-10-10 08:18:44,509 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.70 vs. limit=6.0 2023-10-10 08:18:49,839 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.43 vs. limit=22.5 2023-10-10 08:19:06,453 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.15 vs. 
limit=15.0 2023-10-10 08:19:13,844 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=296706.6666666667, ans=0.125 2023-10-10 08:19:18,428 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=296753.3333333333, ans=0.0 2023-10-10 08:19:27,088 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=296800.0, ans=0.0 2023-10-10 08:19:33,262 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=296800.0, ans=0.1 2023-10-10 08:19:34,668 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.24 vs. limit=15.0 2023-10-10 08:19:51,039 INFO [train.py:1031] (0/4) Epoch 5, batch 9000, loss[loss=0.2496, simple_loss=0.332, pruned_loss=0.08367, over 16631.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3141, pruned_loss=0.07474, over 32436253.77 frames. ], batch size: 202, lr: 7.58e-03, grad_scale: 16.0 2023-10-10 08:19:51,587 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1.whitening_limit, batch_count=296893.3333333333, ans=10.0 2023-10-10 08:19:53,508 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.49 vs. limit=15.0 2023-10-10 08:20:01,797 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.19 vs. limit=12.0 2023-10-10 08:20:13,820 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=296986.6666666667, ans=0.125 2023-10-10 08:20:15,599 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=296986.6666666667, ans=0.0 2023-10-10 08:20:23,845 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=297033.3333333333, ans=0.125 2023-10-10 08:20:24,806 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=297033.3333333333, ans=0.125 2023-10-10 08:20:25,682 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=297033.3333333333, ans=0.125 2023-10-10 08:20:26,236 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.447e+02 1.828e+02 1.977e+02 2.231e+02 3.067e+02, threshold=3.954e+02, percent-clipped=0.0 2023-10-10 08:20:33,555 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=297080.0, ans=0.2 2023-10-10 08:20:48,027 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=297126.6666666667, ans=0.2 2023-10-10 08:20:51,848 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=297126.6666666667, ans=0.2 2023-10-10 08:21:12,405 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=297220.0, ans=0.0 2023-10-10 08:21:13,409 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=297266.6666666667, ans=0.0 2023-10-10 08:21:15,322 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=297266.6666666667, ans=0.0 2023-10-10 08:21:46,638 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=297406.6666666667, ans=0.125 2023-10-10 08:21:49,279 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=297406.6666666667, ans=0.125 2023-10-10 08:21:54,736 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.27 vs. limit=12.0 2023-10-10 08:22:00,474 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.61 vs. limit=10.0 2023-10-10 08:22:08,809 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.704e+02 1.862e+02 2.145e+02 3.676e+02, threshold=3.723e+02, percent-clipped=0.0 2023-10-10 08:22:11,786 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.45 vs. limit=10.0 2023-10-10 08:22:13,230 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=297500.0, ans=0.0 2023-10-10 08:22:24,181 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.75 vs. limit=5.0 2023-10-10 08:22:27,121 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=297593.3333333333, ans=0.125 2023-10-10 08:22:33,703 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.35 vs. limit=8.0 2023-10-10 08:23:07,553 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=297780.0, ans=0.04949747468305833 2023-10-10 08:23:19,196 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.70 vs. limit=15.0 2023-10-10 08:23:35,641 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=297873.3333333333, ans=0.0 2023-10-10 08:23:39,373 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.97 vs. limit=12.0 2023-10-10 08:23:43,445 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 08:23:48,226 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.46 vs. limit=22.5 2023-10-10 08:23:51,996 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.393e+02 1.899e+02 2.023e+02 2.346e+02 3.303e+02, threshold=4.046e+02, percent-clipped=0.0 2023-10-10 08:23:56,570 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.81 vs. 
limit=15.0 2023-10-10 08:24:02,930 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=298013.3333333333, ans=0.125 2023-10-10 08:24:09,394 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=298060.0, ans=0.125 2023-10-10 08:24:09,515 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=298060.0, ans=0.95 2023-10-10 08:24:21,775 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=298106.6666666667, ans=0.0 2023-10-10 08:24:28,811 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=298106.6666666667, ans=0.2 2023-10-10 08:24:30,597 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=298153.3333333333, ans=0.125 2023-10-10 08:24:31,751 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.62 vs. limit=10.0 2023-10-10 08:24:34,659 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=298153.3333333333, ans=0.0 2023-10-10 08:24:41,360 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.08 vs. limit=15.0 2023-10-10 08:25:01,536 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=298293.3333333333, ans=0.0 2023-10-10 08:25:37,052 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.367e+02 1.800e+02 1.994e+02 2.234e+02 3.160e+02, threshold=3.987e+02, percent-clipped=0.0 2023-10-10 08:25:47,130 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=298480.0, ans=0.0 2023-10-10 08:26:00,293 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=298526.6666666667, ans=0.125 2023-10-10 08:26:17,460 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=298573.3333333333, ans=0.125 2023-10-10 08:26:22,080 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=298573.3333333333, ans=0.95 2023-10-10 08:26:32,944 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=298620.0, ans=0.1 2023-10-10 08:26:37,337 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-64000.pt 2023-10-10 08:27:08,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=298760.0, ans=0.125 2023-10-10 08:27:39,937 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.428e+02 1.875e+02 2.129e+02 2.516e+02 3.765e+02, threshold=4.257e+02, percent-clipped=0.0 2023-10-10 08:27:57,651 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=298993.3333333333, ans=0.125 2023-10-10 08:28:26,106 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=299086.6666666667, ans=0.0 2023-10-10 08:28:56,359 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=299226.6666666667, ans=0.05 2023-10-10 08:28:57,034 INFO [train.py:1031] (0/4) Epoch 5, batch 9500, loss[loss=0.2606, simple_loss=0.3338, pruned_loss=0.09366, over 15822.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3147, pruned_loss=0.07524, over 32469097.52 frames. ], batch size: 350, lr: 7.55e-03, grad_scale: 32.0 2023-10-10 08:28:57,240 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=299226.6666666667, ans=0.1 2023-10-10 08:28:57,263 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=299226.6666666667, ans=0.0 2023-10-10 08:28:58,108 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=299226.6666666667, ans=0.125 2023-10-10 08:29:16,217 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=299273.3333333333, ans=0.07 2023-10-10 08:29:18,472 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=299320.0, ans=0.125 2023-10-10 08:29:31,649 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.334e+02 1.714e+02 1.907e+02 2.140e+02 2.989e+02, threshold=3.815e+02, percent-clipped=0.0 2023-10-10 08:29:36,116 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=299366.6666666667, ans=0.0 2023-10-10 08:29:44,919 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.87 vs. limit=15.0 2023-10-10 08:29:48,381 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.80 vs. limit=10.0 2023-10-10 08:29:51,620 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=299460.0, ans=0.125 2023-10-10 08:29:54,749 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.02 vs. limit=15.0 2023-10-10 08:30:03,422 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=299506.6666666667, ans=0.125 2023-10-10 08:30:06,942 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=299506.6666666667, ans=0.125 2023-10-10 08:30:08,289 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.28 vs. 
limit=6.0 2023-10-10 08:30:13,685 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=299553.3333333333, ans=0.125 2023-10-10 08:30:48,225 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=299693.3333333333, ans=0.1 2023-10-10 08:30:55,437 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.42 vs. limit=15.0 2023-10-10 08:31:01,971 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=299740.0, ans=0.0 2023-10-10 08:31:22,599 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.440e+02 1.772e+02 1.968e+02 2.304e+02 4.066e+02, threshold=3.936e+02, percent-clipped=1.0 2023-10-10 08:31:26,561 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=299833.3333333333, ans=0.125 2023-10-10 08:31:26,729 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.17 vs. limit=15.0 2023-10-10 08:31:30,172 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=299880.0, ans=0.125 2023-10-10 08:31:38,475 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=299880.0, ans=0.05 2023-10-10 08:31:42,963 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=299926.6666666667, ans=0.0 2023-10-10 08:31:47,852 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=299926.6666666667, ans=0.125 2023-10-10 08:31:54,593 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.12 vs. limit=10.0 2023-10-10 08:31:55,793 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=299973.3333333333, ans=0.0 2023-10-10 08:32:16,993 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.22 vs. limit=15.0 2023-10-10 08:32:23,636 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=300066.6666666667, ans=0.125 2023-10-10 08:32:23,783 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=300066.6666666667, ans=0.2 2023-10-10 08:32:30,477 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.97 vs. 
limit=6.0 2023-10-10 08:32:32,814 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=300113.3333333333, ans=0.1 2023-10-10 08:32:35,484 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=300113.3333333333, ans=0.125 2023-10-10 08:32:39,404 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=300160.0, ans=0.2 2023-10-10 08:32:44,033 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.74 vs. limit=15.0 2023-10-10 08:32:52,353 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=300206.6666666667, ans=0.125 2023-10-10 08:33:13,057 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.653e+02 1.883e+02 2.134e+02 3.924e+02, threshold=3.765e+02, percent-clipped=0.0 2023-10-10 08:33:23,801 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=300346.6666666667, ans=0.0 2023-10-10 08:33:36,681 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=300393.3333333333, ans=0.0 2023-10-10 08:33:44,589 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=300440.0, ans=0.04949747468305833 2023-10-10 08:33:47,250 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=300440.0, ans=0.125 2023-10-10 08:33:57,286 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=300486.6666666667, ans=0.125 2023-10-10 08:34:01,567 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=300486.6666666667, ans=0.1 2023-10-10 08:34:18,780 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=300580.0, ans=0.0 2023-10-10 08:34:22,174 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=300580.0, ans=0.1 2023-10-10 08:34:25,296 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.42 vs. 
limit=15.0 2023-10-10 08:34:30,473 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 08:34:40,130 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=300673.3333333333, ans=0.1 2023-10-10 08:35:02,346 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 1.736e+02 1.970e+02 2.215e+02 3.451e+02, threshold=3.940e+02, percent-clipped=0.0 2023-10-10 08:35:03,239 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=300766.6666666667, ans=0.125 2023-10-10 08:35:05,705 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=300766.6666666667, ans=0.125 2023-10-10 08:35:10,061 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.03 vs. limit=15.0 2023-10-10 08:35:46,482 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.13 vs. limit=22.5 2023-10-10 08:35:58,823 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=301000.0, ans=0.2 2023-10-10 08:36:10,177 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=301046.6666666667, ans=0.0 2023-10-10 08:36:15,827 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=301093.3333333333, ans=0.125 2023-10-10 08:36:22,967 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=301093.3333333333, ans=0.0 2023-10-10 08:36:31,964 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=301140.0, ans=0.1 2023-10-10 08:36:34,455 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=301140.0, ans=0.0 2023-10-10 08:36:39,086 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=301186.6666666667, ans=0.0 2023-10-10 08:36:39,956 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=301186.6666666667, ans=0.125 2023-10-10 08:36:49,862 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.355e+02 1.695e+02 1.862e+02 2.114e+02 3.045e+02, threshold=3.724e+02, percent-clipped=0.0 2023-10-10 08:37:00,493 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=301280.0, ans=0.125 2023-10-10 08:37:05,224 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=301280.0, ans=0.0 2023-10-10 08:37:18,471 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=301373.3333333333, ans=0.125 2023-10-10 08:37:35,519 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=301420.0, ans=0.125 2023-10-10 08:37:36,625 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=301420.0, ans=0.125 
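The `ScheduledFloat` records above report module hyperparameters (attention/conv/ff skip rates, `out_proj.dropout_p`, balancer probabilities and bounds) whose current value (`ans`) is looked up from the global `batch_count` rather than held fixed; by batch_count ≈ 290000–305000 the skip rates in these lines have annealed to 0.0 while the out_proj dropouts sit at 0.1. A minimal sketch of such a batch-count-keyed schedule, assuming piecewise-linear interpolation between breakpoints; the class name and the example breakpoints below are illustrative assumptions, not the actual scaling.py API:

```python
# Illustrative sketch of a float hyperparameter scheduled on batch count,
# in the spirit of the ScheduledFloat values logged above. The breakpoints
# are assumptions for demonstration, not values taken from the recipe.
class PiecewiseLinearFloat:
    def __init__(self, *points):
        # points: (batch_count, value) pairs defining the schedule
        self.points = sorted(points)

    def value_at(self, batch_count: float) -> float:
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        if batch_count >= pts[-1][0]:
            return pts[-1][1]
        # Linear interpolation between the two surrounding breakpoints.
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)


# A skip rate decaying from 0.2 to 0.0 by batch 4000 would read ans=0.0 at
# the batch counts in the attention_skip_rate records above:
skip_rate = PiecewiseLinearFloat((0.0, 0.2), (4000.0, 0.0))
print(skip_rate.value_at(293813.0))  # -> 0.0
```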
2023-10-10 08:38:00,580 INFO [train.py:1031] (0/4) Epoch 5, batch 10000, loss[loss=0.2048, simple_loss=0.2927, pruned_loss=0.05846, over 16938.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3134, pruned_loss=0.07459, over 32501424.19 frames. ], batch size: 77, lr: 7.52e-03, grad_scale: 32.0 2023-10-10 08:38:19,701 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 08:38:25,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=301653.3333333333, ans=0.125 2023-10-10 08:38:33,951 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.379e+02 1.726e+02 1.926e+02 2.197e+02 3.835e+02, threshold=3.852e+02, percent-clipped=1.0 2023-10-10 08:38:54,690 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=301793.3333333333, ans=0.2 2023-10-10 08:39:33,755 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=301933.3333333333, ans=0.0 2023-10-10 08:39:40,992 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.02 vs. limit=22.5 2023-10-10 08:39:54,856 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=302026.6666666667, ans=0.0 2023-10-10 08:40:04,557 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.35 vs. limit=10.0 2023-10-10 08:40:14,655 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=302120.0, ans=0.125 2023-10-10 08:40:25,940 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=302166.6666666667, ans=0.125 2023-10-10 08:40:26,405 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.892e+02 2.164e+02 2.636e+02 3.771e+02, threshold=4.328e+02, percent-clipped=0.0 2023-10-10 08:40:33,845 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=302213.3333333333, ans=0.125 2023-10-10 08:40:55,250 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=302306.6666666667, ans=0.125 2023-10-10 08:41:16,131 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=302400.0, ans=0.0 2023-10-10 08:41:29,302 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=302446.6666666667, ans=0.0 2023-10-10 08:41:40,079 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.58 vs. 
limit=15.0 2023-10-10 08:42:09,477 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=302586.6666666667, ans=0.125 2023-10-10 08:42:16,510 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=302633.3333333333, ans=0.125 2023-10-10 08:42:18,132 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.749e+02 2.002e+02 2.228e+02 3.571e+02, threshold=4.004e+02, percent-clipped=0.0 2023-10-10 08:42:36,479 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=302726.6666666667, ans=0.025 2023-10-10 08:42:52,763 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=302773.3333333333, ans=0.2 2023-10-10 08:43:11,874 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.58 vs. limit=15.0 2023-10-10 08:43:18,857 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.71 vs. limit=10.0 2023-10-10 08:43:39,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=302960.0, ans=0.0 2023-10-10 08:43:59,857 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.89 vs. limit=10.0 2023-10-10 08:44:09,307 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.401e+02 1.754e+02 1.917e+02 2.185e+02 3.651e+02, threshold=3.833e+02, percent-clipped=0.0 2023-10-10 08:44:26,470 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.84 vs. limit=22.5 2023-10-10 08:44:44,422 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=303240.0, ans=0.0 2023-10-10 08:44:50,005 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.19 vs. limit=15.0 2023-10-10 08:44:53,895 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=303286.6666666667, ans=0.2 2023-10-10 08:45:06,705 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=303333.3333333333, ans=0.0 2023-10-10 08:45:09,444 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=303333.3333333333, ans=0.125 2023-10-10 08:45:15,977 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=303380.0, ans=0.125 2023-10-10 08:45:18,856 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.08 vs. 
limit=22.5 2023-10-10 08:45:22,245 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=303380.0, ans=0.125 2023-10-10 08:45:24,111 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=303380.0, ans=0.125 2023-10-10 08:45:36,282 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.02 vs. limit=22.5 2023-10-10 08:45:48,282 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=303520.0, ans=0.125 2023-10-10 08:46:04,154 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.338e+02 1.745e+02 1.944e+02 2.297e+02 2.983e+02, threshold=3.888e+02, percent-clipped=0.0 2023-10-10 08:46:36,135 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=303706.6666666667, ans=0.125 2023-10-10 08:46:44,649 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=303753.3333333333, ans=0.5 2023-10-10 08:46:53,872 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=303753.3333333333, ans=0.125 2023-10-10 08:47:05,813 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.22 vs. limit=15.0 2023-10-10 08:47:17,274 INFO [train.py:1031] (0/4) Epoch 5, batch 10500, loss[loss=0.2115, simple_loss=0.2985, pruned_loss=0.06229, over 16889.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3139, pruned_loss=0.07467, over 32559762.69 frames. ], batch size: 87, lr: 7.50e-03, grad_scale: 32.0 2023-10-10 08:47:21,008 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=303893.3333333333, ans=0.125 2023-10-10 08:47:22,132 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=303893.3333333333, ans=0.1 2023-10-10 08:47:24,163 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.40 vs. limit=15.0 2023-10-10 08:47:26,258 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=303893.3333333333, ans=0.125 2023-10-10 08:47:49,688 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=304033.3333333333, ans=0.125 2023-10-10 08:47:52,673 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.452e+02 1.710e+02 2.026e+02 2.351e+02 3.946e+02, threshold=4.051e+02, percent-clipped=1.0 2023-10-10 08:47:55,143 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.76 vs. limit=6.0 2023-10-10 08:47:58,098 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=304033.3333333333, ans=0.2 2023-10-10 08:48:08,687 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.17 vs. 
2023-10-10 08:48:35,555 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.30 vs. limit=15.0
2023-10-10 08:48:43,508 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.69 vs. limit=15.0
2023-10-10 08:48:46,929 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.88 vs. limit=6.0
2023-10-10 08:48:49,890 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.37 vs. limit=15.0
2023-10-10 08:49:04,469 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=304313.3333333333, ans=0.125
2023-10-10 08:49:47,282 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=304500.0, ans=0.1
2023-10-10 08:49:51,490 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.401e+02 1.765e+02 2.005e+02 2.243e+02 3.603e+02, threshold=4.010e+02, percent-clipped=0.0
2023-10-10 08:50:05,903 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=304546.6666666667, ans=0.0
2023-10-10 08:50:31,751 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=304640.0, ans=0.125
2023-10-10 08:50:56,187 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=304733.3333333333, ans=0.0
2023-10-10 08:51:09,397 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=304780.0, ans=0.05
2023-10-10 08:51:09,439 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=304780.0, ans=0.125
2023-10-10 08:51:13,963 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=304826.6666666667, ans=0.125
2023-10-10 08:51:36,662 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=304920.0, ans=0.04949747468305833
2023-10-10 08:51:46,927 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.820e+02 2.072e+02 2.371e+02 3.456e+02, threshold=4.144e+02, percent-clipped=0.0
2023-10-10 08:52:16,947 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=305106.6666666667, ans=0.1
2023-10-10 08:52:31,157 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-10 08:53:03,648 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.97 vs. limit=12.0
2023-10-10 08:53:08,724 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=305293.3333333333, ans=0.125
2023-10-10 08:53:11,353 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=305340.0, ans=0.0
2023-10-10 08:53:27,192 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.34 vs. limit=15.0
2023-10-10 08:53:34,228 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.440e+02 1.816e+02 2.097e+02 2.281e+02 3.481e+02, threshold=4.195e+02, percent-clipped=0.0
2023-10-10 08:53:59,623 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=305526.6666666667, ans=0.05
2023-10-10 08:54:08,587 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.04 vs. limit=15.0
2023-10-10 08:54:21,916 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=305620.0, ans=0.125
2023-10-10 08:54:51,220 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=305760.0, ans=0.125
2023-10-10 08:54:53,117 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=305760.0, ans=0.025
2023-10-10 08:54:57,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=305760.0, ans=0.0
2023-10-10 08:55:01,324 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.04 vs. limit=15.0
2023-10-10 08:55:08,536 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=305806.6666666667, ans=0.0
2023-10-10 08:55:12,230 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.22 vs. limit=10.0
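The ScheduledFloat lines trace hyperparameters (dropout probabilities, skip rates, balancer bounds) whose logged value "ans" is a function of batch_count. A minimal sketch of such a schedule as piecewise-linear interpolation over (batch, value) breakpoints; the class name and the breakpoints below are hypothetical, not the recipe's actual schedule:

    from bisect import bisect_right

    class PiecewiseSchedule:
        # A float hyperparameter scheduled on batch count: linear between
        # breakpoints, constant outside them.
        def __init__(self, *points):
            self.batches = [b for b, _ in points]
            self.values = [v for _, v in points]

        def __call__(self, batch_count: float) -> float:
            i = bisect_right(self.batches, batch_count)
            if i == 0:
                return self.values[0]
            if i == len(self.batches):
                return self.values[-1]
            b0, b1 = self.batches[i - 1], self.batches[i]
            v0, v1 = self.values[i - 1], self.values[i]
            return v0 + (v1 - v0) * (batch_count - b0) / (b1 - b0)

    # e.g. a skip rate that decays from 0.2 to 0.0 over the first 4000 batches:
    skip_rate = PiecewiseSchedule((0.0, 0.2), (4000.0, 0.0))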
2023-10-10 08:55:18,636 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=305853.3333333333, ans=0.125
2023-10-10 08:55:23,893 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.279e+02 1.686e+02 1.922e+02 2.175e+02 3.150e+02, threshold=3.843e+02, percent-clipped=0.0
2023-10-10 08:55:24,963 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=305900.0, ans=0.125
2023-10-10 08:55:24,969 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=305900.0, ans=0.125
2023-10-10 08:55:25,642 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=305900.0, ans=0.0
2023-10-10 08:55:37,832 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=305946.6666666667, ans=0.0
2023-10-10 08:55:54,597 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=306040.0, ans=0.125
2023-10-10 08:56:02,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=306086.6666666667, ans=0.1
2023-10-10 08:56:06,293 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.37 vs. limit=15.0
2023-10-10 08:56:27,237 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=306180.0, ans=0.1
2023-10-10 08:56:35,327 INFO [train.py:1031] (0/4) Epoch 5, batch 11000, loss[loss=0.2327, simple_loss=0.3218, pruned_loss=0.07176, over 16876.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3139, pruned_loss=0.07473, over 32592020.31 frames. ], batch size: 87, lr: 7.47e-03, grad_scale: 32.0
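The grad_scale: 32.0 in these batch summaries is the loss-scaling factor of fp16 mixed-precision training. A generic PyTorch AMP step showing where such a scale lives; this is a sketch, not the recipe's actual training loop:

    import torch

    scaler = torch.cuda.amp.GradScaler()  # grad_scale in the log ~ scaler.get_scale()

    def train_step(model, optimizer, features, targets, loss_fn):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast(dtype=torch.float16):
            loss = loss_fn(model(features), targets)
        scaler.scale(loss).backward()  # backward on the scaled loss
        scaler.step(optimizer)         # unscales grads, steps only if they are finite
        scaler.update()                # grows/shrinks the scale, e.g. to 32.0
        return loss.detach()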
2023-10-10 08:57:10,624 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.389e+02 1.787e+02 2.089e+02 2.494e+02 3.681e+02, threshold=4.178e+02, percent-clipped=0.0
2023-10-10 08:57:11,779 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=306366.6666666667, ans=0.0
2023-10-10 08:57:22,152 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=306413.3333333333, ans=0.0
2023-10-10 08:57:28,151 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=306460.0, ans=0.125
2023-10-10 08:57:39,902 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=306506.6666666667, ans=0.0
2023-10-10 08:57:44,789 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-10 08:57:49,623 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=306506.6666666667, ans=0.0
2023-10-10 08:57:50,760 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=306506.6666666667, ans=0.0
2023-10-10 08:58:15,605 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=306646.6666666667, ans=0.125
2023-10-10 08:58:23,926 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=306693.3333333333, ans=0.0
2023-10-10 08:58:38,600 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.33 vs. limit=8.0
2023-10-10 08:58:39,770 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=306740.0, ans=0.125
2023-10-10 08:58:47,985 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.16 vs. limit=15.0
2023-10-10 08:58:49,394 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=306740.0, ans=0.0
2023-10-10 08:59:05,601 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=306833.3333333333, ans=0.125
2023-10-10 08:59:06,460 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.304e+02 1.714e+02 1.986e+02 2.320e+02 3.498e+02, threshold=3.972e+02, percent-clipped=0.0
2023-10-10 08:59:35,034 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=306926.6666666667, ans=0.1
2023-10-10 08:59:53,759 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=307020.0, ans=0.0
2023-10-10 08:59:54,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=307020.0, ans=0.0
2023-10-10 08:59:56,773 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=307020.0, ans=0.0
2023-10-10 09:00:26,378 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.95 vs. limit=10.0
2023-10-10 09:00:29,896 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=307160.0, ans=0.04949747468305833
2023-10-10 09:00:55,902 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=307300.0, ans=0.125
2023-10-10 09:00:58,837 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.346e+02 1.655e+02 1.862e+02 2.211e+02 3.402e+02, threshold=3.723e+02, percent-clipped=0.0
2023-10-10 09:01:03,158 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.11 vs. limit=15.0
2023-10-10 09:01:20,011 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=307393.3333333333, ans=0.125
2023-10-10 09:01:45,378 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=307486.6666666667, ans=0.0
2023-10-10 09:01:49,450 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=307486.6666666667, ans=0.125
2023-10-10 09:01:56,100 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=307533.3333333333, ans=0.1
2023-10-10 09:02:02,239 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.75 vs. limit=10.0
2023-10-10 09:02:16,397 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=307626.6666666667, ans=0.0
2023-10-10 09:02:16,679 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.21 vs. limit=6.0
2023-10-10 09:02:22,436 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-10-10 09:02:27,398 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=307673.3333333333, ans=0.2
2023-10-10 09:02:36,561 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=307720.0, ans=0.0
2023-10-10 09:02:52,167 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.375e+02 1.667e+02 1.872e+02 2.190e+02 3.334e+02, threshold=3.745e+02, percent-clipped=0.0
2023-10-10 09:02:52,391 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=307766.6666666667, ans=0.0
2023-10-10 09:02:52,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=307766.6666666667, ans=0.125
2023-10-10 09:03:05,492 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.19 vs. limit=15.0
2023-10-10 09:03:11,027 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=307860.0, ans=0.125
2023-10-10 09:03:36,055 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=307953.3333333333, ans=0.125
2023-10-10 09:04:06,337 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=308093.3333333333, ans=0.1
2023-10-10 09:04:12,422 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=308093.3333333333, ans=0.07
2023-10-10 09:04:34,797 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=308186.6666666667, ans=0.125
2023-10-10 09:04:43,979 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.385e+02 1.841e+02 2.114e+02 2.376e+02 2.991e+02, threshold=4.229e+02, percent-clipped=0.0
2023-10-10 09:04:49,709 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=308233.3333333333, ans=0.125
2023-10-10 09:04:52,090 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=308280.0, ans=0.0
2023-10-10 09:04:52,095 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=308280.0, ans=0.125
2023-10-10 09:04:58,961 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=308280.0, ans=0.125
2023-10-10 09:05:20,028 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=308373.3333333333, ans=6.0
2023-10-10 09:05:41,565 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.93 vs. limit=10.0
2023-10-10 09:05:42,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=308466.6666666667, ans=0.0
2023-10-10 09:05:47,184 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=308513.3333333333, ans=0.0
2023-10-10 09:05:55,074 INFO [train.py:1031] (0/4) Epoch 5, batch 11500, loss[loss=0.2686, simple_loss=0.3184, pruned_loss=0.1094, over 12434.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3137, pruned_loss=0.07456, over 32654852.70 frames. ], batch size: 440, lr: 7.44e-03, grad_scale: 32.0
2023-10-10 09:06:10,160 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=308606.6666666667, ans=0.2
2023-10-10 09:06:25,745 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.60 vs. limit=15.0
2023-10-10 09:06:30,791 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.478e+02 1.867e+02 1.980e+02 2.225e+02 2.951e+02, threshold=3.960e+02, percent-clipped=0.0
2023-10-10 09:06:42,183 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=308746.6666666667, ans=0.1
2023-10-10 09:06:56,910 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=308793.3333333333, ans=0.125
2023-10-10 09:06:58,954 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=308793.3333333333, ans=0.125
2023-10-10 09:07:04,194 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=308840.0, ans=0.2
2023-10-10 09:07:07,401 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=308840.0, ans=0.0
2023-10-10 09:07:26,226 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-10 09:07:27,920 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=308933.3333333333, ans=0.125
2023-10-10 09:07:29,217 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=308933.3333333333, ans=0.0
2023-10-10 09:07:29,354 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=308933.3333333333, ans=0.0
2023-10-10 09:07:36,251 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=308933.3333333333, ans=0.1
2023-10-10 09:07:49,225 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=308980.0, ans=10.0
2023-10-10 09:07:59,981 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=309026.6666666667, ans=0.0
2023-10-10 09:08:03,637 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.26 vs. limit=15.0
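The learning rate decays within the epoch (7.50e-03 at batch 10500, 7.47e-03 at 11000, 7.44e-03 at 11500) and drops again at each epoch boundary, which is consistent with icefall's Eden schedule. The published Eden formula is sketched below under the assumption that this run uses it with its configured lr_batches=7500 and lr_epochs=1.0; treat the assumption, not the formula, as the uncertain part:

    def eden_lr(base_lr: float, batch: int, epoch: float,
                lr_batches: float = 7500.0, lr_epochs: float = 1.0) -> float:
        # Eden: polynomial decay in both the batch index and the
        # (possibly fractional) epoch count.
        batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
        epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
        return base_lr * batch_factor * epoch_factor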
2023-10-10 09:08:15,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=309120.0, ans=0.125
2023-10-10 09:08:32,673 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 1.687e+02 1.838e+02 2.019e+02 3.151e+02, threshold=3.676e+02, percent-clipped=0.0
2023-10-10 09:09:09,233 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=309306.6666666667, ans=0.125
2023-10-10 09:09:24,890 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=309400.0, ans=0.125
2023-10-10 09:09:52,847 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.85 vs. limit=22.5
2023-10-10 09:09:58,200 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=309540.0, ans=0.125
2023-10-10 09:10:19,601 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.462e+02 1.801e+02 2.107e+02 2.332e+02 3.479e+02, threshold=4.215e+02, percent-clipped=0.0
2023-10-10 09:10:35,884 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=309680.0, ans=0.2
2023-10-10 09:10:37,505 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=309680.0, ans=0.125
2023-10-10 09:11:12,781 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=309820.0, ans=0.0
2023-10-10 09:11:26,431 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.53 vs. limit=15.0
2023-10-10 09:11:41,157 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=309913.3333333333, ans=0.0
2023-10-10 09:11:42,164 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.45 vs. limit=15.0
2023-10-10 09:11:47,460 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=309960.0, ans=0.125
2023-10-10 09:11:54,297 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=309960.0, ans=0.2
2023-10-10 09:11:57,996 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.52 vs. limit=15.0
2023-10-10 09:12:04,014 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=310006.6666666667, ans=0.0
2023-10-10 09:12:13,212 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=310053.3333333333, ans=0.0
2023-10-10 09:12:13,277 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=310053.3333333333, ans=0.2
2023-10-10 09:12:16,231 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=310053.3333333333, ans=0.125
2023-10-10 09:12:25,637 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=310100.0, ans=0.0
2023-10-10 09:12:27,158 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.350e+02 1.695e+02 1.871e+02 2.202e+02 3.745e+02, threshold=3.741e+02, percent-clipped=0.0
2023-10-10 09:12:41,353 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=310146.6666666667, ans=0.125
2023-10-10 09:12:43,625 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.47 vs. limit=15.0
2023-10-10 09:13:09,460 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=310286.6666666667, ans=0.0
2023-10-10 09:13:17,288 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=310333.3333333333, ans=0.0
2023-10-10 09:13:27,389 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=310333.3333333333, ans=0.2
2023-10-10 09:13:34,801 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=310380.0, ans=0.125
2023-10-10 09:13:39,779 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=310380.0, ans=0.0
2023-10-10 09:13:50,799 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=310426.6666666667, ans=0.0
2023-10-10 09:13:52,002 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.54 vs. limit=10.0
2023-10-10 09:13:55,933 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.84 vs. limit=22.5
2023-10-10 09:13:58,832 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=310473.3333333333, ans=0.2
2023-10-10 09:14:02,018 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=310473.3333333333, ans=0.125
2023-10-10 09:14:07,925 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=310520.0, ans=0.125
2023-10-10 09:14:21,797 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.415e+02 1.731e+02 1.956e+02 2.274e+02 4.291e+02, threshold=3.913e+02, percent-clipped=3.0
2023-10-10 09:14:24,931 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=310566.6666666667, ans=0.05
2023-10-10 09:14:32,834 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=310613.3333333333, ans=0.125
2023-10-10 09:14:32,881 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=310613.3333333333, ans=0.125
2023-10-10 09:14:37,244 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.87 vs. limit=10.0
2023-10-10 09:14:40,093 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.22 vs. limit=10.0
2023-10-10 09:14:49,210 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.61 vs. limit=22.5
2023-10-10 09:14:55,011 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=310706.6666666667, ans=0.125
2023-10-10 09:15:02,323 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=310753.3333333333, ans=0.0
2023-10-10 09:15:02,485 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=310753.3333333333, ans=0.125
2023-10-10 09:15:07,088 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=310753.3333333333, ans=0.2
2023-10-10 09:15:26,696 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.24 vs. limit=22.5
2023-10-10 09:15:28,259 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=310846.6666666667, ans=0.125
2023-10-10 09:15:32,694 INFO [train.py:1031] (0/4) Epoch 5, batch 12000, loss[loss=0.2466, simple_loss=0.3273, pruned_loss=0.08295, over 16927.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3136, pruned_loss=0.07418, over 32689840.63 frames. ], batch size: 138, lr: 7.41e-03, grad_scale: 32.0
2023-10-10 09:15:46,261 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.58 vs. limit=15.0
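tot_loss[...] is not a plain epoch average: the fractional frame counts (e.g. "over 32689840.63 frames") indicate a decayed, frame-weighted running aggregate. A sketch with a hypothetical decay of 1 - 1/2000, which would reproduce the observed steady state of roughly 2000 batches' worth of frames (~33 M at ~16 k frames per batch); both the decay constant and the function name are assumptions:

    def update_tot_loss(tot_loss_sum, tot_frames, batch_loss_sum, batch_frames,
                        decay=1.0 - 1.0 / 2000.0):
        # Decay the running sums, then add the current batch; the reported
        # tot_loss is the ratio of the two running sums.
        tot_loss_sum = tot_loss_sum * decay + batch_loss_sum
        tot_frames = tot_frames * decay + batch_frames
        return tot_loss_sum, tot_frames, tot_loss_sum / tot_frames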
2023-10-10 09:15:46,836 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=310940.0, ans=0.125
2023-10-10 09:15:52,352 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.53 vs. limit=15.0
2023-10-10 09:16:09,360 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.332e+02 1.693e+02 1.865e+02 2.201e+02 3.007e+02, threshold=3.730e+02, percent-clipped=0.0
2023-10-10 09:16:25,697 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=311080.0, ans=0.0
2023-10-10 09:16:31,189 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=311126.6666666667, ans=0.125
2023-10-10 09:16:48,235 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.84 vs. limit=6.0
2023-10-10 09:17:02,391 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=311266.6666666667, ans=0.125
2023-10-10 09:17:05,130 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=311266.6666666667, ans=0.125
2023-10-10 09:17:06,455 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.77 vs. limit=15.0
2023-10-10 09:17:26,501 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.17 vs. limit=12.0
2023-10-10 09:17:31,407 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=311360.0, ans=0.125
2023-10-10 09:17:36,384 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=311406.6666666667, ans=0.0
2023-10-10 09:18:02,690 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.396e+02 1.809e+02 2.102e+02 2.599e+02 4.198e+02, threshold=4.205e+02, percent-clipped=2.0
2023-10-10 09:18:05,688 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=311500.0, ans=0.0
2023-10-10 09:18:09,824 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.54 vs. limit=15.0
2023-10-10 09:18:36,508 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=311686.6666666667, ans=0.125
2023-10-10 09:18:51,327 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=311733.3333333333, ans=0.2
2023-10-10 09:18:57,925 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=311780.0, ans=0.1
2023-10-10 09:19:30,944 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=311920.0, ans=0.2
2023-10-10 09:19:41,655 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=311966.6666666667, ans=0.2
2023-10-10 09:19:44,942 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.386e+02 1.795e+02 1.983e+02 2.315e+02 3.227e+02, threshold=3.967e+02, percent-clipped=0.0
2023-10-10 09:20:09,074 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=312060.0, ans=0.125
2023-10-10 09:21:12,780 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.77 vs. limit=12.0
2023-10-10 09:21:13,684 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.16 vs. limit=22.5
2023-10-10 09:21:16,173 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=312386.6666666667, ans=0.05
2023-10-10 09:21:31,405 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.474e+02 1.829e+02 2.024e+02 2.357e+02 3.379e+02, threshold=4.047e+02, percent-clipped=0.0
2023-10-10 09:21:42,965 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=312480.0, ans=0.125
2023-10-10 09:21:46,369 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=312480.0, ans=0.1
2023-10-10 09:22:05,074 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=312573.3333333333, ans=0.125
2023-10-10 09:22:06,803 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.62 vs. limit=22.5
2023-10-10 09:22:07,559 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=312573.3333333333, ans=0.2
2023-10-10 09:22:13,446 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.36 vs. limit=8.0
2023-10-10 09:22:27,502 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=312666.6666666667, ans=0.0
2023-10-10 09:22:41,329 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.10 vs. limit=15.0
2023-10-10 09:22:42,324 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.07 vs. limit=15.0
2023-10-10 09:23:05,728 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=312853.3333333333, ans=0.0
2023-10-10 09:23:06,544 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=312853.3333333333, ans=0.0
2023-10-10 09:23:19,488 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.843e+02 1.953e+02 2.196e+02 2.912e+02, threshold=3.906e+02, percent-clipped=0.0
2023-10-10 09:23:34,872 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.41 vs. limit=12.0
2023-10-10 09:23:36,401 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=312993.3333333333, ans=0.125
2023-10-10 09:23:42,085 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=312993.3333333333, ans=0.05
2023-10-10 09:24:04,192 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=313086.6666666667, ans=0.0
2023-10-10 09:24:33,946 INFO [train.py:1031] (0/4) Epoch 5, batch 12500, loss[loss=0.2362, simple_loss=0.3202, pruned_loss=0.07607, over 16966.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3131, pruned_loss=0.07405, over 32715439.82 frames. ], batch size: 117, lr: 7.39e-03, grad_scale: 32.0
2023-10-10 09:24:45,925 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=313273.3333333333, ans=10.0
2023-10-10 09:24:52,072 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=313273.3333333333, ans=0.0
2023-10-10 09:24:52,398 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.51 vs. limit=15.0
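The Whitening lines compare a whiteness metric of some activation's channel covariance against a limit; a metric above the limit triggers a corrective gradient that pushes the covariance back toward isotropy. One plausible metric, the ratio of the mean squared covariance eigenvalue to the squared mean eigenvalue, is sketched below; this definition is an assumption for illustration, not necessarily the one in scaling.py:

    import torch

    def whiteness_metric(x: torch.Tensor) -> torch.Tensor:
        # x: (num_frames, num_channels). Returns 1.0 for a perfectly white
        # (isotropic) covariance and grows as the eigenvalue spread grows.
        x = x - x.mean(dim=0, keepdim=True)
        cov = (x.T @ x) / x.shape[0]
        eigs = torch.linalg.eigvalsh(cov)
        return (eigs ** 2).mean() / eigs.mean().clamp(min=1e-20) ** 2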
2023-10-10 09:24:57,444 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=313320.0, ans=0.0
2023-10-10 09:24:59,403 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=313320.0, ans=0.125
2023-10-10 09:25:10,507 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.417e+02 1.744e+02 1.955e+02 2.361e+02 3.239e+02, threshold=3.909e+02, percent-clipped=0.0
2023-10-10 09:25:26,115 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=313460.0, ans=0.0
2023-10-10 09:25:44,936 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=313506.6666666667, ans=0.125
2023-10-10 09:25:44,939 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=313506.6666666667, ans=0.125
2023-10-10 09:25:48,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=313553.3333333333, ans=0.2
2023-10-10 09:26:15,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=313646.6666666667, ans=0.2
2023-10-10 09:26:22,499 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.09 vs. limit=15.0
2023-10-10 09:26:29,881 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=313693.3333333333, ans=0.2
2023-10-10 09:26:35,356 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.87 vs. limit=12.0
2023-10-10 09:26:36,913 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=313740.0, ans=0.2
2023-10-10 09:26:46,041 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.93 vs. limit=15.0
2023-10-10 09:26:59,828 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=313833.3333333333, ans=0.125
2023-10-10 09:27:00,440 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.305e+02 1.697e+02 1.906e+02 2.139e+02 3.055e+02, threshold=3.813e+02, percent-clipped=0.0
2023-10-10 09:27:44,606 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=314020.0, ans=0.09899494936611666
2023-10-10 09:27:51,086 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=314066.6666666667, ans=0.125
2023-10-10 09:28:33,949 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=314253.3333333333, ans=0.0
2023-10-10 09:28:43,442 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=314300.0, ans=0.125
2023-10-10 09:28:49,771 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.756e+02 2.008e+02 2.268e+02 4.172e+02, threshold=4.016e+02, percent-clipped=2.0
2023-10-10 09:28:51,075 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=314300.0, ans=0.125
2023-10-10 09:28:55,338 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=314346.6666666667, ans=0.0
2023-10-10 09:28:59,274 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.89 vs. limit=22.5
2023-10-10 09:29:00,153 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.80 vs. limit=15.0
2023-10-10 09:29:25,224 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.32 vs. limit=15.0
2023-10-10 09:29:35,869 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=314486.6666666667, ans=0.125
2023-10-10 09:29:43,156 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=314533.3333333333, ans=0.125
2023-10-10 09:29:52,648 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=314580.0, ans=0.125
2023-10-10 09:30:00,447 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=314626.6666666667, ans=0.125
2023-10-10 09:30:14,232 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=314673.3333333333, ans=0.0
2023-10-10 09:30:19,277 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=314720.0, ans=0.0
2023-10-10 09:30:27,045 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=314720.0, ans=0.1
2023-10-10 09:30:29,103 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.77 vs. limit=15.0
2023-10-10 09:30:29,724 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=314766.6666666667, ans=10.0
2023-10-10 09:30:35,386 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.484e+02 1.754e+02 1.941e+02 2.289e+02 4.469e+02, threshold=3.883e+02, percent-clipped=1.0
2023-10-10 09:30:41,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=314813.3333333333, ans=0.125
2023-10-10 09:31:19,054 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=314953.3333333333, ans=0.0
2023-10-10 09:31:30,616 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=315000.0, ans=0.125
2023-10-10 09:31:31,804 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.06 vs. limit=22.5
2023-10-10 09:31:51,489 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.49 vs. limit=5.0
2023-10-10 09:31:52,945 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=315093.3333333333, ans=10.0
2023-10-10 09:32:09,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=315186.6666666667, ans=0.0
2023-10-10 09:32:22,026 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=315233.3333333333, ans=0.125
2023-10-10 09:32:22,034 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=315233.3333333333, ans=0.125
2023-10-10 09:32:22,158 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.57 vs. limit=15.0
2023-10-10 09:32:25,127 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=315233.3333333333, ans=0.1
2023-10-10 09:32:26,749 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.699e+02 1.925e+02 2.188e+02 3.283e+02, threshold=3.849e+02, percent-clipped=0.0
2023-10-10 09:33:04,973 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=315420.0, ans=0.05
2023-10-10 09:33:08,673 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=315420.0, ans=0.0
2023-10-10 09:33:23,240 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.07 vs. limit=10.0
2023-10-10 09:33:32,302 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=315513.3333333333, ans=0.125
2023-10-10 09:33:34,875 INFO [train.py:1031] (0/4) Epoch 5, batch 13000, loss[loss=0.2089, simple_loss=0.2972, pruned_loss=0.06027, over 16372.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.314, pruned_loss=0.07427, over 32719399.37 frames. ], batch size: 50, lr: 7.36e-03, grad_scale: 32.0
2023-10-10 09:33:59,407 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.81 vs. limit=6.0
2023-10-10 09:34:00,820 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=315653.3333333333, ans=0.0
2023-10-10 09:34:03,609 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=315653.3333333333, ans=0.125
2023-10-10 09:34:04,709 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=315653.3333333333, ans=0.0
2023-10-10 09:34:12,887 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.72 vs. limit=10.0
2023-10-10 09:34:17,660 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.15 vs. limit=15.0
2023-10-10 09:34:19,738 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.417e+02 1.805e+02 1.942e+02 2.223e+02 3.286e+02, threshold=3.883e+02, percent-clipped=0.0
2023-10-10 09:34:23,320 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.55 vs. limit=15.0
2023-10-10 09:34:34,118 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=315746.6666666667, ans=0.1
2023-10-10 09:34:43,370 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=315793.3333333333, ans=0.125
2023-10-10 09:34:58,733 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=315840.0, ans=0.0
2023-10-10 09:34:59,807 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=315886.6666666667, ans=0.125
2023-10-10 09:35:17,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=315933.3333333333, ans=0.125
2023-10-10 09:35:21,661 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=315980.0, ans=0.0
2023-10-10 09:35:21,928 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.30 vs. limit=22.5
2023-10-10 09:35:25,497 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=315980.0, ans=0.125
2023-10-10 09:35:26,665 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.93 vs. limit=15.0
2023-10-10 09:35:30,757 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=315980.0, ans=22.5
2023-10-10 09:35:37,550 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=316026.6666666667, ans=0.125
2023-10-10 09:35:41,230 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=316026.6666666667, ans=0.0
2023-10-10 09:35:58,345 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=316120.0, ans=0.2
2023-10-10 09:36:10,608 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 1.786e+02 2.080e+02 2.370e+02 3.823e+02, threshold=4.159e+02, percent-clipped=0.0
2023-10-10 09:36:11,860 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=316166.6666666667, ans=0.1
2023-10-10 09:36:12,183 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=5.51 vs. limit=12.0
2023-10-10 09:36:22,285 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.85 vs. limit=15.0
2023-10-10 09:36:33,962 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.53 vs. limit=15.0
2023-10-10 09:36:47,226 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.60 vs. limit=10.0
2023-10-10 09:37:10,863 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=316400.0, ans=0.125
2023-10-10 09:37:16,113 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=316446.6666666667, ans=0.125
2023-10-10 09:37:18,803 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=316446.6666666667, ans=0.125
2023-10-10 09:37:41,102 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=316540.0, ans=0.0
2023-10-10 09:37:49,246 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=316586.6666666667, ans=0.1
2023-10-10 09:37:53,805 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=316586.6666666667, ans=0.125
2023-10-10 09:38:04,441 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.477e+02 1.704e+02 1.903e+02 2.308e+02 2.940e+02, threshold=3.806e+02, percent-clipped=0.0
2023-10-10 09:38:19,810 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=316726.6666666667, ans=0.125
2023-10-10 09:38:35,303 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=316773.3333333333, ans=0.1
2023-10-10 09:38:47,886 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.86 vs. limit=6.0
2023-10-10 09:38:48,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=316820.0, ans=0.1
2023-10-10 09:38:49,611 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=316820.0, ans=0.125
2023-10-10 09:38:57,612 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.69 vs. limit=22.5
2023-10-10 09:39:03,855 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.32 vs. limit=15.0
2023-10-10 09:39:08,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=316913.3333333333, ans=0.0
2023-10-10 09:39:12,100 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=316913.3333333333, ans=0.2
2023-10-10 09:39:13,479 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=316913.3333333333, ans=0.025
2023-10-10 09:39:17,156 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=316913.3333333333, ans=0.0
2023-10-10 09:39:34,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=317006.6666666667, ans=0.125
2023-10-10 09:39:46,221 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.87 vs. limit=6.0
2023-10-10 09:39:51,529 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=317100.0, ans=0.125
2023-10-10 09:39:57,755 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.506e+02 1.923e+02 2.163e+02 2.420e+02 3.729e+02, threshold=4.325e+02, percent-clipped=0.0
2023-10-10 09:40:07,980 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=317146.6666666667, ans=0.1
2023-10-10 09:40:09,206 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=317146.6666666667, ans=0.125
2023-10-10 09:40:11,044 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=317193.3333333333, ans=0.125
2023-10-10 09:40:28,860 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=317240.0, ans=0.125
2023-10-10 09:40:30,609 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=317240.0, ans=0.125
2023-10-10 09:40:50,231 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=317333.3333333333, ans=0.2
2023-10-10 09:40:54,498 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=317380.0, ans=0.0
2023-10-10 09:40:58,461 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=317380.0, ans=0.09899494936611666
2023-10-10 09:41:14,684 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=317426.6666666667, ans=0.0
2023-10-10 09:41:19,092 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.59 vs. limit=15.0
2023-10-10 09:41:40,094 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.61 vs. limit=15.0
2023-10-10 09:41:46,161 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.501e+02 1.829e+02 2.127e+02 2.596e+02 4.066e+02, threshold=4.254e+02, percent-clipped=0.0
2023-10-10 09:41:47,526 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=317566.6666666667, ans=0.125
2023-10-10 09:41:48,852 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=317566.6666666667, ans=0.0
2023-10-10 09:41:50,663 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=317613.3333333333, ans=0.0
2023-10-10 09:41:56,588 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=317613.3333333333, ans=0.125
2023-10-10 09:41:56,643 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=317613.3333333333, ans=0.125
2023-10-10 09:42:06,534 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=317660.0, ans=0.125
2023-10-10 09:42:06,588 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=317660.0, ans=0.1
2023-10-10 09:42:07,480 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=317660.0, ans=0.125
2023-10-10 09:42:13,665 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=317706.6666666667, ans=0.0
2023-10-10 09:42:19,098 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=317706.6666666667, ans=0.0
2023-10-10 09:42:46,332 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=317846.6666666667, ans=0.125
2023-10-10 09:42:52,204 INFO [train.py:1031] (0/4) Epoch 5, batch 13500, loss[loss=0.2717, simple_loss=0.3353, pruned_loss=0.104, over 15427.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3131, pruned_loss=0.07379, over 32746711.65 frames. ], batch size: 35, lr: 7.33e-03, grad_scale: 32.0
2023-10-10 09:42:54,974 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=317893.3333333333, ans=0.125
2023-10-10 09:43:09,741 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=317940.0, ans=0.0
2023-10-10 09:43:31,324 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.332e+02 1.700e+02 1.878e+02 2.171e+02 3.207e+02, threshold=3.756e+02, percent-clipped=0.0
2023-10-10 09:43:31,552 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=318033.3333333333, ans=0.0
2023-10-10 09:43:33,947 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.96 vs. limit=22.5
2023-10-10 09:43:41,730 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.38 vs. limit=15.0
limit=15.0 2023-10-10 09:43:45,897 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.29 vs. limit=6.0 2023-10-10 09:43:53,866 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=318126.6666666667, ans=0.0 2023-10-10 09:43:56,581 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=318173.3333333333, ans=0.125 2023-10-10 09:43:58,625 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-10 09:44:04,948 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=318173.3333333333, ans=0.2 2023-10-10 09:44:09,601 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=318220.0, ans=0.0 2023-10-10 09:44:43,833 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=318360.0, ans=0.1 2023-10-10 09:44:49,588 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=318360.0, ans=0.125 2023-10-10 09:45:02,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=318453.3333333333, ans=0.2 2023-10-10 09:45:14,637 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.804e+02 1.958e+02 2.191e+02 3.104e+02, threshold=3.916e+02, percent-clipped=0.0 2023-10-10 09:45:27,302 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=318593.3333333333, ans=0.125 2023-10-10 09:45:32,910 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/epoch-5.pt 2023-10-10 09:45:59,580 INFO [train.py:1031] (0/4) Epoch 6, batch 0, loss[loss=0.2064, simple_loss=0.2933, pruned_loss=0.05972, over 16959.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.2933, pruned_loss=0.05972, over 16959.00 frames. ], batch size: 93, lr: 6.59e-03, grad_scale: 32.0 2023-10-10 09:45:59,581 INFO [train.py:1054] (0/4) Computing validation loss 2023-10-10 09:46:07,216 INFO [train.py:1063] (0/4) Epoch 6, validation: loss=0.2342, simple_loss=0.321, pruned_loss=0.07365, over 1020973.00 frames. 
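[Annotation] The epoch-6 validation record above logs loss, simple_loss, and pruned_loss separately. The values are consistent with the usual icefall pruned-transducer recombination after warm-up, loss = simple_loss_scale * simple_loss + pruned_loss, taking simple_loss_scale = 0.5 (an assumed recipe value, not read back from this log). A minimal sanity-check sketch under that assumption:

```python
# Sketch, assuming the post-warm-up recombination
#   loss = SIMPLE_LOSS_SCALE * simple_loss + pruned_loss
# with an assumed SIMPLE_LOSS_SCALE of 0.5.
SIMPLE_LOSS_SCALE = 0.5

def total_loss(simple_loss: float, pruned_loss: float) -> float:
    """Recombine the two logged components into the reported total."""
    return SIMPLE_LOSS_SCALE * simple_loss + pruned_loss

# Validation record above: loss=0.2342, simple_loss=0.321, pruned_loss=0.07365
assert abs(total_loss(0.321, 0.07365) - 0.2342) < 5e-4
# The running averages satisfy the same identity, e.g. the epoch-5 tot_loss
# entry: 0.5 * 0.3131 + 0.07379 ~= 0.2303.
```

The same identity holds for the tot_loss fields, so the components appear to be logged alongside the recombined total rather than re-derived from it.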
2023-10-10 09:46:07,216 INFO [train.py:1064] (0/4) Maximum memory allocated so far is 17165MB 2023-10-10 09:46:11,383 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=318616.6666666667, ans=0.0 2023-10-10 09:46:20,711 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=318663.3333333333, ans=0.125 2023-10-10 09:46:21,670 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=318663.3333333333, ans=0.0 2023-10-10 09:46:35,896 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=318710.0, ans=0.125 2023-10-10 09:46:35,936 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=318710.0, ans=0.2 2023-10-10 09:46:41,641 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=318756.6666666667, ans=0.125 2023-10-10 09:46:50,208 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=318756.6666666667, ans=0.125 2023-10-10 09:46:52,036 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=318803.3333333333, ans=0.2 2023-10-10 09:46:56,905 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.91 vs. limit=22.5 2023-10-10 09:47:09,992 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=318850.0, ans=0.07 2023-10-10 09:47:23,874 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=318896.6666666667, ans=0.125 2023-10-10 09:47:30,106 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=318943.3333333333, ans=0.1 2023-10-10 09:47:38,781 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=318990.0, ans=0.1 2023-10-10 09:47:40,047 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.320e+02 1.657e+02 1.844e+02 2.142e+02 3.757e+02, threshold=3.688e+02, percent-clipped=0.0 2023-10-10 09:47:45,973 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=318990.0, ans=0.125 2023-10-10 09:47:53,913 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=319036.6666666667, ans=0.0 2023-10-10 09:48:07,572 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.46 vs. limit=15.0 2023-10-10 09:48:16,406 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.05 vs. 
limit=15.0 2023-10-10 09:48:52,656 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=319270.0, ans=0.0 2023-10-10 09:49:21,232 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=319410.0, ans=0.2 2023-10-10 09:49:24,460 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.48 vs. limit=15.0 2023-10-10 09:49:26,828 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.17 vs. limit=12.0 2023-10-10 09:49:27,297 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.317e+02 1.653e+02 1.834e+02 2.114e+02 3.524e+02, threshold=3.667e+02, percent-clipped=0.0 2023-10-10 09:49:58,029 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=319596.6666666667, ans=0.0 2023-10-10 09:50:03,956 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=319596.6666666667, ans=0.0 2023-10-10 09:50:38,027 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=319736.6666666667, ans=0.2 2023-10-10 09:50:42,298 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=319736.6666666667, ans=10.0 2023-10-10 09:50:52,268 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=319783.3333333333, ans=0.1 2023-10-10 09:50:58,058 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=319830.0, ans=0.125 2023-10-10 09:51:19,557 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.375e+02 1.707e+02 1.930e+02 2.182e+02 3.631e+02, threshold=3.860e+02, percent-clipped=0.0 2023-10-10 09:51:20,863 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=319923.3333333333, ans=0.0 2023-10-10 09:51:22,802 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=319923.3333333333, ans=0.2 2023-10-10 09:51:37,966 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=319970.0, ans=0.1 2023-10-10 09:51:42,856 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=320016.6666666667, ans=0.125 2023-10-10 09:51:59,847 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=320063.3333333333, ans=0.125 2023-10-10 09:52:16,101 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.47 vs. limit=5.0 2023-10-10 09:52:16,489 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=320156.6666666667, ans=0.125 2023-10-10 09:52:17,744 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.86 vs. 
limit=15.0 2023-10-10 09:52:22,370 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=320156.6666666667, ans=0.125 2023-10-10 09:52:30,928 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=320203.3333333333, ans=0.125 2023-10-10 09:52:55,298 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=320343.3333333333, ans=0.1 2023-10-10 09:53:06,754 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.707e+02 1.948e+02 2.239e+02 3.046e+02, threshold=3.895e+02, percent-clipped=0.0 2023-10-10 09:53:07,965 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=320390.0, ans=0.0 2023-10-10 09:53:08,019 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=320390.0, ans=0.0 2023-10-10 09:53:14,507 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=320390.0, ans=0.125 2023-10-10 09:53:29,550 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=320483.3333333333, ans=0.2 2023-10-10 09:53:44,044 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=320530.0, ans=10.0 2023-10-10 09:53:58,468 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=320576.6666666667, ans=0.125 2023-10-10 09:54:03,028 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=320623.3333333333, ans=0.125 2023-10-10 09:54:22,702 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=320716.6666666667, ans=0.1 2023-10-10 09:54:22,902 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.86 vs. limit=15.0 2023-10-10 09:54:58,598 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.721e+02 1.871e+02 1.985e+02 2.940e+02, threshold=3.741e+02, percent-clipped=0.0 2023-10-10 09:55:21,238 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=320950.0, ans=0.0 2023-10-10 09:55:21,727 INFO [train.py:1031] (0/4) Epoch 6, batch 500, loss[loss=0.2062, simple_loss=0.2913, pruned_loss=0.06058, over 16834.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3107, pruned_loss=0.07191, over 7285835.74 frames. ], batch size: 146, lr: 6.56e-03, grad_scale: 32.0 2023-10-10 09:55:30,954 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.27 vs. 
limit=15.0 2023-10-10 09:55:32,475 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=320996.6666666667, ans=0.2 2023-10-10 09:55:44,620 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=321043.3333333333, ans=22.5 2023-10-10 09:55:49,537 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=321043.3333333333, ans=0.0 2023-10-10 09:56:16,020 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=321183.3333333333, ans=0.2 2023-10-10 09:56:41,870 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.45 vs. limit=22.5 2023-10-10 09:56:49,913 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.832e+02 1.980e+02 2.406e+02 3.202e+02, threshold=3.960e+02, percent-clipped=0.0 2023-10-10 09:56:57,766 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=321323.3333333333, ans=0.125 2023-10-10 09:57:03,564 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.75 vs. limit=15.0 2023-10-10 09:57:04,222 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=321370.0, ans=0.125 2023-10-10 09:57:13,470 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.81 vs. limit=6.0 2023-10-10 09:57:19,540 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=321463.3333333333, ans=0.0 2023-10-10 09:57:25,225 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=321463.3333333333, ans=22.5 2023-10-10 09:57:26,860 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=321463.3333333333, ans=0.125 2023-10-10 09:57:29,229 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=321463.3333333333, ans=0.07 2023-10-10 09:57:39,072 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=321510.0, ans=0.125 2023-10-10 09:57:50,570 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.24 vs. 
limit=15.0 2023-10-10 09:57:56,926 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=321603.3333333333, ans=0.125 2023-10-10 09:57:57,802 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=321603.3333333333, ans=0.125 2023-10-10 09:58:02,822 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=321650.0, ans=0.0 2023-10-10 09:58:15,692 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=321696.6666666667, ans=0.05 2023-10-10 09:58:17,829 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=321696.6666666667, ans=0.125 2023-10-10 09:58:36,474 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.785e+02 2.073e+02 2.409e+02 3.539e+02, threshold=4.145e+02, percent-clipped=0.0 2023-10-10 09:58:42,409 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=321790.0, ans=0.0 2023-10-10 09:58:42,570 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=321790.0, ans=0.0 2023-10-10 09:58:56,022 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=321883.3333333333, ans=0.2 2023-10-10 09:59:22,330 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.71 vs. limit=6.0 2023-10-10 09:59:25,531 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=322023.3333333333, ans=0.1 2023-10-10 09:59:25,635 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=322023.3333333333, ans=0.0 2023-10-10 09:59:43,212 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=322070.0, ans=0.0 2023-10-10 09:59:52,621 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=322116.6666666667, ans=0.1 2023-10-10 09:59:57,882 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=322116.6666666667, ans=0.07 2023-10-10 10:00:06,357 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.44 vs. limit=22.5 2023-10-10 10:00:25,978 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.766e+02 2.044e+02 2.347e+02 3.652e+02, threshold=4.089e+02, percent-clipped=0.0 2023-10-10 10:00:28,342 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=322256.6666666667, ans=0.0 2023-10-10 10:00:58,734 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.88 vs. 
limit=15.0 2023-10-10 10:01:02,015 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=322396.6666666667, ans=0.125 2023-10-10 10:01:03,874 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=322396.6666666667, ans=0.2 2023-10-10 10:01:05,574 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=322396.6666666667, ans=0.125 2023-10-10 10:01:22,970 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=322490.0, ans=0.1 2023-10-10 10:01:49,315 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.50 vs. limit=15.0 2023-10-10 10:02:00,567 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.14 vs. limit=22.5 2023-10-10 10:02:23,443 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.386e+02 1.734e+02 1.942e+02 2.225e+02 3.170e+02, threshold=3.885e+02, percent-clipped=0.0 2023-10-10 10:02:26,119 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=322723.3333333333, ans=0.1 2023-10-10 10:02:42,109 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=322816.6666666667, ans=0.05 2023-10-10 10:02:42,162 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=322816.6666666667, ans=0.125 2023-10-10 10:02:46,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=322816.6666666667, ans=0.025 2023-10-10 10:02:48,063 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=322816.6666666667, ans=0.125 2023-10-10 10:02:53,137 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=322863.3333333333, ans=0.0 2023-10-10 10:03:06,198 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=322910.0, ans=0.125 2023-10-10 10:03:15,245 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=322910.0, ans=0.04949747468305833 2023-10-10 10:03:19,971 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=322956.6666666667, ans=0.125 2023-10-10 10:03:22,006 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.52 vs. 
limit=15.0 2023-10-10 10:03:23,821 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=322956.6666666667, ans=0.5 2023-10-10 10:03:28,094 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=323003.3333333333, ans=0.2 2023-10-10 10:03:47,468 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=323050.0, ans=0.125 2023-10-10 10:03:50,387 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=323050.0, ans=0.0 2023-10-10 10:04:12,707 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=323143.3333333333, ans=0.125 2023-10-10 10:04:16,932 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.403e+02 1.682e+02 1.886e+02 2.092e+02 3.152e+02, threshold=3.772e+02, percent-clipped=0.0 2023-10-10 10:04:38,407 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=323283.3333333333, ans=0.125 2023-10-10 10:04:39,135 INFO [train.py:1031] (0/4) Epoch 6, batch 1000, loss[loss=0.2252, simple_loss=0.3058, pruned_loss=0.07234, over 16608.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3119, pruned_loss=0.07259, over 12933614.30 frames. ], batch size: 241, lr: 6.54e-03, grad_scale: 32.0 2023-10-10 10:04:39,386 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=323283.3333333333, ans=0.125 2023-10-10 10:04:48,171 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=323283.3333333333, ans=0.0 2023-10-10 10:04:58,888 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=323330.0, ans=0.2 2023-10-10 10:05:00,994 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.56 vs. limit=15.0 2023-10-10 10:05:29,041 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=323470.0, ans=0.1 2023-10-10 10:05:37,862 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=323516.6666666667, ans=0.125 2023-10-10 10:05:42,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=323563.3333333333, ans=0.125 2023-10-10 10:05:43,404 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.16 vs. 
limit=15.0 2023-10-10 10:05:45,643 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=323563.3333333333, ans=0.2 2023-10-10 10:06:04,592 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.445e+02 1.757e+02 2.017e+02 2.290e+02 3.041e+02, threshold=4.034e+02, percent-clipped=0.0 2023-10-10 10:06:30,064 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 10:07:01,077 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=323843.3333333333, ans=0.2 2023-10-10 10:07:16,470 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=323936.6666666667, ans=0.0 2023-10-10 10:07:22,677 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=323936.6666666667, ans=0.125 2023-10-10 10:07:22,688 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=323936.6666666667, ans=0.1 2023-10-10 10:07:23,845 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.04 vs. limit=15.0 2023-10-10 10:07:25,060 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=323983.3333333333, ans=0.125 2023-10-10 10:07:29,626 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.36 vs. limit=15.0 2023-10-10 10:07:42,691 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=324030.0, ans=0.2 2023-10-10 10:08:01,773 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 10:08:05,051 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.739e+02 1.896e+02 2.112e+02 3.425e+02, threshold=3.791e+02, percent-clipped=0.0 2023-10-10 10:08:05,371 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=324123.3333333333, ans=0.125 2023-10-10 10:08:31,162 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=324216.6666666667, ans=0.0 2023-10-10 10:08:34,682 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=324216.6666666667, ans=0.0 2023-10-10 10:08:37,323 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=324263.3333333333, ans=0.125 2023-10-10 10:08:54,908 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=324310.0, ans=0.1 2023-10-10 10:09:02,925 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=324356.6666666667, ans=0.1 2023-10-10 10:09:04,799 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=324356.6666666667, ans=0.125 2023-10-10 10:09:06,676 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, 
num_groups=1, num_channels=384, metric=2.26 vs. limit=15.0 2023-10-10 10:09:32,423 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=324496.6666666667, ans=0.0 2023-10-10 10:09:48,347 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.50 vs. limit=15.0 2023-10-10 10:09:51,135 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.345e+02 1.673e+02 1.943e+02 2.176e+02 2.989e+02, threshold=3.885e+02, percent-clipped=0.0 2023-10-10 10:09:59,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=324636.6666666667, ans=0.125 2023-10-10 10:10:04,076 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.45 vs. limit=10.0 2023-10-10 10:10:13,059 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.61 vs. limit=15.0 2023-10-10 10:10:32,246 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=324776.6666666667, ans=0.125 2023-10-10 10:10:43,786 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=324823.3333333333, ans=0.125 2023-10-10 10:10:50,381 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=324823.3333333333, ans=0.125 2023-10-10 10:10:56,464 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=324870.0, ans=0.0 2023-10-10 10:11:03,755 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=324870.0, ans=0.2 2023-10-10 10:11:06,760 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=324916.6666666667, ans=0.2 2023-10-10 10:11:33,418 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=325010.0, ans=0.0 2023-10-10 10:11:41,078 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.343e+02 1.687e+02 1.892e+02 2.269e+02 3.526e+02, threshold=3.783e+02, percent-clipped=0.0 2023-10-10 10:11:41,656 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.03 vs. limit=15.0 2023-10-10 10:11:42,281 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=325056.6666666667, ans=0.07 2023-10-10 10:11:44,841 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=325056.6666666667, ans=0.0 2023-10-10 10:12:00,823 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=325150.0, ans=0.2 2023-10-10 10:12:09,129 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.87 vs. 
limit=12.0 2023-10-10 10:12:20,137 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 10:12:45,793 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=325336.6666666667, ans=0.125 2023-10-10 10:12:58,087 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.43 vs. limit=15.0 2023-10-10 10:12:58,815 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=325383.3333333333, ans=0.125 2023-10-10 10:13:12,004 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=325430.0, ans=0.125 2023-10-10 10:13:12,065 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=325430.0, ans=0.125 2023-10-10 10:13:12,755 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=325430.0, ans=0.125 2023-10-10 10:13:32,791 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.425e+02 1.812e+02 2.015e+02 2.234e+02 3.774e+02, threshold=4.030e+02, percent-clipped=0.0 2023-10-10 10:13:36,481 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=325523.3333333333, ans=0.125 2023-10-10 10:13:56,595 INFO [train.py:1031] (0/4) Epoch 6, batch 1500, loss[loss=0.2206, simple_loss=0.304, pruned_loss=0.06858, over 16097.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3096, pruned_loss=0.07106, over 17350835.28 frames. ], batch size: 296, lr: 6.51e-03, grad_scale: 32.0 2023-10-10 10:14:02,283 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=325616.6666666667, ans=0.125 2023-10-10 10:14:16,486 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=325663.3333333333, ans=0.125 2023-10-10 10:14:39,301 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.64 vs. limit=15.0 2023-10-10 10:14:40,699 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=325803.3333333333, ans=0.1 2023-10-10 10:14:44,515 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.58 vs. 
limit=22.5 2023-10-10 10:14:46,704 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=325803.3333333333, ans=0.125 2023-10-10 10:14:47,450 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=325803.3333333333, ans=0.2 2023-10-10 10:14:52,476 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=325850.0, ans=0.07 2023-10-10 10:15:26,374 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.253e+02 1.722e+02 1.938e+02 2.277e+02 3.511e+02, threshold=3.877e+02, percent-clipped=0.0 2023-10-10 10:15:36,767 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=326036.6666666667, ans=0.2 2023-10-10 10:15:52,641 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.84 vs. limit=12.0 2023-10-10 10:16:00,644 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=326130.0, ans=0.1 2023-10-10 10:16:00,661 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=326130.0, ans=0.0 2023-10-10 10:16:03,311 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=326130.0, ans=0.125 2023-10-10 10:16:03,565 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.21 vs. limit=15.0 2023-10-10 10:16:06,265 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=326130.0, ans=0.125 2023-10-10 10:16:55,826 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=326316.6666666667, ans=0.125 2023-10-10 10:17:20,485 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.691e+02 1.880e+02 2.078e+02 2.981e+02, threshold=3.760e+02, percent-clipped=0.0 2023-10-10 10:17:23,135 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.84 vs. limit=12.0 2023-10-10 10:17:25,115 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.60 vs. limit=6.0 2023-10-10 10:17:27,256 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=326456.6666666667, ans=0.1 2023-10-10 10:17:46,263 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=326550.0, ans=0.0 2023-10-10 10:18:07,531 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=326643.3333333333, ans=0.2 2023-10-10 10:18:28,628 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.41 vs. 
limit=15.0 2023-10-10 10:18:31,289 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=326736.6666666667, ans=0.1 2023-10-10 10:18:34,132 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.60 vs. limit=15.0 2023-10-10 10:18:38,089 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=326783.3333333333, ans=0.125 2023-10-10 10:18:43,894 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=326830.0, ans=0.125 2023-10-10 10:18:44,669 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=326830.0, ans=0.125 2023-10-10 10:18:59,581 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=326876.6666666667, ans=0.125 2023-10-10 10:19:06,876 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.736e+02 1.903e+02 2.095e+02 2.601e+02, threshold=3.807e+02, percent-clipped=0.0 2023-10-10 10:19:08,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=326923.3333333333, ans=0.1 2023-10-10 10:19:30,281 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=327016.6666666667, ans=0.025 2023-10-10 10:19:32,221 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=327016.6666666667, ans=0.125 2023-10-10 10:20:01,672 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.48 vs. limit=22.5 2023-10-10 10:20:39,021 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.84 vs. limit=15.0 2023-10-10 10:20:58,452 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.359e+02 1.676e+02 1.844e+02 2.137e+02 2.875e+02, threshold=3.689e+02, percent-clipped=0.0 2023-10-10 10:21:21,264 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=327483.3333333333, ans=0.125 2023-10-10 10:21:26,530 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=327483.3333333333, ans=0.0 2023-10-10 10:21:39,238 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.23 vs. limit=12.0 2023-10-10 10:21:40,884 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=327576.6666666667, ans=0.125 2023-10-10 10:21:51,311 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=327623.3333333333, ans=0.125 2023-10-10 10:21:55,790 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=327623.3333333333, ans=0.125 2023-10-10 10:22:09,794 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.55 vs. 
limit=22.5 2023-10-10 10:22:19,571 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=327716.6666666667, ans=0.0 2023-10-10 10:22:56,165 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.33 vs. limit=22.5 2023-10-10 10:22:58,620 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.278e+02 1.663e+02 1.822e+02 2.048e+02 2.846e+02, threshold=3.645e+02, percent-clipped=0.0 2023-10-10 10:23:07,951 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=327903.3333333333, ans=0.0 2023-10-10 10:23:08,939 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=327903.3333333333, ans=0.0 2023-10-10 10:23:10,901 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=327903.3333333333, ans=0.2 2023-10-10 10:23:14,305 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=327903.3333333333, ans=22.5 2023-10-10 10:23:20,882 INFO [train.py:1031] (0/4) Epoch 6, batch 2000, loss[loss=0.2005, simple_loss=0.3006, pruned_loss=0.05024, over 16733.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3102, pruned_loss=0.07124, over 20754983.46 frames. ], batch size: 81, lr: 6.49e-03, grad_scale: 32.0 2023-10-10 10:23:30,541 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.46 vs. limit=15.0 2023-10-10 10:24:21,256 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=328136.6666666667, ans=0.125 2023-10-10 10:24:33,710 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=328183.3333333333, ans=0.125 2023-10-10 10:24:41,555 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.21 vs. limit=15.0 2023-10-10 10:24:46,664 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.76 vs. limit=22.5 2023-10-10 10:24:48,252 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=328276.6666666667, ans=0.125 2023-10-10 10:25:02,249 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.326e+02 1.692e+02 1.876e+02 2.056e+02 3.072e+02, threshold=3.753e+02, percent-clipped=0.0 2023-10-10 10:26:21,806 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.98 vs. 
limit=6.0 2023-10-10 10:26:42,066 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=328650.0, ans=0.125 2023-10-10 10:26:42,970 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=328650.0, ans=0.0 2023-10-10 10:26:46,486 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=328650.0, ans=0.09899494936611666 2023-10-10 10:26:52,706 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=328696.6666666667, ans=0.0 2023-10-10 10:26:53,485 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=328696.6666666667, ans=0.1 2023-10-10 10:26:58,817 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=328696.6666666667, ans=0.04949747468305833 2023-10-10 10:27:16,235 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.464e+02 1.761e+02 1.994e+02 2.373e+02 3.351e+02, threshold=3.987e+02, percent-clipped=0.0 2023-10-10 10:27:25,346 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.64 vs. limit=15.0 2023-10-10 10:27:27,986 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.51 vs. limit=10.0 2023-10-10 10:27:32,627 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=328836.6666666667, ans=0.0 2023-10-10 10:27:40,081 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=328883.3333333333, ans=0.07 2023-10-10 10:27:46,630 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=328883.3333333333, ans=0.025 2023-10-10 10:28:16,087 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=329023.3333333333, ans=0.1 2023-10-10 10:28:18,856 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.92 vs. limit=10.0 2023-10-10 10:28:22,614 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=329070.0, ans=0.125 2023-10-10 10:28:23,560 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff3.min_abs, batch_count=329070.0, ans=0.2 2023-10-10 10:28:23,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff3.min_abs, batch_count=329070.0, ans=0.2 2023-10-10 10:28:42,774 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 10:28:57,348 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=329210.0, ans=0.125 2023-10-10 10:28:58,487 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.74 vs. 
limit=6.0 2023-10-10 10:29:03,878 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.36 vs. limit=15.0 2023-10-10 10:29:06,811 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.424e+02 1.774e+02 1.956e+02 2.277e+02 3.393e+02, threshold=3.912e+02, percent-clipped=0.0 2023-10-10 10:29:20,368 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.75 vs. limit=22.5 2023-10-10 10:29:29,206 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.46 vs. limit=15.0 2023-10-10 10:29:35,001 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=329396.6666666667, ans=0.125 2023-10-10 10:30:05,284 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=329490.0, ans=0.125 2023-10-10 10:30:28,309 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.73 vs. limit=15.0 2023-10-10 10:30:29,202 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=329630.0, ans=0.0 2023-10-10 10:30:40,804 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.15 vs. limit=15.0 2023-10-10 10:30:50,985 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=329676.6666666667, ans=0.1 2023-10-10 10:30:53,824 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=329723.3333333333, ans=0.0 2023-10-10 10:30:55,344 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.375e+02 1.775e+02 1.973e+02 2.192e+02 3.529e+02, threshold=3.946e+02, percent-clipped=0.0 2023-10-10 10:30:56,115 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.58 vs. limit=10.0 2023-10-10 10:30:56,878 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=329723.3333333333, ans=0.125 2023-10-10 10:31:03,719 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=329770.0, ans=0.125 2023-10-10 10:31:41,331 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=329910.0, ans=0.125 2023-10-10 10:31:54,695 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=330003.3333333333, ans=0.0 2023-10-10 10:31:57,202 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.70 vs. 
limit=12.0 2023-10-10 10:32:05,566 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=330050.0, ans=0.0 2023-10-10 10:32:36,225 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=330190.0, ans=0.2 2023-10-10 10:32:39,655 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.499e+02 1.789e+02 1.929e+02 2.145e+02 3.593e+02, threshold=3.858e+02, percent-clipped=0.0 2023-10-10 10:32:47,134 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=330236.6666666667, ans=0.1 2023-10-10 10:32:57,379 INFO [train.py:1031] (0/4) Epoch 6, batch 2500, loss[loss=0.2098, simple_loss=0.2689, pruned_loss=0.07532, over 12418.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3109, pruned_loss=0.07169, over 23467725.49 frames. ], batch size: 440, lr: 6.47e-03, grad_scale: 32.0 2023-10-10 10:33:22,186 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=330376.6666666667, ans=0.0 2023-10-10 10:33:24,120 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=330376.6666666667, ans=0.1 2023-10-10 10:33:28,949 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=330423.3333333333, ans=0.1 2023-10-10 10:33:39,948 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=330470.0, ans=0.1 2023-10-10 10:33:44,579 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=330470.0, ans=0.0 2023-10-10 10:33:56,319 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=330516.6666666667, ans=0.07 2023-10-10 10:34:17,748 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=330610.0, ans=0.125 2023-10-10 10:34:19,502 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=330656.6666666667, ans=0.0 2023-10-10 10:34:23,155 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.814e+02 1.991e+02 2.240e+02 3.669e+02, threshold=3.983e+02, percent-clipped=0.0 2023-10-10 10:34:26,016 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=330656.6666666667, ans=0.0 2023-10-10 10:34:35,794 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.88 vs. limit=15.0 2023-10-10 10:34:36,484 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=330703.3333333333, ans=0.125 2023-10-10 10:34:57,761 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=330796.6666666667, ans=0.125 2023-10-10 10:34:59,280 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.46 vs. 
limit=15.0 2023-10-10 10:35:12,290 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=330843.3333333333, ans=0.125 2023-10-10 10:35:26,463 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=330936.6666666667, ans=0.2 2023-10-10 10:35:42,554 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.15 vs. limit=22.5 2023-10-10 10:35:43,121 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=330983.3333333333, ans=10.0 2023-10-10 10:35:44,162 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=331030.0, ans=0.125 2023-10-10 10:35:51,778 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.47 vs. limit=15.0 2023-10-10 10:35:55,125 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.49 vs. limit=8.0 2023-10-10 10:35:59,271 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=331076.6666666667, ans=0.125 2023-10-10 10:36:02,734 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-10 10:36:08,561 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=331123.3333333333, ans=0.0 2023-10-10 10:36:09,247 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.382e+02 1.738e+02 2.130e+02 2.615e+02 4.705e+02, threshold=4.259e+02, percent-clipped=1.0 2023-10-10 10:36:29,188 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=331170.0, ans=0.09899494936611666 2023-10-10 10:36:34,249 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=331216.6666666667, ans=0.2 2023-10-10 10:36:44,161 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=331216.6666666667, ans=0.0 2023-10-10 10:36:44,349 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.90 vs. limit=22.5 2023-10-10 10:36:57,673 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=331310.0, ans=0.125 2023-10-10 10:37:00,057 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.78 vs. limit=22.5 2023-10-10 10:37:06,177 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=331310.0, ans=0.125 2023-10-10 10:37:41,053 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=331450.0, ans=0.0 2023-10-10 10:37:46,821 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.24 vs. 
limit=22.5 2023-10-10 10:38:14,147 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.339e+02 1.723e+02 1.916e+02 2.197e+02 3.578e+02, threshold=3.831e+02, percent-clipped=0.0 2023-10-10 10:38:35,553 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=331683.3333333333, ans=0.1 2023-10-10 10:38:43,711 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.47 vs. limit=15.0 2023-10-10 10:39:00,983 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.78 vs. limit=12.0 2023-10-10 10:39:03,906 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=331776.6666666667, ans=0.2 2023-10-10 10:39:41,523 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=331916.6666666667, ans=0.0 2023-10-10 10:39:57,107 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=331963.3333333333, ans=0.125 2023-10-10 10:40:16,818 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.405e+02 1.671e+02 1.891e+02 2.249e+02 3.716e+02, threshold=3.782e+02, percent-clipped=0.0 2023-10-10 10:40:17,016 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 10:40:29,711 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=332103.3333333333, ans=0.5 2023-10-10 10:40:36,994 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=332103.3333333333, ans=0.125 2023-10-10 10:40:42,544 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.13 vs. limit=15.0 2023-10-10 10:40:58,597 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=332196.6666666667, ans=0.125 2023-10-10 10:41:02,062 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=332243.3333333333, ans=0.0 2023-10-10 10:41:11,663 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=332243.3333333333, ans=0.0 2023-10-10 10:41:31,397 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=332336.6666666667, ans=0.0 2023-10-10 10:41:33,608 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.15 vs. 
limit=15.0 2023-10-10 10:41:46,501 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=332430.0, ans=0.125 2023-10-10 10:41:55,778 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=332430.0, ans=0.125 2023-10-10 10:42:12,299 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.369e+02 1.686e+02 1.977e+02 2.244e+02 2.741e+02, threshold=3.953e+02, percent-clipped=0.0 2023-10-10 10:42:19,026 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.34 vs. limit=15.0 2023-10-10 10:42:19,707 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=332570.0, ans=0.2 2023-10-10 10:42:30,921 INFO [train.py:1031] (0/4) Epoch 6, batch 3000, loss[loss=0.2042, simple_loss=0.2965, pruned_loss=0.05599, over 16867.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3101, pruned_loss=0.0718, over 25517360.42 frames. ], batch size: 98, lr: 6.45e-03, grad_scale: 32.0 2023-10-10 10:42:43,573 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=332663.3333333333, ans=0.125 2023-10-10 10:42:49,150 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.49 vs. limit=22.5 2023-10-10 10:43:02,815 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.39 vs. limit=15.0 2023-10-10 10:43:08,370 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=332756.6666666667, ans=0.125 2023-10-10 10:43:13,374 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.32 vs. limit=15.0 2023-10-10 10:43:17,429 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=332803.3333333333, ans=0.0 2023-10-10 10:43:22,801 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.23 vs. 
limit=22.5 2023-10-10 10:43:32,053 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=332850.0, ans=0.2 2023-10-10 10:43:32,060 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=332850.0, ans=0.1 2023-10-10 10:43:32,772 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=332896.6666666667, ans=0.1 2023-10-10 10:43:51,631 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=332943.3333333333, ans=0.0 2023-10-10 10:43:56,728 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.384e+02 1.708e+02 1.886e+02 2.047e+02 3.045e+02, threshold=3.771e+02, percent-clipped=0.0 2023-10-10 10:44:21,998 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=333083.3333333333, ans=0.0 2023-10-10 10:44:35,096 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=333130.0, ans=0.125 2023-10-10 10:44:48,254 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=333176.6666666667, ans=0.125 2023-10-10 10:44:54,309 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=333223.3333333333, ans=0.125 2023-10-10 10:45:10,583 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=333270.0, ans=0.07 2023-10-10 10:45:38,589 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=333410.0, ans=0.025 2023-10-10 10:45:41,043 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.02 vs. limit=10.0 2023-10-10 10:45:42,950 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=333410.0, ans=0.1 2023-10-10 10:45:48,395 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.384e+02 1.670e+02 1.874e+02 2.211e+02 3.569e+02, threshold=3.748e+02, percent-clipped=0.0 2023-10-10 10:45:49,635 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.65 vs. 
limit=22.5 2023-10-10 10:46:00,155 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=333503.3333333333, ans=0.125 2023-10-10 10:46:15,755 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=333550.0, ans=0.0 2023-10-10 10:46:18,727 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=333596.6666666667, ans=15.0 2023-10-10 10:46:43,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=333643.3333333333, ans=0.125 2023-10-10 10:46:47,957 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=333690.0, ans=0.0 2023-10-10 10:46:49,149 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=333690.0, ans=0.125 2023-10-10 10:47:04,734 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=333736.6666666667, ans=0.0 2023-10-10 10:47:39,092 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=333876.6666666667, ans=0.125 2023-10-10 10:47:43,720 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=333876.6666666667, ans=0.1 2023-10-10 10:47:46,804 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=333923.3333333333, ans=0.125 2023-10-10 10:47:50,327 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.502e+02 1.728e+02 1.860e+02 2.157e+02 3.260e+02, threshold=3.721e+02, percent-clipped=0.0 2023-10-10 10:47:50,672 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=333923.3333333333, ans=0.125 2023-10-10 10:47:57,858 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.88 vs. limit=15.0 2023-10-10 10:47:58,589 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=333970.0, ans=0.125 2023-10-10 10:48:09,402 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=334016.6666666667, ans=0.2 2023-10-10 10:48:29,558 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=334063.3333333333, ans=0.0 2023-10-10 10:48:45,408 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=334156.6666666667, ans=0.0 2023-10-10 10:48:49,244 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.37 vs. 
limit=22.5 2023-10-10 10:48:49,882 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=334156.6666666667, ans=0.125 2023-10-10 10:48:56,994 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=334203.3333333333, ans=0.0 2023-10-10 10:49:04,542 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=334250.0, ans=0.125 2023-10-10 10:49:31,640 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=334343.3333333333, ans=0.1 2023-10-10 10:49:39,555 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.00 vs. limit=22.5 2023-10-10 10:49:41,153 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.486e+02 1.695e+02 1.895e+02 2.177e+02 3.583e+02, threshold=3.790e+02, percent-clipped=0.0 2023-10-10 10:49:47,024 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 10:49:53,829 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=334436.6666666667, ans=0.0 2023-10-10 10:49:57,449 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.58 vs. limit=15.0 2023-10-10 10:50:02,795 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=334483.3333333333, ans=0.125 2023-10-10 10:50:09,660 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.78 vs. limit=6.0 2023-10-10 10:50:14,217 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=334530.0, ans=0.09899494936611666 2023-10-10 10:50:41,675 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.22 vs. limit=15.0 2023-10-10 10:50:41,720 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.89 vs. limit=10.0 2023-10-10 10:51:02,278 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=12.0 2023-10-10 10:51:02,973 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=334716.6666666667, ans=0.0 2023-10-10 10:51:06,480 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=334716.6666666667, ans=0.1 2023-10-10 10:51:23,507 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=334810.0, ans=0.0 2023-10-10 10:51:36,866 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.720e+02 1.948e+02 2.139e+02 3.255e+02, threshold=3.895e+02, percent-clipped=0.0 2023-10-10 10:51:52,174 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.97 vs. 
limit=22.5 2023-10-10 10:51:54,193 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=334950.0, ans=0.1 2023-10-10 10:51:54,840 INFO [train.py:1031] (0/4) Epoch 6, batch 3500, loss[loss=0.2493, simple_loss=0.3304, pruned_loss=0.08406, over 15997.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3096, pruned_loss=0.07155, over 27115188.29 frames. ], batch size: 296, lr: 6.42e-03, grad_scale: 16.0 2023-10-10 10:52:09,402 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=334996.6666666667, ans=0.125 2023-10-10 10:52:15,973 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=335043.3333333333, ans=0.125 2023-10-10 10:52:25,424 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=335043.3333333333, ans=0.2 2023-10-10 10:52:28,758 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=335090.0, ans=0.0 2023-10-10 10:52:32,598 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=335090.0, ans=0.0 2023-10-10 10:52:37,260 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=335090.0, ans=0.0 2023-10-10 10:52:59,336 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=335230.0, ans=0.05 2023-10-10 10:53:18,325 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.89 vs. limit=22.5 2023-10-10 10:53:34,794 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.425e+02 1.712e+02 1.901e+02 2.198e+02 3.386e+02, threshold=3.802e+02, percent-clipped=0.0 2023-10-10 10:53:43,892 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=335370.0, ans=0.025 2023-10-10 10:54:28,849 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=335556.6666666667, ans=0.125 2023-10-10 10:54:36,311 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=335556.6666666667, ans=0.0 2023-10-10 10:54:43,459 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=335603.3333333333, ans=0.125 2023-10-10 10:54:54,577 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=335650.0, ans=0.1 2023-10-10 10:55:06,745 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.78 vs. 
limit=12.0 2023-10-10 10:55:07,477 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=335696.6666666667, ans=0.1 2023-10-10 10:55:09,623 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=335696.6666666667, ans=15.0 2023-10-10 10:55:12,582 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=335743.3333333333, ans=0.025 2023-10-10 10:55:20,888 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.99 vs. limit=15.0 2023-10-10 10:55:25,536 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=335790.0, ans=0.0 2023-10-10 10:55:27,352 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.35 vs. limit=22.5 2023-10-10 10:55:28,843 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=335790.0, ans=0.0 2023-10-10 10:55:29,442 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.723e+02 1.919e+02 2.327e+02 3.270e+02, threshold=3.837e+02, percent-clipped=0.0 2023-10-10 10:55:32,861 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=335790.0, ans=0.125 2023-10-10 10:55:34,010 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=335836.6666666667, ans=0.0 2023-10-10 10:55:37,073 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.65 vs. 
limit=15.0 2023-10-10 10:55:48,760 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=335883.3333333333, ans=0.125 2023-10-10 10:56:01,186 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=335930.0, ans=6.0 2023-10-10 10:56:10,872 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=335976.6666666667, ans=0.1 2023-10-10 10:56:13,374 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=335976.6666666667, ans=0.07 2023-10-10 10:56:15,820 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-72000.pt 2023-10-10 10:56:40,544 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=336070.0, ans=0.0 2023-10-10 10:57:10,746 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=336210.0, ans=0.1 2023-10-10 10:57:21,348 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=336210.0, ans=0.125 2023-10-10 10:57:29,756 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.334e+02 1.659e+02 1.832e+02 2.147e+02 3.516e+02, threshold=3.664e+02, percent-clipped=0.0 2023-10-10 10:57:33,806 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=336256.6666666667, ans=0.125 2023-10-10 10:57:35,702 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=336303.3333333333, ans=0.125 2023-10-10 10:58:02,566 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-10 10:58:16,826 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.05 vs. limit=15.0 2023-10-10 10:58:21,755 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=336490.0, ans=0.1 2023-10-10 10:58:36,554 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=336536.6666666667, ans=0.125 2023-10-10 10:59:21,349 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.635e+02 1.838e+02 1.989e+02 3.389e+02, threshold=3.677e+02, percent-clipped=0.0 2023-10-10 10:59:25,826 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.02 vs. 
limit=15.0 2023-10-10 10:59:27,715 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=336770.0, ans=0.125 2023-10-10 10:59:46,046 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=336816.6666666667, ans=0.07 2023-10-10 10:59:51,402 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=336863.3333333333, ans=0.0 2023-10-10 10:59:53,255 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=336863.3333333333, ans=0.125 2023-10-10 11:00:01,837 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.95 vs. limit=22.5 2023-10-10 11:00:09,012 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=336910.0, ans=0.0 2023-10-10 11:00:10,307 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=336956.6666666667, ans=0.1 2023-10-10 11:00:11,272 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=336956.6666666667, ans=0.125 2023-10-10 11:00:11,395 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=336956.6666666667, ans=0.125 2023-10-10 11:00:36,064 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.max_abs, batch_count=337050.0, ans=10.0 2023-10-10 11:00:36,913 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 11:00:37,905 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=337050.0, ans=0.125 2023-10-10 11:00:40,992 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=337096.6666666667, ans=0.125 2023-10-10 11:00:41,210 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.53 vs. limit=15.0 2023-10-10 11:00:43,523 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=337096.6666666667, ans=0.125 2023-10-10 11:00:45,422 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 11:00:47,669 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=337096.6666666667, ans=0.1 2023-10-10 11:00:57,339 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=337143.3333333333, ans=0.125 2023-10-10 11:01:09,860 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.699e+02 1.853e+02 2.048e+02 2.920e+02, threshold=3.705e+02, percent-clipped=0.0 2023-10-10 11:01:21,841 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.71 vs. 
limit=15.0 2023-10-10 11:01:23,694 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.61 vs. limit=15.0 2023-10-10 11:01:26,876 INFO [train.py:1031] (0/4) Epoch 6, batch 4000, loss[loss=0.2232, simple_loss=0.3117, pruned_loss=0.06732, over 16924.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3089, pruned_loss=0.07122, over 28402220.00 frames. ], batch size: 138, lr: 6.40e-03, grad_scale: 32.0 2023-10-10 11:02:07,373 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=337423.3333333333, ans=10.0 2023-10-10 11:02:10,287 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=337423.3333333333, ans=0.2 2023-10-10 11:02:20,404 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=337470.0, ans=0.125 2023-10-10 11:02:26,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=337516.6666666667, ans=0.2 2023-10-10 11:02:29,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=337516.6666666667, ans=0.0 2023-10-10 11:02:46,054 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.38 vs. limit=6.0 2023-10-10 11:02:59,037 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=337656.6666666667, ans=0.125 2023-10-10 11:03:00,724 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.363e+02 1.785e+02 1.962e+02 2.190e+02 3.824e+02, threshold=3.923e+02, percent-clipped=1.0 2023-10-10 11:03:20,963 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=337750.0, ans=0.1 2023-10-10 11:03:27,956 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=337796.6666666667, ans=0.1 2023-10-10 11:03:34,235 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=337796.6666666667, ans=0.1 2023-10-10 11:03:45,194 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=337843.3333333333, ans=0.125 2023-10-10 11:04:03,634 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=337936.6666666667, ans=0.2 2023-10-10 11:04:17,896 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=337983.3333333333, ans=0.125 2023-10-10 11:04:32,924 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=338030.0, ans=0.125 2023-10-10 11:04:45,464 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=338076.6666666667, ans=0.125 2023-10-10 11:04:51,194 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.54 vs. 
limit=22.5 2023-10-10 11:05:03,597 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.786e+02 2.007e+02 2.280e+02 3.706e+02, threshold=4.013e+02, percent-clipped=0.0 2023-10-10 11:05:18,633 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.37 vs. limit=15.0 2023-10-10 11:05:22,776 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=338216.6666666667, ans=0.0 2023-10-10 11:05:34,368 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=338263.3333333333, ans=0.0 2023-10-10 11:05:35,229 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=338263.3333333333, ans=0.0 2023-10-10 11:05:42,176 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=338310.0, ans=0.125 2023-10-10 11:05:44,311 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=338310.0, ans=0.0 2023-10-10 11:05:52,869 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=338310.0, ans=0.0 2023-10-10 11:05:58,437 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=338356.6666666667, ans=0.04949747468305833 2023-10-10 11:06:13,945 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=338403.3333333333, ans=0.09899494936611666 2023-10-10 11:06:18,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=338450.0, ans=0.125 2023-10-10 11:06:19,039 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=338450.0, ans=0.125 2023-10-10 11:06:41,903 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=338543.3333333333, ans=0.2 2023-10-10 11:06:51,203 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.349e+02 1.711e+02 1.838e+02 2.041e+02 2.638e+02, threshold=3.676e+02, percent-clipped=0.0 2023-10-10 11:07:05,455 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=338636.6666666667, ans=0.1 2023-10-10 11:07:32,522 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=338776.6666666667, ans=0.125 2023-10-10 11:07:47,739 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=338823.3333333333, ans=0.125 2023-10-10 11:07:48,896 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=338823.3333333333, ans=0.0 2023-10-10 11:07:52,598 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.68 vs. 
limit=10.0 2023-10-10 11:07:55,115 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=338870.0, ans=0.2 2023-10-10 11:07:58,857 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=338870.0, ans=0.05 2023-10-10 11:08:06,127 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=338916.6666666667, ans=0.025 2023-10-10 11:08:11,347 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=338963.3333333333, ans=0.125 2023-10-10 11:08:42,574 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.859e+02 2.141e+02 2.428e+02 3.490e+02, threshold=4.281e+02, percent-clipped=0.0 2023-10-10 11:08:49,142 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=339103.3333333333, ans=0.2 2023-10-10 11:08:51,115 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=339103.3333333333, ans=0.0 2023-10-10 11:09:18,261 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=5.41 vs. limit=12.0 2023-10-10 11:09:38,440 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=339290.0, ans=0.125 2023-10-10 11:10:08,632 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=339383.3333333333, ans=0.125 2023-10-10 11:10:25,465 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=339430.0, ans=0.125 2023-10-10 11:10:40,665 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.77 vs. limit=10.0 2023-10-10 11:10:43,340 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.701e+02 1.873e+02 2.154e+02 3.012e+02, threshold=3.746e+02, percent-clipped=0.0 2023-10-10 11:10:59,685 INFO [train.py:1031] (0/4) Epoch 6, batch 4500, loss[loss=0.1891, simple_loss=0.2694, pruned_loss=0.05437, over 15504.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3092, pruned_loss=0.07099, over 29388059.88 frames. ], batch size: 35, lr: 6.38e-03, grad_scale: 32.0 2023-10-10 11:11:25,990 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.20 vs. 
limit=12.0 2023-10-10 11:11:36,250 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=339756.6666666667, ans=0.025 2023-10-10 11:11:48,962 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=339803.3333333333, ans=0.09899494936611666 2023-10-10 11:11:55,638 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=339850.0, ans=0.2 2023-10-10 11:12:24,866 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=339990.0, ans=0.2 2023-10-10 11:12:30,814 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.786e+02 1.964e+02 2.243e+02 3.413e+02, threshold=3.929e+02, percent-clipped=0.0 2023-10-10 11:12:36,426 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=340036.6666666667, ans=0.125 2023-10-10 11:12:47,328 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=340083.3333333333, ans=0.125 2023-10-10 11:12:52,230 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=340083.3333333333, ans=0.125 2023-10-10 11:12:59,577 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=340130.0, ans=0.0 2023-10-10 11:12:59,579 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=340130.0, ans=0.125 2023-10-10 11:13:09,462 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=340176.6666666667, ans=0.1 2023-10-10 11:13:11,637 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=340176.6666666667, ans=0.1 2023-10-10 11:13:14,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=340176.6666666667, ans=0.125 2023-10-10 11:13:15,077 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.47 vs. limit=10.0 2023-10-10 11:13:23,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=340223.3333333333, ans=0.2 2023-10-10 11:13:32,633 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.21 vs. limit=15.0 2023-10-10 11:13:38,266 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=340316.6666666667, ans=0.125 2023-10-10 11:13:39,265 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.71 vs. 
limit=10.0 2023-10-10 11:13:45,351 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=340316.6666666667, ans=0.125 2023-10-10 11:14:00,870 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=340410.0, ans=0.07 2023-10-10 11:14:08,160 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=340410.0, ans=0.09899494936611666 2023-10-10 11:14:16,416 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.500e+02 1.817e+02 2.055e+02 2.256e+02 3.453e+02, threshold=4.110e+02, percent-clipped=0.0 2023-10-10 11:14:39,477 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=340550.0, ans=0.0 2023-10-10 11:14:44,147 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=340596.6666666667, ans=0.1 2023-10-10 11:15:05,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=340690.0, ans=0.0 2023-10-10 11:15:07,754 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=340690.0, ans=0.1 2023-10-10 11:15:26,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=340783.3333333333, ans=0.1 2023-10-10 11:15:47,383 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=340876.6666666667, ans=0.125 2023-10-10 11:15:51,112 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=340876.6666666667, ans=0.0 2023-10-10 11:15:58,495 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.339e+02 1.662e+02 1.830e+02 2.207e+02 3.074e+02, threshold=3.660e+02, percent-clipped=0.0 2023-10-10 11:16:09,499 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=340970.0, ans=0.125 2023-10-10 11:16:16,747 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=341016.6666666667, ans=0.125 2023-10-10 11:16:17,657 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=341016.6666666667, ans=0.1 2023-10-10 11:16:18,573 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=341016.6666666667, ans=0.125 2023-10-10 11:16:31,908 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=341063.3333333333, ans=0.0 2023-10-10 11:16:52,605 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=341156.6666666667, ans=0.2 2023-10-10 11:16:53,811 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=341156.6666666667, ans=0.125 2023-10-10 11:17:00,426 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=341156.6666666667, ans=0.125 2023-10-10 11:17:01,582 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=341156.6666666667, ans=0.125 2023-10-10 11:17:03,038 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.76 vs. limit=10.0 2023-10-10 11:17:30,518 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.70 vs. limit=22.5 2023-10-10 11:17:38,887 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=341343.3333333333, ans=0.125 2023-10-10 11:17:42,058 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.97 vs. limit=15.0 2023-10-10 11:17:51,041 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=341390.0, ans=0.2 2023-10-10 11:17:54,517 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.328e+02 1.730e+02 1.914e+02 2.194e+02 3.312e+02, threshold=3.827e+02, percent-clipped=0.0 2023-10-10 11:18:03,552 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.86 vs. limit=15.0 2023-10-10 11:18:13,847 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=341483.3333333333, ans=0.125 2023-10-10 11:18:14,015 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.36 vs. limit=22.5 2023-10-10 11:18:36,630 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=341576.6666666667, ans=0.0 2023-10-10 11:18:43,389 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=341576.6666666667, ans=0.125 2023-10-10 11:18:55,832 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=341623.3333333333, ans=0.2 2023-10-10 11:19:29,151 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=341763.3333333333, ans=0.0 2023-10-10 11:19:37,107 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=341810.0, ans=0.0 2023-10-10 11:19:42,303 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.56 vs. limit=22.5 2023-10-10 11:19:47,535 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.771e+02 1.959e+02 2.251e+02 3.276e+02, threshold=3.917e+02, percent-clipped=0.0 2023-10-10 11:19:58,768 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=341903.3333333333, ans=0.125 2023-10-10 11:20:04,305 INFO [train.py:1031] (0/4) Epoch 6, batch 5000, loss[loss=0.2902, simple_loss=0.3406, pruned_loss=0.1199, over 15587.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3089, pruned_loss=0.07118, over 30128929.57 frames. 
], batch size: 350, lr: 6.36e-03, grad_scale: 32.0 2023-10-10 11:20:18,343 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=341996.6666666667, ans=0.125 2023-10-10 11:20:23,192 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.62 vs. limit=15.0 2023-10-10 11:20:26,144 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=341996.6666666667, ans=0.0 2023-10-10 11:20:27,654 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=341996.6666666667, ans=0.1 2023-10-10 11:20:32,482 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.41 vs. limit=10.0 2023-10-10 11:21:04,491 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=342183.3333333333, ans=0.125 2023-10-10 11:21:05,172 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 11:21:07,152 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=342183.3333333333, ans=0.125 2023-10-10 11:21:19,348 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=342230.0, ans=10.0 2023-10-10 11:21:36,722 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.848e+02 2.020e+02 2.423e+02 3.330e+02, threshold=4.040e+02, percent-clipped=0.0 2023-10-10 11:21:40,001 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=342323.3333333333, ans=0.0 2023-10-10 11:21:49,453 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.37 vs. limit=15.0 2023-10-10 11:22:32,597 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=342556.6666666667, ans=0.2 2023-10-10 11:22:40,251 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.64 vs. limit=15.0 2023-10-10 11:22:50,906 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=342650.0, ans=0.1 2023-10-10 11:23:03,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=342696.6666666667, ans=0.125 2023-10-10 11:23:27,354 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.897e+02 2.113e+02 2.407e+02 3.339e+02, threshold=4.227e+02, percent-clipped=0.0 2023-10-10 11:23:29,876 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 11:23:36,682 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.88 vs. 
limit=22.5 2023-10-10 11:23:42,690 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=342883.3333333333, ans=0.0 2023-10-10 11:23:44,648 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=342883.3333333333, ans=0.0 2023-10-10 11:23:53,601 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=342930.0, ans=0.0 2023-10-10 11:24:01,626 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.77 vs. limit=15.0 2023-10-10 11:24:23,746 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.66 vs. limit=10.0 2023-10-10 11:24:24,937 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=3.92 vs. limit=12.0 2023-10-10 11:24:35,024 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=343116.6666666667, ans=0.0 2023-10-10 11:24:44,357 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=343116.6666666667, ans=0.125 2023-10-10 11:25:01,160 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.89 vs. limit=10.0 2023-10-10 11:25:07,473 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=343210.0, ans=0.2 2023-10-10 11:25:10,659 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.61 vs. limit=22.5 2023-10-10 11:25:17,910 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.380e+02 1.688e+02 1.884e+02 2.177e+02 2.925e+02, threshold=3.768e+02, percent-clipped=0.0 2023-10-10 11:25:26,804 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=343303.3333333333, ans=0.125 2023-10-10 11:25:27,828 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=343303.3333333333, ans=0.125 2023-10-10 11:25:36,538 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=343350.0, ans=0.0 2023-10-10 11:25:36,607 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=343350.0, ans=0.2 2023-10-10 11:25:40,317 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=343350.0, ans=0.1 2023-10-10 11:25:52,048 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.20 vs. limit=15.0 2023-10-10 11:25:52,710 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 11:26:29,534 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.88 vs. 
limit=6.0 2023-10-10 11:26:33,806 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.81 vs. limit=10.0 2023-10-10 11:26:38,250 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=343583.3333333333, ans=0.125 2023-10-10 11:26:47,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=343630.0, ans=0.0 2023-10-10 11:26:51,233 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=343630.0, ans=0.2 2023-10-10 11:26:57,431 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.94 vs. limit=15.0 2023-10-10 11:26:58,178 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=343676.6666666667, ans=0.125 2023-10-10 11:27:08,335 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.860e+02 2.178e+02 2.540e+02 4.355e+02, threshold=4.355e+02, percent-clipped=1.0 2023-10-10 11:27:11,759 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=343723.3333333333, ans=0.1 2023-10-10 11:27:37,521 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=343863.3333333333, ans=0.125 2023-10-10 11:27:42,785 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=343863.3333333333, ans=0.125 2023-10-10 11:27:44,718 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=343910.0, ans=0.125 2023-10-10 11:28:02,482 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=343956.6666666667, ans=0.0 2023-10-10 11:28:03,932 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=343956.6666666667, ans=0.125 2023-10-10 11:28:07,577 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=344003.3333333333, ans=0.125 2023-10-10 11:28:34,189 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_na.min_abs, batch_count=344096.6666666667, ans=0.02 2023-10-10 11:28:52,941 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=344190.0, ans=0.125 2023-10-10 11:28:53,458 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.385e+02 1.709e+02 1.830e+02 1.991e+02 2.724e+02, threshold=3.659e+02, percent-clipped=0.0 2023-10-10 11:28:54,585 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=344190.0, ans=0.125 2023-10-10 11:28:56,415 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=344190.0, ans=0.125 2023-10-10 11:29:07,686 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=344283.3333333333, ans=0.0 2023-10-10 11:29:08,357 INFO 
[train.py:1031] (0/4) Epoch 6, batch 5500, loss[loss=0.2063, simple_loss=0.3007, pruned_loss=0.05594, over 16857.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3085, pruned_loss=0.07079, over 30727854.69 frames. ], batch size: 98, lr: 6.34e-03, grad_scale: 32.0 2023-10-10 11:29:23,807 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.77 vs. limit=15.0 2023-10-10 11:29:39,611 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.30 vs. limit=6.0 2023-10-10 11:29:54,531 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=344470.0, ans=0.07 2023-10-10 11:30:15,781 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=344563.3333333333, ans=0.125 2023-10-10 11:30:34,927 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=344656.6666666667, ans=0.1 2023-10-10 11:30:40,280 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.375e+02 1.759e+02 1.903e+02 2.067e+02 2.738e+02, threshold=3.807e+02, percent-clipped=0.0 2023-10-10 11:30:43,925 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.46 vs. limit=15.0 2023-10-10 11:30:56,242 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.60 vs. limit=10.0 2023-10-10 11:31:24,585 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=344843.3333333333, ans=0.0 2023-10-10 11:31:29,342 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=344890.0, ans=0.0 2023-10-10 11:31:32,274 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=344890.0, ans=0.2 2023-10-10 11:31:34,627 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=344890.0, ans=0.125 2023-10-10 11:31:40,975 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=344936.6666666667, ans=0.1 2023-10-10 11:31:41,037 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=344936.6666666667, ans=0.0 2023-10-10 11:31:41,232 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.15 vs. 
limit=15.0 2023-10-10 11:31:45,437 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=344936.6666666667, ans=0.0 2023-10-10 11:31:46,643 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=344936.6666666667, ans=0.1 2023-10-10 11:31:52,549 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=344983.3333333333, ans=0.0 2023-10-10 11:31:59,897 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.44 vs. limit=15.0 2023-10-10 11:31:59,912 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.61 vs. limit=6.0 2023-10-10 11:32:01,361 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=345030.0, ans=0.125 2023-10-10 11:32:14,903 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=345076.6666666667, ans=0.0 2023-10-10 11:32:22,835 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=345076.6666666667, ans=0.125 2023-10-10 11:32:32,791 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.417e+02 1.705e+02 1.912e+02 2.140e+02 3.023e+02, threshold=3.824e+02, percent-clipped=0.0 2023-10-10 11:32:44,122 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-10 11:32:45,119 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=345170.0, ans=0.0 2023-10-10 11:32:49,782 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.10 vs. limit=15.0 2023-10-10 11:32:51,253 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=345216.6666666667, ans=0.09899494936611666 2023-10-10 11:32:59,661 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 11:33:11,335 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=345310.0, ans=0.0 2023-10-10 11:33:15,638 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.50 vs. limit=15.0 2023-10-10 11:33:40,203 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.91 vs. limit=15.0 2023-10-10 11:34:00,762 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=345496.6666666667, ans=0.125 2023-10-10 11:34:02,591 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=345496.6666666667, ans=0.0 2023-10-10 11:34:02,767 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.12 vs. 
limit=22.5 2023-10-10 11:34:05,129 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=345496.6666666667, ans=0.125 2023-10-10 11:34:19,973 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=345590.0, ans=0.125 2023-10-10 11:34:26,392 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 11:34:28,292 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.385e+02 1.681e+02 1.859e+02 2.178e+02 3.343e+02, threshold=3.717e+02, percent-clipped=0.0 2023-10-10 11:34:28,754 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=345590.0, ans=0.2 2023-10-10 11:34:37,117 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=345636.6666666667, ans=0.1 2023-10-10 11:34:40,032 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.whiten.whitening_limit, batch_count=345636.6666666667, ans=12.0 2023-10-10 11:34:40,593 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=345636.6666666667, ans=0.0 2023-10-10 11:34:45,904 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.41 vs. limit=15.0 2023-10-10 11:34:47,753 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=345683.3333333333, ans=0.125 2023-10-10 11:34:53,406 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=345730.0, ans=10.0 2023-10-10 11:34:53,493 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.61 vs. 
limit=12.0 2023-10-10 11:35:09,838 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=345776.6666666667, ans=0.0 2023-10-10 11:35:11,809 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=345776.6666666667, ans=0.2 2023-10-10 11:35:18,163 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=345823.3333333333, ans=0.1 2023-10-10 11:35:26,875 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=345870.0, ans=0.125 2023-10-10 11:35:33,552 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=345870.0, ans=0.2 2023-10-10 11:35:34,279 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=345870.0, ans=0.125 2023-10-10 11:36:19,009 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.346e+02 1.688e+02 1.854e+02 2.078e+02 3.197e+02, threshold=3.708e+02, percent-clipped=0.0 2023-10-10 11:36:39,530 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=346150.0, ans=0.2 2023-10-10 11:36:44,315 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=346150.0, ans=0.125 2023-10-10 11:36:50,354 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 11:36:55,335 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.11 vs. limit=15.0 2023-10-10 11:36:55,973 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=346243.3333333333, ans=0.125 2023-10-10 11:37:28,646 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=346336.6666666667, ans=0.0 2023-10-10 11:37:38,178 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=346383.3333333333, ans=0.0 2023-10-10 11:37:43,812 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=346430.0, ans=0.1 2023-10-10 11:37:54,427 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=346476.6666666667, ans=0.0 2023-10-10 11:37:55,162 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=346476.6666666667, ans=0.1 2023-10-10 11:37:57,144 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=346476.6666666667, ans=0.0 2023-10-10 11:38:09,433 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.273e+02 1.681e+02 1.803e+02 2.113e+02 3.282e+02, threshold=3.606e+02, percent-clipped=0.0 2023-10-10 11:38:12,910 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.05 vs. limit=15.0 2023-10-10 11:38:25,140 INFO [train.py:1031] (0/4) Epoch 6, batch 6000, loss[loss=0.2362, simple_loss=0.3156, pruned_loss=0.07839, over 16943.00 frames. 
], tot_loss[loss=0.2253, simple_loss=0.3087, pruned_loss=0.07094, over 31186819.37 frames. ], batch size: 77, lr: 6.32e-03, grad_scale: 32.0 2023-10-10 11:38:29,548 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=346616.6666666667, ans=0.0 2023-10-10 11:38:31,952 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=346616.6666666667, ans=0.0 2023-10-10 11:38:37,527 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=346663.3333333333, ans=0.0 2023-10-10 11:38:57,003 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=346756.6666666667, ans=0.125 2023-10-10 11:38:58,111 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.85 vs. limit=6.0 2023-10-10 11:39:01,823 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=346756.6666666667, ans=0.125 2023-10-10 11:39:26,039 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=346850.0, ans=0.07 2023-10-10 11:39:35,707 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=346896.6666666667, ans=0.125 2023-10-10 11:39:45,506 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=346943.3333333333, ans=0.125 2023-10-10 11:39:59,320 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.356e+02 1.748e+02 1.895e+02 2.344e+02 4.029e+02, threshold=3.790e+02, percent-clipped=1.0 2023-10-10 11:40:08,405 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=347036.6666666667, ans=0.125 2023-10-10 11:40:08,774 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.34 vs. limit=6.0 2023-10-10 11:40:14,753 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=347083.3333333333, ans=0.125 2023-10-10 11:40:27,786 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=347130.0, ans=0.2 2023-10-10 11:40:30,960 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.36 vs. limit=10.0 2023-10-10 11:40:34,310 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.13 vs. limit=22.5 2023-10-10 11:40:35,603 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-10-10 11:40:48,412 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=347223.3333333333, ans=0.0 2023-10-10 11:40:52,769 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.60 vs. 
limit=15.0 2023-10-10 11:41:08,189 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=347270.0, ans=0.07 2023-10-10 11:41:15,868 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=347316.6666666667, ans=0.125 2023-10-10 11:41:25,907 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.88 vs. limit=15.0 2023-10-10 11:41:33,189 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.35 vs. limit=15.0 2023-10-10 11:41:50,739 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.749e+02 1.951e+02 2.157e+02 3.026e+02, threshold=3.902e+02, percent-clipped=0.0 2023-10-10 11:42:18,174 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=347596.6666666667, ans=0.125 2023-10-10 11:42:18,955 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=347596.6666666667, ans=0.1 2023-10-10 11:42:47,818 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.45 vs. limit=15.0 2023-10-10 11:42:50,270 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 11:42:50,360 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=347736.6666666667, ans=0.125 2023-10-10 11:42:56,021 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=347736.6666666667, ans=0.1 2023-10-10 11:42:59,697 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.76 vs. limit=15.0 2023-10-10 11:43:11,736 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.15 vs. limit=12.0 2023-10-10 11:43:33,783 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=347923.3333333333, ans=0.0 2023-10-10 11:43:41,654 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.819e+02 1.996e+02 2.270e+02 2.742e+02, threshold=3.992e+02, percent-clipped=0.0 2023-10-10 11:44:03,111 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=348016.6666666667, ans=0.1 2023-10-10 11:44:03,153 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=348016.6666666667, ans=0.05 2023-10-10 11:44:03,162 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=348016.6666666667, ans=0.2 2023-10-10 11:44:04,975 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=348016.6666666667, ans=0.125 2023-10-10 11:44:54,294 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.73 vs. 
limit=10.0 2023-10-10 11:45:04,473 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=348250.0, ans=0.125 2023-10-10 11:45:36,828 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=348390.0, ans=0.125 2023-10-10 11:45:43,216 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.349e+02 1.781e+02 1.963e+02 2.216e+02 3.647e+02, threshold=3.927e+02, percent-clipped=0.0 2023-10-10 11:45:52,608 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.58 vs. limit=15.0 2023-10-10 11:45:56,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=348483.3333333333, ans=0.1 2023-10-10 11:46:01,373 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=12.0 2023-10-10 11:46:42,709 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=348670.0, ans=0.125 2023-10-10 11:46:47,886 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=348670.0, ans=0.2 2023-10-10 11:46:49,660 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=348716.6666666667, ans=0.04949747468305833 2023-10-10 11:46:57,495 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=348716.6666666667, ans=0.125 2023-10-10 11:47:07,180 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=348763.3333333333, ans=0.1 2023-10-10 11:47:08,085 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=348763.3333333333, ans=0.1 2023-10-10 11:47:19,757 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=348810.0, ans=0.125 2023-10-10 11:47:33,785 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.350e+02 1.754e+02 1.994e+02 2.288e+02 3.082e+02, threshold=3.988e+02, percent-clipped=0.0 2023-10-10 11:47:35,856 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 11:47:46,379 INFO [train.py:1031] (0/4) Epoch 6, batch 6500, loss[loss=0.212, simple_loss=0.3077, pruned_loss=0.0582, over 16992.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3092, pruned_loss=0.07102, over 31549996.93 frames. ], batch size: 93, lr: 6.30e-03, grad_scale: 32.0 2023-10-10 11:48:22,556 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.60 vs. 
limit=15.0 2023-10-10 11:48:34,633 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=349090.0, ans=0.0 2023-10-10 11:48:51,999 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=349136.6666666667, ans=0.125 2023-10-10 11:49:00,114 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.91 vs. limit=15.0 2023-10-10 11:49:00,868 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.79 vs. limit=15.0 2023-10-10 11:49:02,897 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=349183.3333333333, ans=0.125 2023-10-10 11:49:16,870 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=349276.6666666667, ans=0.0 2023-10-10 11:49:20,188 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=349276.6666666667, ans=0.125 2023-10-10 11:49:33,021 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=349323.3333333333, ans=0.2 2023-10-10 11:49:36,153 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.393e+02 1.685e+02 1.952e+02 2.193e+02 3.527e+02, threshold=3.903e+02, percent-clipped=0.0 2023-10-10 11:49:52,949 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=349416.6666666667, ans=0.2 2023-10-10 11:49:54,116 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=349416.6666666667, ans=0.2 2023-10-10 11:49:55,004 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=349416.6666666667, ans=0.125 2023-10-10 11:50:04,887 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.54 vs. 
limit=15.0 2023-10-10 11:50:15,777 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=349510.0, ans=0.0 2023-10-10 11:50:55,410 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=349650.0, ans=0.09899494936611666 2023-10-10 11:50:56,409 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=349696.6666666667, ans=0.1 2023-10-10 11:50:57,384 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=349696.6666666667, ans=0.125 2023-10-10 11:51:19,532 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=349790.0, ans=0.025 2023-10-10 11:51:24,675 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.782e+02 2.037e+02 2.284e+02 3.707e+02, threshold=4.074e+02, percent-clipped=0.0 2023-10-10 11:51:25,862 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=349790.0, ans=0.0 2023-10-10 11:51:31,369 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=349836.6666666667, ans=0.0 2023-10-10 11:51:50,043 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=349930.0, ans=0.1 2023-10-10 11:52:23,676 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=350070.0, ans=0.1 2023-10-10 11:53:17,750 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.320e+02 1.609e+02 1.756e+02 1.958e+02 2.766e+02, threshold=3.512e+02, percent-clipped=0.0 2023-10-10 11:53:31,227 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 11:53:58,118 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 11:54:25,499 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=350490.0, ans=0.125 2023-10-10 11:54:31,783 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=350536.6666666667, ans=0.1 2023-10-10 11:54:41,330 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=350536.6666666667, ans=0.1 2023-10-10 11:55:16,288 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=350676.6666666667, ans=0.2 2023-10-10 11:55:23,372 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.87 vs. limit=22.5 2023-10-10 11:55:29,790 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.384e+02 1.720e+02 1.935e+02 2.255e+02 3.923e+02, threshold=3.871e+02, percent-clipped=2.0 2023-10-10 11:55:32,053 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.30 vs. 
limit=22.5 2023-10-10 11:55:34,486 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=350770.0, ans=0.0 2023-10-10 11:55:50,795 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.75 vs. limit=22.5 2023-10-10 11:56:04,419 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=350910.0, ans=0.125 2023-10-10 11:56:07,111 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=350910.0, ans=0.09899494936611666 2023-10-10 11:56:25,192 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.61 vs. limit=22.5 2023-10-10 11:56:38,767 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=351050.0, ans=0.0 2023-10-10 11:56:49,759 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=351096.6666666667, ans=0.09899494936611666 2023-10-10 11:57:14,295 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.404e+02 1.731e+02 2.030e+02 2.454e+02 3.429e+02, threshold=4.059e+02, percent-clipped=0.0 2023-10-10 11:57:20,372 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.08 vs. limit=15.0 2023-10-10 11:57:26,271 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=351283.3333333333, ans=0.1 2023-10-10 11:57:26,897 INFO [train.py:1031] (0/4) Epoch 6, batch 7000, loss[loss=0.2448, simple_loss=0.3259, pruned_loss=0.08186, over 16893.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3096, pruned_loss=0.07096, over 31828912.83 frames. ], batch size: 146, lr: 6.27e-03, grad_scale: 32.0 2023-10-10 11:57:28,142 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=351283.3333333333, ans=0.125 2023-10-10 11:57:36,192 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=351330.0, ans=0.09899494936611666 2023-10-10 11:57:46,960 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=351330.0, ans=0.09899494936611666 2023-10-10 11:57:47,383 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.80 vs. limit=6.0 2023-10-10 11:58:15,101 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=351470.0, ans=10.0 2023-10-10 11:58:15,150 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=351470.0, ans=0.05 2023-10-10 11:58:16,220 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.91 vs. limit=15.0 2023-10-10 11:58:26,109 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=7.75 vs. 
limit=22.5 2023-10-10 11:58:39,042 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=351563.3333333333, ans=0.125 2023-10-10 11:58:39,955 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=351563.3333333333, ans=0.125 2023-10-10 11:59:01,648 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.455e+02 1.813e+02 1.991e+02 2.272e+02 3.301e+02, threshold=3.982e+02, percent-clipped=0.0 2023-10-10 11:59:11,963 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.37 vs. limit=22.5 2023-10-10 11:59:26,172 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=351796.6666666667, ans=0.125 2023-10-10 11:59:31,600 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=351796.6666666667, ans=0.0 2023-10-10 12:00:00,134 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=351890.0, ans=10.0 2023-10-10 12:00:02,167 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.93 vs. limit=10.0 2023-10-10 12:00:06,134 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=351936.6666666667, ans=0.2 2023-10-10 12:00:07,624 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 12:00:18,730 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=351983.3333333333, ans=0.125 2023-10-10 12:00:52,922 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.774e+02 1.945e+02 2.181e+02 3.277e+02, threshold=3.890e+02, percent-clipped=0.0 2023-10-10 12:00:54,159 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=352123.3333333333, ans=0.05 2023-10-10 12:00:57,745 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=352170.0, ans=0.125 2023-10-10 12:01:09,648 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.53 vs. 
limit=15.0 2023-10-10 12:01:36,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=352263.3333333333, ans=0.0 2023-10-10 12:02:04,605 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=352356.6666666667, ans=0.0 2023-10-10 12:02:45,130 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=352543.3333333333, ans=0.07 2023-10-10 12:02:46,129 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=352543.3333333333, ans=0.95 2023-10-10 12:03:02,375 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.345e+02 1.675e+02 1.848e+02 2.068e+02 2.833e+02, threshold=3.695e+02, percent-clipped=0.0 2023-10-10 12:03:10,456 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.04 vs. limit=6.0 2023-10-10 12:03:13,240 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=352636.6666666667, ans=0.07 2023-10-10 12:03:26,835 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=352730.0, ans=0.125 2023-10-10 12:03:29,659 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=352730.0, ans=0.125 2023-10-10 12:03:45,733 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=352776.6666666667, ans=0.125 2023-10-10 12:03:51,578 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=352823.3333333333, ans=0.1 2023-10-10 12:03:56,326 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=352823.3333333333, ans=0.125 2023-10-10 12:04:05,146 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=352870.0, ans=0.125 2023-10-10 12:04:16,494 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=352916.6666666667, ans=0.1 2023-10-10 12:04:41,480 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=353010.0, ans=0.1 2023-10-10 12:04:50,927 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 12:04:55,741 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.274e+02 1.677e+02 1.893e+02 2.216e+02 2.791e+02, threshold=3.786e+02, percent-clipped=0.0 2023-10-10 12:04:56,108 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=353056.6666666667, ans=0.1 2023-10-10 12:05:06,896 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=353103.3333333333, ans=0.0 2023-10-10 12:05:23,229 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=353196.6666666667, ans=0.2 2023-10-10 12:05:25,835 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=353196.6666666667, ans=0.1 2023-10-10 12:05:47,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=353290.0, ans=0.125 2023-10-10 12:06:07,263 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=353383.3333333333, ans=0.125 2023-10-10 12:06:07,526 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.41 vs. limit=10.0 2023-10-10 12:06:21,451 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=353430.0, ans=0.0 2023-10-10 12:06:33,696 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=353476.6666666667, ans=0.1 2023-10-10 12:06:43,185 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.755e+02 1.923e+02 2.122e+02 3.396e+02, threshold=3.846e+02, percent-clipped=0.0 2023-10-10 12:06:47,723 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=353570.0, ans=0.07 2023-10-10 12:06:52,168 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.25 vs. limit=15.0 2023-10-10 12:06:58,856 INFO [train.py:1031] (0/4) Epoch 6, batch 7500, loss[loss=0.2157, simple_loss=0.3023, pruned_loss=0.06453, over 15572.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3093, pruned_loss=0.07073, over 32043552.50 frames. ], batch size: 35, lr: 6.25e-03, grad_scale: 16.0 2023-10-10 12:07:18,002 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=353663.3333333333, ans=0.0 2023-10-10 12:07:23,140 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=353710.0, ans=0.1 2023-10-10 12:07:27,579 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.55 vs. 
limit=15.0 2023-10-10 12:07:32,024 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=353756.6666666667, ans=0.2 2023-10-10 12:07:59,624 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 12:08:12,113 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=353896.6666666667, ans=0.0 2023-10-10 12:08:13,033 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=353896.6666666667, ans=0.0 2023-10-10 12:08:32,382 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=353990.0, ans=0.2 2023-10-10 12:08:34,285 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=353990.0, ans=0.0 2023-10-10 12:08:36,955 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.398e+02 1.704e+02 1.940e+02 2.320e+02 4.347e+02, threshold=3.881e+02, percent-clipped=3.0 2023-10-10 12:08:49,285 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=354083.3333333333, ans=0.0 2023-10-10 12:09:10,048 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.87 vs. limit=10.0 2023-10-10 12:09:27,859 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.61 vs. limit=15.0 2023-10-10 12:09:54,617 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=354270.0, ans=0.125 2023-10-10 12:09:55,992 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.78 vs. limit=12.0 2023-10-10 12:10:37,927 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=354456.6666666667, ans=0.0 2023-10-10 12:10:38,694 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 1.701e+02 1.962e+02 2.334e+02 3.750e+02, threshold=3.924e+02, percent-clipped=0.0 2023-10-10 12:11:02,298 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=354596.6666666667, ans=0.0 2023-10-10 12:11:07,471 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=354596.6666666667, ans=0.125 2023-10-10 12:11:30,504 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=354690.0, ans=0.2 2023-10-10 12:11:31,489 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=354690.0, ans=0.0 2023-10-10 12:11:38,874 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.63 vs. limit=10.0 2023-10-10 12:11:51,386 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.51 vs. 
limit=22.5 2023-10-10 12:11:57,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=354830.0, ans=0.125 2023-10-10 12:11:58,815 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=354830.0, ans=0.0 2023-10-10 12:11:59,944 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 12:12:04,823 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=354830.0, ans=0.125 2023-10-10 12:12:27,337 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.367e+02 1.754e+02 1.899e+02 2.102e+02 2.940e+02, threshold=3.798e+02, percent-clipped=0.0 2023-10-10 12:12:29,136 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=354970.0, ans=0.0 2023-10-10 12:12:32,710 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=354970.0, ans=0.0 2023-10-10 12:13:19,571 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=355156.6666666667, ans=0.0 2023-10-10 12:13:31,821 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 12:14:03,139 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=355343.3333333333, ans=0.0 2023-10-10 12:14:16,225 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=355390.0, ans=0.1 2023-10-10 12:14:17,591 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.64 vs. limit=12.0 2023-10-10 12:14:22,277 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.346e+02 1.791e+02 2.064e+02 2.304e+02 3.376e+02, threshold=4.127e+02, percent-clipped=0.0 2023-10-10 12:14:39,585 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.39 vs. limit=10.0 2023-10-10 12:14:49,198 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.58 vs. limit=12.0 2023-10-10 12:14:55,939 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=355530.0, ans=0.0 2023-10-10 12:14:57,155 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.02 vs. 
limit=15.0 2023-10-10 12:15:21,222 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=355623.3333333333, ans=0.125 2023-10-10 12:15:22,239 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=355623.3333333333, ans=0.125 2023-10-10 12:15:34,244 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=355670.0, ans=0.2 2023-10-10 12:15:39,552 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=355716.6666666667, ans=0.0 2023-10-10 12:15:44,720 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=355716.6666666667, ans=0.1 2023-10-10 12:15:55,832 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=355763.3333333333, ans=0.125 2023-10-10 12:15:55,915 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=355763.3333333333, ans=0.125 2023-10-10 12:15:59,935 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=355810.0, ans=0.2 2023-10-10 12:16:02,155 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=355810.0, ans=0.0 2023-10-10 12:16:03,296 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=355810.0, ans=0.125 2023-10-10 12:16:19,174 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.326e+02 1.655e+02 1.902e+02 2.362e+02 3.607e+02, threshold=3.803e+02, percent-clipped=0.0 2023-10-10 12:16:25,592 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.79 vs. limit=15.0 2023-10-10 12:16:32,327 INFO [train.py:1031] (0/4) Epoch 6, batch 8000, loss[loss=0.2162, simple_loss=0.3083, pruned_loss=0.06206, over 15304.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.3085, pruned_loss=0.0699, over 32228798.57 frames. ], batch size: 35, lr: 6.23e-03, grad_scale: 32.0 2023-10-10 12:16:38,124 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=355950.0, ans=0.125 2023-10-10 12:16:40,588 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=355950.0, ans=0.125 2023-10-10 12:16:53,219 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=356043.3333333333, ans=0.125 2023-10-10 12:17:02,640 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.06 vs. 
limit=22.5 2023-10-10 12:17:06,840 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=356090.0, ans=0.2 2023-10-10 12:17:09,319 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=356090.0, ans=0.1 2023-10-10 12:17:30,664 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.92 vs. limit=15.0 2023-10-10 12:17:59,454 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=356323.3333333333, ans=0.0 2023-10-10 12:18:03,911 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=356323.3333333333, ans=0.0 2023-10-10 12:18:06,245 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.341e+02 1.694e+02 1.912e+02 2.221e+02 3.501e+02, threshold=3.825e+02, percent-clipped=0.0 2023-10-10 12:18:07,868 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.87 vs. limit=22.5 2023-10-10 12:18:22,095 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=356416.6666666667, ans=0.0 2023-10-10 12:18:32,597 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.07 vs. limit=15.0 2023-10-10 12:18:40,582 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=356510.0, ans=0.125 2023-10-10 12:19:31,751 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=356650.0, ans=0.0 2023-10-10 12:19:32,665 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=356650.0, ans=0.125 2023-10-10 12:19:33,749 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=356696.6666666667, ans=0.125 2023-10-10 12:19:46,957 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=356696.6666666667, ans=0.0 2023-10-10 12:20:03,566 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=356790.0, ans=0.125 2023-10-10 12:20:06,010 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.03 vs. limit=12.0 2023-10-10 12:20:08,807 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.48 vs. 
limit=15.0 2023-10-10 12:20:10,601 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.359e+02 1.662e+02 1.858e+02 2.088e+02 3.872e+02, threshold=3.716e+02, percent-clipped=1.0 2023-10-10 12:20:32,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=356883.3333333333, ans=0.0 2023-10-10 12:20:37,412 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=356930.0, ans=0.125 2023-10-10 12:20:37,443 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=356930.0, ans=0.2 2023-10-10 12:20:55,536 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=357023.3333333333, ans=0.09899494936611666 2023-10-10 12:20:58,184 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=357023.3333333333, ans=0.0 2023-10-10 12:21:02,337 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=357023.3333333333, ans=0.1 2023-10-10 12:21:05,024 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.22 vs. limit=22.5 2023-10-10 12:21:15,043 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.75 vs. limit=12.0 2023-10-10 12:21:18,371 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=357116.6666666667, ans=0.0 2023-10-10 12:21:20,640 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.11 vs. 
limit=15.0 2023-10-10 12:21:21,840 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=357116.6666666667, ans=0.125 2023-10-10 12:21:29,660 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=357163.3333333333, ans=0.2 2023-10-10 12:21:33,884 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=357163.3333333333, ans=0.2 2023-10-10 12:21:48,520 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=357256.6666666667, ans=0.125 2023-10-10 12:21:50,104 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=357256.6666666667, ans=0.1 2023-10-10 12:21:56,305 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=357256.6666666667, ans=0.1 2023-10-10 12:21:58,380 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.271e+02 1.775e+02 2.039e+02 2.521e+02 4.705e+02, threshold=4.079e+02, percent-clipped=3.0 2023-10-10 12:22:26,672 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=357396.6666666667, ans=0.0 2023-10-10 12:22:32,990 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=357396.6666666667, ans=0.07 2023-10-10 12:22:49,210 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=357490.0, ans=0.0 2023-10-10 12:22:49,276 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=357490.0, ans=0.09899494936611666 2023-10-10 12:22:50,253 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=357490.0, ans=0.125 2023-10-10 12:22:57,942 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=357490.0, ans=0.04949747468305833 2023-10-10 12:23:02,838 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=357536.6666666667, ans=0.0 2023-10-10 12:23:14,831 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.85 vs. limit=22.5 2023-10-10 12:23:49,668 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=357723.3333333333, ans=0.0 2023-10-10 12:23:52,858 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.07 vs. 
limit=12.0 2023-10-10 12:23:53,917 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 1.805e+02 2.040e+02 2.368e+02 3.381e+02, threshold=4.080e+02, percent-clipped=0.0 2023-10-10 12:24:01,305 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=357770.0, ans=0.05 2023-10-10 12:24:08,923 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=357816.6666666667, ans=0.0 2023-10-10 12:24:13,277 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=357816.6666666667, ans=0.1 2023-10-10 12:24:25,796 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 12:24:37,186 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=357910.0, ans=0.0 2023-10-10 12:24:37,516 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.77 vs. limit=22.5 2023-10-10 12:24:46,709 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=357956.6666666667, ans=0.125 2023-10-10 12:24:49,322 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=357956.6666666667, ans=0.1 2023-10-10 12:24:50,304 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=357956.6666666667, ans=0.0 2023-10-10 12:25:14,268 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=358050.0, ans=0.125 2023-10-10 12:25:48,973 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=358190.0, ans=0.125 2023-10-10 12:25:51,592 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.96 vs. limit=6.0 2023-10-10 12:25:51,687 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.314e+02 1.670e+02 1.880e+02 2.066e+02 3.321e+02, threshold=3.759e+02, percent-clipped=0.0 2023-10-10 12:25:56,138 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=358236.6666666667, ans=0.0 2023-10-10 12:26:00,156 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=358236.6666666667, ans=0.0 2023-10-10 12:26:07,580 INFO [train.py:1031] (0/4) Epoch 6, batch 8500, loss[loss=0.2226, simple_loss=0.3056, pruned_loss=0.06979, over 16634.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.3088, pruned_loss=0.0698, over 32360264.96 frames. 
], batch size: 61, lr: 6.21e-03, grad_scale: 32.0 2023-10-10 12:26:30,318 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=358376.6666666667, ans=0.2 2023-10-10 12:26:34,926 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=358376.6666666667, ans=0.0 2023-10-10 12:26:44,132 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=358423.3333333333, ans=0.125 2023-10-10 12:26:44,306 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=358423.3333333333, ans=0.2 2023-10-10 12:26:59,147 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=358470.0, ans=0.0 2023-10-10 12:27:11,686 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.18 vs. limit=15.0 2023-10-10 12:27:31,093 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=358610.0, ans=0.1 2023-10-10 12:27:45,255 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.480e+02 1.857e+02 2.075e+02 2.345e+02 3.260e+02, threshold=4.150e+02, percent-clipped=0.0 2023-10-10 12:28:24,006 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=358796.6666666667, ans=0.125 2023-10-10 12:28:33,320 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.72 vs. limit=6.0 2023-10-10 12:28:35,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=358843.3333333333, ans=0.1 2023-10-10 12:28:38,425 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=358890.0, ans=0.125 2023-10-10 12:28:39,658 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.38 vs. limit=15.0 2023-10-10 12:28:48,054 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=358890.0, ans=0.0 2023-10-10 12:28:48,310 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.37 vs. 
limit=22.5 2023-10-10 12:29:17,681 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=359030.0, ans=0.1 2023-10-10 12:29:25,419 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=359076.6666666667, ans=0.0 2023-10-10 12:29:48,461 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.338e+02 1.662e+02 1.861e+02 2.157e+02 3.147e+02, threshold=3.723e+02, percent-clipped=0.0 2023-10-10 12:29:50,210 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=359170.0, ans=0.0 2023-10-10 12:30:13,801 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=359216.6666666667, ans=0.1 2023-10-10 12:30:18,175 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=359263.3333333333, ans=0.125 2023-10-10 12:30:21,220 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=359263.3333333333, ans=0.2 2023-10-10 12:30:29,232 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=359310.0, ans=0.05 2023-10-10 12:30:47,103 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=359356.6666666667, ans=0.125 2023-10-10 12:30:47,263 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.77 vs. limit=22.5 2023-10-10 12:30:50,272 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=359403.3333333333, ans=0.0 2023-10-10 12:31:01,797 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=359450.0, ans=0.125 2023-10-10 12:31:10,937 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=359450.0, ans=0.0 2023-10-10 12:31:20,479 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=359496.6666666667, ans=0.2 2023-10-10 12:31:33,093 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=359543.3333333333, ans=0.125 2023-10-10 12:31:39,520 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=359590.0, ans=0.0 2023-10-10 12:31:49,830 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.357e+02 1.686e+02 1.989e+02 2.251e+02 3.842e+02, threshold=3.977e+02, percent-clipped=2.0 2023-10-10 12:32:27,069 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=359776.6666666667, ans=0.125 2023-10-10 12:32:31,995 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.29 vs. 
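The scaling.py:979 "Whitening" records above compare a per-module whitening metric against a limit (e.g. metric=17.77 vs. limit=22.5). As a rough illustration of what such a diagnostic can look like, and not the exact scaling.py code, the sketch below computes a metric that is 1.0 when the channel covariance is isotropic ("white") and grows as the eigenvalue spectrum degenerates, with a penalty that stays inactive while the metric is under its limit:

```python
import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
    # x: (num_frames, num_channels). Measures how far the per-group channel
    # covariance is from a multiple of the identity: 1.0 for an isotropic
    # covariance, growing as the spectrum collapses onto few directions.
    num_frames, num_channels = x.shape
    assert num_channels % num_groups == 0
    x = x.reshape(num_frames, num_groups, num_channels // num_groups)
    x = x - x.mean(dim=0, keepdim=True)              # center each channel
    cov = torch.einsum("ngi,ngj->gij", x, x) / num_frames  # (groups, c, c)
    eigs = torch.linalg.eigvalsh(cov)                # eigenvalues >= 0
    metric = (eigs ** 2).mean(dim=1) / (eigs.mean(dim=1) ** 2 + 1e-20)
    return metric.mean()

def whitening_penalty(x: torch.Tensor, limit: float, scale: float = 0.01):
    # Zero loss contribution while the metric stays under the limit.
    return scale * (whitening_metric(x) - limit).clamp(min=0.0)
```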
limit=15.0 2023-10-10 12:32:40,675 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=359823.3333333333, ans=0.2 2023-10-10 12:32:46,506 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=359870.0, ans=0.125 2023-10-10 12:32:56,640 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=359916.6666666667, ans=0.04949747468305833 2023-10-10 12:33:00,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=359916.6666666667, ans=0.1 2023-10-10 12:33:08,489 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.47 vs. limit=15.0 2023-10-10 12:33:12,804 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=359963.3333333333, ans=0.2 2023-10-10 12:33:23,600 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=360010.0, ans=0.125 2023-10-10 12:33:29,477 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.98 vs. limit=22.5 2023-10-10 12:33:34,489 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.402e+02 1.713e+02 1.890e+02 2.176e+02 3.589e+02, threshold=3.780e+02, percent-clipped=0.0 2023-10-10 12:33:39,313 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=360103.3333333333, ans=0.2 2023-10-10 12:34:00,559 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=360196.6666666667, ans=0.1 2023-10-10 12:34:23,340 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=360290.0, ans=0.125 2023-10-10 12:34:30,326 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=360336.6666666667, ans=0.0 2023-10-10 12:35:02,038 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=360476.6666666667, ans=0.1 2023-10-10 12:35:22,289 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.729e+02 1.901e+02 2.175e+02 3.364e+02, threshold=3.802e+02, percent-clipped=0.0 2023-10-10 12:35:35,233 INFO [train.py:1031] (0/4) Epoch 6, batch 9000, loss[loss=0.2221, simple_loss=0.3077, pruned_loss=0.06825, over 16910.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.3081, pruned_loss=0.06952, over 32476550.52 frames. ], batch size: 72, lr: 6.19e-03, grad_scale: 32.0 2023-10-10 12:35:38,298 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.21 vs. 
limit=15.0 2023-10-10 12:35:43,737 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=360616.6666666667, ans=0.125 2023-10-10 12:35:49,653 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=360663.3333333333, ans=0.1 2023-10-10 12:35:51,521 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=360663.3333333333, ans=0.1 2023-10-10 12:36:17,664 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=360803.3333333333, ans=0.125 2023-10-10 12:36:33,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=360850.0, ans=0.125 2023-10-10 12:36:37,054 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=360850.0, ans=0.0 2023-10-10 12:36:38,848 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=360896.6666666667, ans=0.1 2023-10-10 12:36:45,011 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=360896.6666666667, ans=0.125 2023-10-10 12:37:03,868 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=360990.0, ans=0.125 2023-10-10 12:37:06,418 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.54 vs. limit=15.0 2023-10-10 12:37:09,352 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.351e+02 1.717e+02 1.886e+02 2.114e+02 3.208e+02, threshold=3.772e+02, percent-clipped=0.0 2023-10-10 12:37:25,522 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=361083.3333333333, ans=0.125 2023-10-10 12:37:39,707 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=361130.0, ans=0.0 2023-10-10 12:37:51,739 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=361223.3333333333, ans=0.2 2023-10-10 12:37:54,893 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=361223.3333333333, ans=0.125 2023-10-10 12:37:59,105 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.57 vs. 
limit=15.0 2023-10-10 12:38:15,859 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=361316.6666666667, ans=0.1 2023-10-10 12:38:20,813 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=361316.6666666667, ans=0.125 2023-10-10 12:38:23,620 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=361363.3333333333, ans=0.125 2023-10-10 12:38:23,622 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=361363.3333333333, ans=0.09899494936611666 2023-10-10 12:38:52,881 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.445e+02 1.778e+02 1.956e+02 2.221e+02 2.884e+02, threshold=3.912e+02, percent-clipped=0.0 2023-10-10 12:39:10,270 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.36 vs. limit=10.0 2023-10-10 12:39:14,401 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=361596.6666666667, ans=0.0 2023-10-10 12:39:18,774 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=361596.6666666667, ans=0.07 2023-10-10 12:39:36,175 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=361690.0, ans=0.125 2023-10-10 12:39:59,396 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 12:40:03,460 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=361783.3333333333, ans=0.0 2023-10-10 12:40:04,476 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=361783.3333333333, ans=0.125 2023-10-10 12:40:14,596 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.12 vs. 
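Each scaling.py:199 "ScheduledFloat" record reports a hyperparameter whose current value ("ans") is a function of the global batch count, which is how the various dropout, skip-rate and scale_min values drift as training progresses. A minimal sketch of such a piecewise-linear schedule (the breakpoints below are made up for illustration):

```python
# A float hyperparameter scheduled on the global batch count, in the spirit
# of the ScheduledFloat values logged above (name=..., batch_count=..., ans=...).
class ScheduledFloat:
    def __init__(self, *points):
        # points: (batch_count, value) pairs; values are linearly
        # interpolated between breakpoints and held constant outside them.
        self.points = sorted(points)

    def value(self, batch_count: float) -> float:
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        if batch_count >= pts[-1][0]:
            return pts[-1][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)

# e.g. a skip rate that decays from 0.2 to 0.0 over the first 4000 batches:
skip_rate = ScheduledFloat((0.0, 0.2), (4000.0, 0.0))
assert abs(skip_rate.value(2000.0) - 0.1) < 1e-9
```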
limit=15.0 2023-10-10 12:40:18,871 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=361876.6666666667, ans=0.125 2023-10-10 12:40:34,055 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=361923.3333333333, ans=0.1 2023-10-10 12:40:36,540 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.378e+02 1.701e+02 1.935e+02 2.237e+02 4.476e+02, threshold=3.870e+02, percent-clipped=1.0 2023-10-10 12:40:40,939 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=361970.0, ans=0.125 2023-10-10 12:40:54,881 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=362016.6666666667, ans=0.125 2023-10-10 12:41:06,491 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=362063.3333333333, ans=0.2 2023-10-10 12:41:12,724 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=362110.0, ans=0.1 2023-10-10 12:41:14,786 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=362110.0, ans=0.125 2023-10-10 12:41:42,201 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=362203.3333333333, ans=0.95 2023-10-10 12:41:59,772 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=362296.6666666667, ans=0.0 2023-10-10 12:42:16,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=362343.3333333333, ans=0.125 2023-10-10 12:42:27,477 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=362390.0, ans=0.0 2023-10-10 12:42:28,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=362390.0, ans=0.125 2023-10-10 12:42:34,109 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.455e+02 1.818e+02 2.005e+02 2.455e+02 5.432e+02, threshold=4.010e+02, percent-clipped=3.0 2023-10-10 12:42:37,464 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=362436.6666666667, ans=0.125 2023-10-10 12:42:40,500 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.74 vs. limit=15.0 2023-10-10 12:42:54,653 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=362483.3333333333, ans=0.125 2023-10-10 12:42:57,940 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=362530.0, ans=0.125 2023-10-10 12:42:59,670 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=362530.0, ans=0.125 2023-10-10 12:43:22,081 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.39 vs. 
limit=15.0 2023-10-10 12:44:09,309 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=362810.0, ans=0.2 2023-10-10 12:44:28,741 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.374e+02 1.703e+02 1.874e+02 2.099e+02 3.390e+02, threshold=3.748e+02, percent-clipped=0.0 2023-10-10 12:44:36,418 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.22 vs. limit=15.0 2023-10-10 12:44:39,775 INFO [train.py:1031] (0/4) Epoch 6, batch 9500, loss[loss=0.2501, simple_loss=0.335, pruned_loss=0.08254, over 16700.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3089, pruned_loss=0.0699, over 32551400.19 frames. ], batch size: 202, lr: 6.17e-03, grad_scale: 32.0 2023-10-10 12:44:48,078 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.25 vs. limit=15.0 2023-10-10 12:44:58,012 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=362996.6666666667, ans=0.125 2023-10-10 12:45:03,670 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=363043.3333333333, ans=0.125 2023-10-10 12:45:06,020 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=363043.3333333333, ans=0.125 2023-10-10 12:45:17,586 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=363090.0, ans=0.125 2023-10-10 12:45:18,545 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=363090.0, ans=0.0 2023-10-10 12:45:31,012 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=363136.6666666667, ans=0.0 2023-10-10 12:45:34,425 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=363183.3333333333, ans=0.125 2023-10-10 12:45:58,065 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=363276.6666666667, ans=0.0 2023-10-10 12:46:10,454 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=363323.3333333333, ans=0.125 2023-10-10 12:46:15,795 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 1.642e+02 1.802e+02 2.142e+02 3.389e+02, threshold=3.604e+02, percent-clipped=0.0 2023-10-10 12:46:16,278 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=363370.0, ans=0.1 2023-10-10 12:46:23,216 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=363370.0, ans=0.1 2023-10-10 12:46:49,403 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=363463.3333333333, ans=0.09899494936611666 2023-10-10 12:46:58,101 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.36 vs. 
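In the train.py:1031 records, loss[...] describes the current batch while tot_loss[...] is a running aggregate whose frame count ("over N frames") keeps growing across the epoch. One plausible bookkeeping scheme, shown purely as an illustration (the actual tracker may apply periodic resets or decay), is a frames-weighted running average:

```python
class RunningLoss:
    # Frames-weighted running average: feed update() each batch's mean loss
    # and frame count; .value then behaves like a "tot_loss ... over N
    # frames" aggregate whose N grows with every batch.
    def __init__(self):
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, batch_loss: float, num_frames: float) -> None:
        self.loss_sum += batch_loss * num_frames
        self.frames += num_frames

    @property
    def value(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)
```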
limit=15.0 2023-10-10 12:47:24,638 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.02 vs. limit=22.5 2023-10-10 12:47:30,076 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=363650.0, ans=0.2 2023-10-10 12:47:31,804 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.02 vs. limit=22.5 2023-10-10 12:47:36,795 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.13 vs. limit=15.0 2023-10-10 12:47:42,314 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=363696.6666666667, ans=0.125 2023-10-10 12:47:44,082 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=363696.6666666667, ans=0.125 2023-10-10 12:47:48,171 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=363743.3333333333, ans=0.125 2023-10-10 12:47:49,322 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=363743.3333333333, ans=0.125 2023-10-10 12:47:54,901 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.80 vs. limit=6.0 2023-10-10 12:48:08,991 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.726e+02 1.940e+02 2.105e+02 3.124e+02, threshold=3.879e+02, percent-clipped=0.0 2023-10-10 12:48:25,422 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=363883.3333333333, ans=0.2 2023-10-10 12:48:35,042 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=363930.0, ans=0.025 2023-10-10 12:48:35,069 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=363930.0, ans=0.0 2023-10-10 12:48:40,989 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=363930.0, ans=0.0 2023-10-10 12:48:44,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=363976.6666666667, ans=0.015 2023-10-10 12:48:57,248 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=364023.3333333333, ans=0.1 2023-10-10 12:49:17,565 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=364116.6666666667, ans=0.125 2023-10-10 12:49:26,349 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=364163.3333333333, ans=0.07 2023-10-10 12:49:27,024 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=364163.3333333333, ans=10.0 2023-10-10 12:49:27,995 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=364163.3333333333, ans=0.1 2023-10-10 12:49:57,661 INFO 
[optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.415e+02 1.687e+02 1.826e+02 2.048e+02 3.041e+02, threshold=3.652e+02, percent-clipped=0.0 2023-10-10 12:49:59,779 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=364303.3333333333, ans=0.125 2023-10-10 12:50:08,244 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=364350.0, ans=0.0 2023-10-10 12:50:12,304 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=364350.0, ans=0.0 2023-10-10 12:50:16,150 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.60 vs. limit=15.0 2023-10-10 12:50:16,626 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=364350.0, ans=10.0 2023-10-10 12:50:25,968 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=364396.6666666667, ans=0.125 2023-10-10 12:50:27,852 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=364396.6666666667, ans=0.125 2023-10-10 12:50:41,317 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=364490.0, ans=0.125 2023-10-10 12:50:47,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=364490.0, ans=0.0 2023-10-10 12:50:58,480 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=364536.6666666667, ans=0.0 2023-10-10 12:51:02,424 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=364536.6666666667, ans=0.125 2023-10-10 12:51:19,627 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=364630.0, ans=0.125 2023-10-10 12:51:19,657 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=364630.0, ans=0.125 2023-10-10 12:51:22,395 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=364630.0, ans=0.1 2023-10-10 12:51:27,695 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 12:51:34,199 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=364723.3333333333, ans=0.125 2023-10-10 12:51:43,682 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=364723.3333333333, ans=0.05 2023-10-10 12:51:44,825 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=364723.3333333333, ans=0.125 2023-10-10 12:51:47,341 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.405e+02 1.689e+02 1.900e+02 2.177e+02 3.254e+02, threshold=3.800e+02, percent-clipped=0.0 2023-10-10 12:52:08,006 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, 
batch_count=364863.3333333333, ans=0.125 2023-10-10 12:52:19,473 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=364910.0, ans=0.2 2023-10-10 12:52:20,658 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.51 vs. limit=15.0 2023-10-10 12:52:28,613 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.42 vs. limit=15.0 2023-10-10 12:52:29,256 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=364956.6666666667, ans=0.015 2023-10-10 12:52:44,370 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=365003.3333333333, ans=0.125 2023-10-10 12:52:57,783 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=365050.0, ans=0.0 2023-10-10 12:53:19,854 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=365143.3333333333, ans=0.0 2023-10-10 12:53:26,101 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=365190.0, ans=0.1 2023-10-10 12:53:31,100 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.291e+02 1.734e+02 2.013e+02 2.347e+02 3.330e+02, threshold=4.025e+02, percent-clipped=0.0 2023-10-10 12:53:41,157 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=365236.6666666667, ans=0.125 2023-10-10 12:53:42,254 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.58 vs. limit=10.0 2023-10-10 12:53:42,682 INFO [train.py:1031] (0/4) Epoch 6, batch 10000, loss[loss=0.2237, simple_loss=0.3071, pruned_loss=0.07013, over 16844.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.308, pruned_loss=0.06945, over 32646911.19 frames. ], batch size: 188, lr: 6.15e-03, grad_scale: 32.0 2023-10-10 12:53:58,068 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=365330.0, ans=0.2 2023-10-10 12:54:01,526 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.83 vs. limit=6.0 2023-10-10 12:54:01,692 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.80 vs. 
limit=22.5 2023-10-10 12:54:03,283 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=365376.6666666667, ans=0.1 2023-10-10 12:54:09,003 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=365376.6666666667, ans=0.125 2023-10-10 12:55:11,474 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=365656.6666666667, ans=0.125 2023-10-10 12:55:12,478 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=365656.6666666667, ans=0.2 2023-10-10 12:55:20,787 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.742e+02 1.931e+02 2.160e+02 3.075e+02, threshold=3.862e+02, percent-clipped=0.0 2023-10-10 12:55:24,471 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.40 vs. limit=15.0 2023-10-10 12:55:29,573 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.34 vs. limit=22.5 2023-10-10 12:55:59,066 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=365843.3333333333, ans=0.125 2023-10-10 12:56:03,449 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=365843.3333333333, ans=0.1 2023-10-10 12:56:11,268 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=365890.0, ans=0.5 2023-10-10 12:56:15,058 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=365936.6666666667, ans=0.125 2023-10-10 12:56:15,227 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.79 vs. 
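The optim.py:471 lines summarize the recent distribution of global gradient norms (five quantiles from min to max), the clipping threshold derived from that distribution, and the fraction of recent steps that were actually clipped. As a hedged sketch of this kind of adaptive clipping (the exact rule in optim.py may differ), one can keep a window of recent norms and clip at clipping_scale times the running median:

```python
import torch

class GradNormClipper:
    # Tracks a window of recent global grad norms; clips at clipping_scale
    # times the running median and reports quartiles / percent clipped,
    # mirroring the fields in the optim.py log lines above.
    def __init__(self, clipping_scale: float = 2.0, window: int = 500):
        self.clipping_scale = clipping_scale
        self.window = window
        self.norms: list[float] = []
        self.num_clipped = 0
        self.num_steps = 0

    def __call__(self, parameters):
        params = [p for p in parameters if p.grad is not None]
        norm = torch.norm(torch.stack([p.grad.norm() for p in params])).item()
        self.norms = (self.norms + [norm])[-self.window:]
        quartiles = torch.quantile(
            torch.tensor(self.norms), torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0])
        )
        threshold = self.clipping_scale * quartiles[2].item()  # 2x the median
        self.num_steps += 1
        if norm > threshold:
            self.num_clipped += 1
            for p in params:
                p.grad.mul_(threshold / norm)
        return quartiles, threshold, 100.0 * self.num_clipped / self.num_steps
```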
limit=15.0 2023-10-10 12:56:16,338 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=365936.6666666667, ans=0.0 2023-10-10 12:56:38,965 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=366030.0, ans=0.125 2023-10-10 12:57:00,659 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=366123.3333333333, ans=0.0 2023-10-10 12:57:05,685 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=366123.3333333333, ans=0.95 2023-10-10 12:57:06,747 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.762e+02 2.037e+02 2.354e+02 3.380e+02, threshold=4.074e+02, percent-clipped=0.0 2023-10-10 12:57:10,691 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=366170.0, ans=0.1 2023-10-10 12:57:18,696 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=366216.6666666667, ans=0.125 2023-10-10 12:57:28,660 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=366216.6666666667, ans=0.125 2023-10-10 12:57:37,679 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=366263.3333333333, ans=0.0 2023-10-10 12:57:54,491 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=366310.0, ans=0.95 2023-10-10 12:57:55,190 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=366310.0, ans=0.0 2023-10-10 12:58:36,405 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=366496.6666666667, ans=0.1 2023-10-10 12:58:42,020 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.37 vs. limit=10.0 2023-10-10 12:58:53,859 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.95 vs. limit=15.0 2023-10-10 12:59:05,157 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.356e+02 1.657e+02 1.828e+02 2.102e+02 3.735e+02, threshold=3.656e+02, percent-clipped=0.0 2023-10-10 12:59:07,718 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=366636.6666666667, ans=0.04949747468305833 2023-10-10 12:59:22,331 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=366683.3333333333, ans=0.0 2023-10-10 12:59:27,547 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.91 vs. 
limit=15.0 2023-10-10 12:59:41,388 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=366776.6666666667, ans=0.125 2023-10-10 12:59:43,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=366776.6666666667, ans=0.0 2023-10-10 12:59:46,856 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=366776.6666666667, ans=0.125 2023-10-10 12:59:51,336 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=366823.3333333333, ans=0.1 2023-10-10 12:59:54,781 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=366823.3333333333, ans=0.2 2023-10-10 13:00:00,670 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=366823.3333333333, ans=0.0 2023-10-10 13:00:06,269 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 13:00:23,795 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=366916.6666666667, ans=0.0 2023-10-10 13:00:27,273 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=366963.3333333333, ans=0.0 2023-10-10 13:00:43,567 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.71 vs. limit=15.0 2023-10-10 13:00:50,383 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=367056.6666666667, ans=0.2 2023-10-10 13:00:50,573 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.39 vs. 
limit=15.0 2023-10-10 13:00:56,571 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.305e+02 1.637e+02 1.786e+02 1.945e+02 3.199e+02, threshold=3.572e+02, percent-clipped=0.0 2023-10-10 13:00:59,001 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=367103.3333333333, ans=0.0 2023-10-10 13:01:05,319 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=367103.3333333333, ans=0.125 2023-10-10 13:01:23,120 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=367196.6666666667, ans=0.0 2023-10-10 13:01:29,394 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=367196.6666666667, ans=0.125 2023-10-10 13:01:31,048 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=367196.6666666667, ans=0.125 2023-10-10 13:01:52,925 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=367290.0, ans=0.125 2023-10-10 13:02:01,875 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=367336.6666666667, ans=0.2 2023-10-10 13:02:13,400 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=367383.3333333333, ans=0.125 2023-10-10 13:02:15,557 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.20 vs. limit=12.0 2023-10-10 13:02:27,562 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.44 vs. limit=15.0 2023-10-10 13:02:37,992 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=367476.6666666667, ans=0.0 2023-10-10 13:02:44,427 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=367523.3333333333, ans=0.125 2023-10-10 13:02:48,089 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.82 vs. limit=15.0 2023-10-10 13:02:49,687 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.47 vs. limit=15.0 2023-10-10 13:02:51,647 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.398e+02 1.737e+02 1.918e+02 2.197e+02 3.372e+02, threshold=3.836e+02, percent-clipped=0.0 2023-10-10 13:02:59,876 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=367570.0, ans=0.2 2023-10-10 13:03:01,422 INFO [train.py:1031] (0/4) Epoch 6, batch 10500, loss[loss=0.2124, simple_loss=0.302, pruned_loss=0.06139, over 16826.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.3081, pruned_loss=0.06948, over 32677668.45 frames. 
], batch size: 188, lr: 6.14e-03, grad_scale: 32.0 2023-10-10 13:03:09,790 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=367616.6666666667, ans=0.125 2023-10-10 13:03:16,360 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=367663.3333333333, ans=0.125 2023-10-10 13:03:18,252 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=367663.3333333333, ans=0.0 2023-10-10 13:03:19,023 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=367663.3333333333, ans=0.0 2023-10-10 13:03:22,958 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.12 vs. limit=15.0 2023-10-10 13:03:47,178 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=367803.3333333333, ans=0.0 2023-10-10 13:03:54,813 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.08 vs. limit=22.5 2023-10-10 13:03:55,697 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=367850.0, ans=0.2 2023-10-10 13:04:00,723 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=367850.0, ans=0.125 2023-10-10 13:04:03,902 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=367896.6666666667, ans=0.1 2023-10-10 13:04:14,811 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=367943.3333333333, ans=0.2 2023-10-10 13:04:29,053 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.82 vs. limit=15.0 2023-10-10 13:04:42,867 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.728e+02 1.994e+02 2.242e+02 4.006e+02, threshold=3.989e+02, percent-clipped=1.0 2023-10-10 13:04:46,177 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=368036.6666666667, ans=0.125 2023-10-10 13:05:02,817 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=368083.3333333333, ans=0.125 2023-10-10 13:05:08,497 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.40 vs. limit=6.0 2023-10-10 13:05:16,321 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=368176.6666666667, ans=0.1 2023-10-10 13:05:18,861 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=368176.6666666667, ans=0.0 2023-10-10 13:05:31,386 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.34 vs. limit=15.0 2023-10-10 13:05:51,755 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.00 vs. 
limit=15.0 2023-10-10 13:06:16,266 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=368410.0, ans=0.0 2023-10-10 13:06:34,372 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=368456.6666666667, ans=0.95 2023-10-10 13:06:36,025 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.414e+02 1.731e+02 1.969e+02 2.237e+02 3.458e+02, threshold=3.939e+02, percent-clipped=0.0 2023-10-10 13:06:41,198 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 13:07:40,744 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=368783.3333333333, ans=0.2 2023-10-10 13:07:50,282 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=368830.0, ans=0.125 2023-10-10 13:08:24,391 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.570e+02 1.886e+02 2.129e+02 2.476e+02 3.785e+02, threshold=4.257e+02, percent-clipped=0.0 2023-10-10 13:08:36,662 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=369016.6666666667, ans=0.2 2023-10-10 13:08:44,801 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 13:08:56,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=369110.0, ans=0.125 2023-10-10 13:08:59,539 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=369110.0, ans=0.0 2023-10-10 13:09:07,851 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=369156.6666666667, ans=0.125 2023-10-10 13:09:11,559 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.97 vs. limit=10.0 2023-10-10 13:09:37,854 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=369296.6666666667, ans=0.125 2023-10-10 13:09:50,074 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=369343.3333333333, ans=0.125 2023-10-10 13:09:55,910 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=369343.3333333333, ans=0.0 2023-10-10 13:09:58,003 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=369343.3333333333, ans=0.0 2023-10-10 13:10:15,364 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.217e+02 1.749e+02 1.914e+02 2.119e+02 3.101e+02, threshold=3.827e+02, percent-clipped=0.0 2023-10-10 13:10:18,147 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=369436.6666666667, ans=0.125 2023-10-10 13:10:33,102 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.22 vs. 
limit=22.5 2023-10-10 13:10:39,832 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=369530.0, ans=0.125 2023-10-10 13:11:05,046 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=369623.3333333333, ans=0.125 2023-10-10 13:11:09,762 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.74 vs. limit=15.0 2023-10-10 13:11:18,688 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=369716.6666666667, ans=0.0 2023-10-10 13:11:26,087 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.86 vs. limit=22.5 2023-10-10 13:11:27,054 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=369716.6666666667, ans=0.2 2023-10-10 13:11:37,024 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=369763.3333333333, ans=0.125 2023-10-10 13:11:44,814 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=369810.0, ans=0.1 2023-10-10 13:11:45,276 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.52 vs. limit=15.0 2023-10-10 13:12:02,775 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.250e+02 1.716e+02 2.021e+02 2.413e+02 3.624e+02, threshold=4.042e+02, percent-clipped=0.0 2023-10-10 13:12:07,677 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=369903.3333333333, ans=0.0 2023-10-10 13:12:11,167 INFO [train.py:1031] (0/4) Epoch 6, batch 11000, loss[loss=0.2473, simple_loss=0.3351, pruned_loss=0.07973, over 16582.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.308, pruned_loss=0.06946, over 32700704.01 frames. ], batch size: 241, lr: 6.12e-03, grad_scale: 16.0 2023-10-10 13:12:11,761 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.30 vs. limit=12.0 2023-10-10 13:12:17,765 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.56 vs. limit=15.0 2023-10-10 13:12:21,712 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=369950.0, ans=0.0 2023-10-10 13:12:25,136 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=369996.6666666667, ans=0.1 2023-10-10 13:12:26,057 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=369996.6666666667, ans=0.125 2023-10-10 13:12:29,222 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.76 vs. 
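The grad_scale field in the train.py batch records (32.0 above, dropping to 16.0 by batch 11000) is the current loss scale of fp16 mixed-precision training; it is raised and lowered dynamically as gradient overflows come and go. A standard PyTorch AMP step with a dynamic scaler looks roughly like this (model, optimizer and loss_fn are placeholders, not the script's actual objects):

```python
import torch

scaler = torch.cuda.amp.GradScaler()  # maintains the dynamic loss scale

def train_step(model, optimizer, features, targets, loss_fn):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():           # forward in reduced precision
        loss = loss_fn(model(features), targets)
    scaler.scale(loss).backward()             # backprop on the scaled loss
    scaler.step(optimizer)                    # unscales; skips step on inf/nan
    scaler.update()                           # grow/shrink the scale adaptively
    return loss.detach(), scaler.get_scale()  # get_scale() is the logged value
```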
limit=15.0 2023-10-10 13:12:39,560 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=370043.3333333333, ans=0.125 2023-10-10 13:12:48,944 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=370090.0, ans=0.125 2023-10-10 13:13:06,447 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.53 vs. limit=10.0 2023-10-10 13:13:19,989 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=370230.0, ans=0.125 2023-10-10 13:13:36,450 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=370276.6666666667, ans=0.125 2023-10-10 13:13:37,408 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=370276.6666666667, ans=0.95 2023-10-10 13:13:45,292 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=370323.3333333333, ans=0.2 2023-10-10 13:13:46,297 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=370323.3333333333, ans=0.0 2023-10-10 13:13:52,476 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.716e+02 1.912e+02 2.182e+02 2.801e+02, threshold=3.825e+02, percent-clipped=0.0 2023-10-10 13:13:59,428 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=370370.0, ans=0.125 2023-10-10 13:13:59,565 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=370370.0, ans=0.0 2023-10-10 13:13:59,918 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.80 vs. limit=6.0 2023-10-10 13:14:12,642 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=370416.6666666667, ans=0.125 2023-10-10 13:14:19,812 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.77 vs. limit=15.0 2023-10-10 13:14:21,607 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.42 vs. limit=22.5 2023-10-10 13:14:29,213 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=370510.0, ans=0.0 2023-10-10 13:14:34,865 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=370510.0, ans=0.07 2023-10-10 13:14:35,035 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.59 vs. 
limit=22.5 2023-10-10 13:14:35,795 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=370510.0, ans=0.0 2023-10-10 13:14:35,875 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=370510.0, ans=0.0 2023-10-10 13:14:57,568 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=370603.3333333333, ans=0.025 2023-10-10 13:15:11,415 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.28 vs. limit=15.0 2023-10-10 13:15:20,116 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=370696.6666666667, ans=0.0 2023-10-10 13:15:22,102 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=370696.6666666667, ans=0.0 2023-10-10 13:15:22,977 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=370696.6666666667, ans=0.125 2023-10-10 13:15:47,626 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.312e+02 1.640e+02 1.802e+02 2.034e+02 2.739e+02, threshold=3.604e+02, percent-clipped=0.0 2023-10-10 13:15:48,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=370836.6666666667, ans=0.2 2023-10-10 13:15:48,234 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.21 vs. limit=15.0 2023-10-10 13:16:02,555 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=370883.3333333333, ans=0.0 2023-10-10 13:16:04,854 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.30 vs. limit=15.0 2023-10-10 13:16:05,529 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=370883.3333333333, ans=0.2 2023-10-10 13:16:11,196 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=370930.0, ans=0.0 2023-10-10 13:16:17,210 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.62 vs. 
limit=15.0 2023-10-10 13:16:38,708 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=371023.3333333333, ans=0.1 2023-10-10 13:16:58,858 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=371116.6666666667, ans=0.0 2023-10-10 13:17:03,974 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=371163.3333333333, ans=0.125 2023-10-10 13:17:39,429 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=371303.3333333333, ans=0.125 2023-10-10 13:17:41,190 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.247e+02 1.807e+02 1.999e+02 2.343e+02 3.683e+02, threshold=3.997e+02, percent-clipped=1.0 2023-10-10 13:17:52,017 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=371350.0, ans=0.125 2023-10-10 13:17:55,470 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=371350.0, ans=0.125 2023-10-10 13:17:55,581 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_ff3.min_abs, batch_count=371350.0, ans=0.2 2023-10-10 13:17:55,603 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=371350.0, ans=0.125 2023-10-10 13:18:03,807 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=371396.6666666667, ans=0.125 2023-10-10 13:18:32,931 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=371490.0, ans=0.125 2023-10-10 13:18:35,259 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.91 vs. limit=10.0 2023-10-10 13:18:45,110 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=371536.6666666667, ans=0.125 2023-10-10 13:18:59,787 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.67 vs. limit=15.0 2023-10-10 13:19:29,807 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=371770.0, ans=0.125 2023-10-10 13:19:31,976 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.386e+02 1.653e+02 1.869e+02 2.128e+02 3.404e+02, threshold=3.739e+02, percent-clipped=0.0 2023-10-10 13:19:34,571 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=371770.0, ans=0.1 2023-10-10 13:19:45,099 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=371816.6666666667, ans=0.07 2023-10-10 13:20:35,844 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=372003.3333333333, ans=0.0 2023-10-10 13:20:46,542 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.57 vs. 
limit=12.0 2023-10-10 13:20:57,005 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=372096.6666666667, ans=0.1 2023-10-10 13:21:16,007 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=372190.0, ans=0.125 2023-10-10 13:21:20,850 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.73 vs. limit=15.0 2023-10-10 13:21:22,345 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.410e+02 1.854e+02 2.045e+02 2.458e+02 3.775e+02, threshold=4.090e+02, percent-clipped=1.0 2023-10-10 13:21:22,677 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=372236.6666666667, ans=0.0 2023-10-10 13:21:24,491 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 13:21:28,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=372236.6666666667, ans=0.1 2023-10-10 13:21:32,046 INFO [train.py:1031] (0/4) Epoch 6, batch 11500, loss[loss=0.2386, simple_loss=0.3294, pruned_loss=0.07393, over 16838.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3077, pruned_loss=0.06936, over 32743680.52 frames. ], batch size: 146, lr: 6.10e-03, grad_scale: 32.0 2023-10-10 13:21:34,237 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.30 vs. limit=15.0 2023-10-10 13:22:31,241 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=372516.6666666667, ans=0.0 2023-10-10 13:22:31,322 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=372516.6666666667, ans=0.0 2023-10-10 13:22:42,052 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=372563.3333333333, ans=0.125 2023-10-10 13:23:13,564 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=372656.6666666667, ans=0.2 2023-10-10 13:23:16,217 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.480e+02 1.663e+02 1.941e+02 2.216e+02 3.108e+02, threshold=3.883e+02, percent-clipped=0.0 2023-10-10 13:23:37,691 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=372750.0, ans=0.125 2023-10-10 13:23:37,727 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=372750.0, ans=0.2 2023-10-10 13:23:45,501 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.39 vs. limit=15.0 2023-10-10 13:23:58,448 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.37 vs. 
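In the optim.py:471 records, the five grad-norm quartiles summarize recently observed gradient norms, and the reported threshold is consistently Clipping_scale times the middle value (for example, 2.0 * 2.045e+02 = 4.090e+02 in the record above); percent-clipped counts how often gradients exceeded it. A sketch of that bookkeeping, assuming the five numbers are the min, 25th, 50th, 75th percentiles and max of a sliding window (the exact percentiles and the helper name are assumptions):

    import torch

    def clipping_stats(recent_norms: torch.Tensor, clipping_scale: float = 2.0):
        # recent_norms: 1-D tensor of gradient norms from recent batches.
        # Hypothetical helper; only the threshold = scale * median relation
        # is taken from the log itself.
        qs = torch.quantile(recent_norms,
                            torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = clipping_scale * qs[2]  # Clipping_scale times the median
        percent_clipped = 100.0 * (recent_norms > threshold).float().mean()
        return qs, threshold, percent_clipped

Tying the threshold to a running median makes clipping adaptive: it only fires on gradients that are large relative to recent history, which is why percent-clipped stays near zero in these records.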
limit=15.0 2023-10-10 13:24:43,825 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=373030.0, ans=15.0 2023-10-10 13:25:05,692 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.264e+02 1.595e+02 1.777e+02 2.084e+02 3.083e+02, threshold=3.553e+02, percent-clipped=0.0 2023-10-10 13:25:21,496 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=373216.6666666667, ans=0.125 2023-10-10 13:25:26,424 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.18 vs. limit=10.0 2023-10-10 13:25:28,825 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 13:25:35,461 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=373310.0, ans=10.0 2023-10-10 13:25:36,414 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=373310.0, ans=0.0 2023-10-10 13:25:40,161 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-80000.pt 2023-10-10 13:25:52,031 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.31 vs. limit=10.0 2023-10-10 13:25:52,767 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=373356.6666666667, ans=0.0 2023-10-10 13:26:14,597 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=373450.0, ans=0.125 2023-10-10 13:26:37,343 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=373496.6666666667, ans=0.0 2023-10-10 13:27:08,273 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.664e+02 1.787e+02 1.993e+02 3.081e+02, threshold=3.575e+02, percent-clipped=0.0 2023-10-10 13:27:08,884 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.67 vs. limit=15.0 2023-10-10 13:27:19,972 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.75 vs. limit=6.0 2023-10-10 13:27:22,335 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=373683.3333333333, ans=0.0 2023-10-10 13:27:36,828 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=373730.0, ans=0.2 2023-10-10 13:27:40,020 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=373776.6666666667, ans=0.025 2023-10-10 13:27:43,569 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.52 vs. 
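The checkpoint.py:75 line shows a batch-indexed snapshot (checkpoint-80000.pt) written mid-epoch, in addition to per-epoch files such as epoch-6.pt saved further down. A minimal sketch of a save-every-N-batches trigger, with hypothetical names and only the state that matters for resuming:

    from pathlib import Path
    import torch

    def maybe_save_checkpoint(model, optimizer, batch_idx_train: int,
                              save_every_n: int, exp_dir: Path) -> None:
        # Hypothetical trigger: write a resumable snapshot every
        # save_every_n training batches, named by the global batch index.
        if batch_idx_train == 0 or batch_idx_train % save_every_n != 0:
            return
        ckpt = {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "batch_idx_train": batch_idx_train,
        }
        torch.save(ckpt, exp_dir / f"checkpoint-{batch_idx_train}.pt")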
limit=15.0 2023-10-10 13:28:08,587 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=373870.0, ans=0.1 2023-10-10 13:28:20,776 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=373916.6666666667, ans=0.05 2023-10-10 13:28:53,714 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0 2023-10-10 13:29:02,568 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=374056.6666666667, ans=0.0 2023-10-10 13:29:06,553 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.451e+02 1.777e+02 2.003e+02 2.440e+02 3.027e+02, threshold=4.006e+02, percent-clipped=0.0 2023-10-10 13:29:14,014 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=374103.3333333333, ans=0.0 2023-10-10 13:29:16,526 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=374150.0, ans=0.09899494936611666 2023-10-10 13:29:30,609 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=374196.6666666667, ans=0.125 2023-10-10 13:30:29,092 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=374430.0, ans=0.5 2023-10-10 13:30:31,782 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=374476.6666666667, ans=0.125 2023-10-10 13:30:32,956 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.03 vs. limit=10.0 2023-10-10 13:30:40,742 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.82 vs. limit=22.5 2023-10-10 13:30:47,839 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.01 vs. limit=6.0 2023-10-10 13:30:49,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=374523.3333333333, ans=0.125 2023-10-10 13:30:50,394 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=374523.3333333333, ans=0.125 2023-10-10 13:30:51,283 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=374523.3333333333, ans=0.0 2023-10-10 13:30:54,827 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.299e+02 1.656e+02 1.858e+02 2.088e+02 2.828e+02, threshold=3.716e+02, percent-clipped=0.0 2023-10-10 13:31:04,977 INFO [train.py:1031] (0/4) Epoch 6, batch 12000, loss[loss=0.2265, simple_loss=0.3069, pruned_loss=0.073, over 16381.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.3075, pruned_loss=0.06905, over 32761872.27 frames. 
], batch size: 50, lr: 6.08e-03, grad_scale: 32.0 2023-10-10 13:31:17,222 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=374663.3333333333, ans=0.125 2023-10-10 13:31:21,068 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=374663.3333333333, ans=0.04949747468305833 2023-10-10 13:31:36,840 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=374756.6666666667, ans=0.0 2023-10-10 13:31:41,509 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.50 vs. limit=15.0 2023-10-10 13:32:04,018 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=374850.0, ans=0.0 2023-10-10 13:32:17,948 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=374896.6666666667, ans=0.2 2023-10-10 13:32:19,154 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=374896.6666666667, ans=0.125 2023-10-10 13:32:42,870 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=375036.6666666667, ans=0.125 2023-10-10 13:32:45,043 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.507e+02 1.758e+02 2.001e+02 2.500e+02 3.638e+02, threshold=4.002e+02, percent-clipped=0.0 2023-10-10 13:32:50,805 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.86 vs. limit=10.0 2023-10-10 13:32:56,781 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=375083.3333333333, ans=0.125 2023-10-10 13:33:00,362 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.93 vs. limit=15.0 2023-10-10 13:33:05,871 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=375130.0, ans=0.125 2023-10-10 13:33:07,894 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.73 vs. limit=15.0 2023-10-10 13:33:10,368 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=375130.0, ans=0.0 2023-10-10 13:33:12,163 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=375130.0, ans=0.2 2023-10-10 13:33:12,510 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.23 vs. 
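In the train.py:1031 records, loss[...] describes the current batch while tot_loss[...] is a frame-weighted running aggregate, which is why its frame total keeps growing (32743680.52 frames at batch 11500 versus 32761872.27 at batch 12000). A sketch of that aggregation, assuming simple frame-weighted averaging (icefall keeps its own tracker; this is only an illustrative reduction):

    class FrameWeightedMetrics:
        # Illustrative running aggregate; names are hypothetical.
        def __init__(self):
            self.frames = 0.0
            self.sums = {}  # metric name -> frame-weighted sum

        def update(self, batch_metrics: dict, num_frames: float) -> None:
            self.frames += num_frames
            for name, value in batch_metrics.items():
                self.sums[name] = self.sums.get(name, 0.0) + value * num_frames

        def averages(self) -> dict:
            return {name: s / self.frames for name, s in self.sums.items()}

    tot = FrameWeightedMetrics()
    tot.update({"loss": 0.2386, "simple_loss": 0.3294}, num_frames=16838.0)
    # tot.averages() now reports per-frame averages over everything seen so far.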
limit=12.0 2023-10-10 13:33:13,104 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=375176.6666666667, ans=0.0 2023-10-10 13:33:18,426 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=375176.6666666667, ans=22.5 2023-10-10 13:33:20,122 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=375176.6666666667, ans=0.1 2023-10-10 13:33:29,643 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 13:33:30,609 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=375223.3333333333, ans=0.025 2023-10-10 13:33:37,712 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.23 vs. limit=22.5 2023-10-10 13:33:44,079 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=375316.6666666667, ans=0.125 2023-10-10 13:33:51,725 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.69 vs. limit=10.0 2023-10-10 13:33:59,052 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=375363.3333333333, ans=0.1 2023-10-10 13:33:59,059 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=375363.3333333333, ans=0.1 2023-10-10 13:34:00,919 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=375363.3333333333, ans=0.125 2023-10-10 13:34:09,071 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.05 vs. limit=15.0 2023-10-10 13:34:15,917 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=375456.6666666667, ans=0.0 2023-10-10 13:34:28,838 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.736e+02 2.041e+02 2.376e+02 4.195e+02, threshold=4.081e+02, percent-clipped=1.0 2023-10-10 13:34:30,117 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=375503.3333333333, ans=0.125 2023-10-10 13:35:08,402 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.75 vs. limit=12.0 2023-10-10 13:35:10,281 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.61 vs. limit=15.0 2023-10-10 13:35:10,299 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.19 vs. limit=22.5 2023-10-10 13:35:11,855 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=375690.0, ans=0.0 2023-10-10 13:35:16,673 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=19.15 vs. 
limit=22.5 2023-10-10 13:35:18,058 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=375690.0, ans=0.125 2023-10-10 13:35:21,230 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.85 vs. limit=15.0 2023-10-10 13:35:21,893 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=375736.6666666667, ans=0.125 2023-10-10 13:35:48,741 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=375830.0, ans=0.1 2023-10-10 13:35:57,532 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=375876.6666666667, ans=0.07 2023-10-10 13:36:18,356 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.375e+02 1.869e+02 2.159e+02 2.474e+02 4.078e+02, threshold=4.318e+02, percent-clipped=0.0 2023-10-10 13:36:32,025 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=376016.6666666667, ans=0.125 2023-10-10 13:36:54,364 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.45 vs. limit=6.0 2023-10-10 13:37:32,324 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=376296.6666666667, ans=0.125 2023-10-10 13:37:35,380 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.63 vs. 
limit=22.5 2023-10-10 13:37:41,243 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=376296.6666666667, ans=0.0 2023-10-10 13:37:54,446 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=376343.3333333333, ans=0.0 2023-10-10 13:38:03,648 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=376390.0, ans=0.0 2023-10-10 13:38:09,999 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.404e+02 1.718e+02 2.011e+02 2.395e+02 3.700e+02, threshold=4.023e+02, percent-clipped=0.0 2023-10-10 13:38:13,828 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=376436.6666666667, ans=0.0 2023-10-10 13:38:27,337 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=376483.3333333333, ans=0.07 2023-10-10 13:38:36,368 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=376530.0, ans=0.0 2023-10-10 13:38:37,439 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=376530.0, ans=0.0 2023-10-10 13:38:47,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=376576.6666666667, ans=0.1 2023-10-10 13:38:55,163 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=376623.3333333333, ans=0.125 2023-10-10 13:38:59,803 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=376623.3333333333, ans=0.0 2023-10-10 13:39:02,186 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=376670.0, ans=0.1 2023-10-10 13:39:07,274 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.25 vs. limit=15.0 2023-10-10 13:39:21,721 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=376716.6666666667, ans=0.125 2023-10-10 13:39:39,942 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.82 vs. limit=10.0 2023-10-10 13:39:42,096 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.40 vs. limit=12.0 2023-10-10 13:40:00,483 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.372e+02 1.772e+02 1.997e+02 2.390e+02 3.375e+02, threshold=3.993e+02, percent-clipped=0.0 2023-10-10 13:40:02,640 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=376903.3333333333, ans=0.0 2023-10-10 13:40:06,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=376903.3333333333, ans=0.2 2023-10-10 13:40:08,600 INFO [train.py:1031] (0/4) Epoch 6, batch 12500, loss[loss=0.2181, simple_loss=0.3012, pruned_loss=0.06753, over 16635.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.3071, pruned_loss=0.06908, over 32770104.76 frames. 
], batch size: 56, lr: 6.06e-03, grad_scale: 32.0 2023-10-10 13:40:19,843 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=376996.6666666667, ans=22.5 2023-10-10 13:40:20,303 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=376996.6666666667, ans=0.125 2023-10-10 13:40:31,697 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=377043.3333333333, ans=0.0 2023-10-10 13:40:39,961 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.75 vs. limit=15.0 2023-10-10 13:40:59,541 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=377183.3333333333, ans=0.0 2023-10-10 13:41:05,054 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=377183.3333333333, ans=0.125 2023-10-10 13:41:44,031 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.399e+02 1.663e+02 1.795e+02 2.126e+02 3.091e+02, threshold=3.590e+02, percent-clipped=0.0 2023-10-10 13:41:48,169 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=377370.0, ans=0.5 2023-10-10 13:41:52,157 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.83 vs. limit=15.0 2023-10-10 13:41:53,962 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.33 vs. limit=6.0 2023-10-10 13:41:59,135 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=377416.6666666667, ans=0.05 2023-10-10 13:42:15,264 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=377510.0, ans=0.2 2023-10-10 13:42:18,374 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1.whitening_limit, batch_count=377510.0, ans=10.0 2023-10-10 13:42:53,575 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=377650.0, ans=0.5 2023-10-10 13:42:54,566 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=377650.0, ans=0.125 2023-10-10 13:43:01,249 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=377696.6666666667, ans=0.1 2023-10-10 13:43:06,816 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.91 vs. 
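The learning rate in these records decays smoothly with the batch index (6.10e-03 at batch 11500, 6.08e-03 at 12000, 6.06e-03 at 12500) and takes a larger step down at each epoch boundary. An Eden-style schedule has this shape; the sketch below assumes the usual inverse-fourth-root factors, which is an assumption rather than a quote of icefall's optim.py:

    def eden_lr(base_lr: float, batch: int, epoch: float,
                lr_batches: float, lr_epochs: float) -> float:
        # Eden-style decay: smooth in the batch index within an epoch, with
        # an extra drop as epochs accumulate. Exponents are an assumption.
        batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
        epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
        return base_lr * batch_factor * epoch_factor

At batch counts this large the batch factor changes very slowly within an epoch, matching the gentle 6.10e-03 to 6.06e-03 drift seen above.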
limit=15.0 2023-10-10 13:43:14,250 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=377743.3333333333, ans=0.125 2023-10-10 13:43:34,322 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.222e+02 1.681e+02 2.004e+02 2.331e+02 3.544e+02, threshold=4.009e+02, percent-clipped=0.0 2023-10-10 13:43:48,899 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.97 vs. limit=22.5 2023-10-10 13:44:05,474 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=377976.6666666667, ans=0.125 2023-10-10 13:44:10,979 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=377976.6666666667, ans=0.2 2023-10-10 13:44:40,746 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten.whitening_limit, batch_count=378116.6666666667, ans=15.0 2023-10-10 13:44:57,887 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=378210.0, ans=0.09899494936611666 2023-10-10 13:44:59,189 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=378210.0, ans=0.0 2023-10-10 13:45:05,531 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.23 vs. limit=22.5 2023-10-10 13:45:06,468 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.06 vs. 
limit=22.5 2023-10-10 13:45:13,928 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=378256.6666666667, ans=0.125 2023-10-10 13:45:19,914 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.349e+02 1.759e+02 1.949e+02 2.206e+02 3.201e+02, threshold=3.899e+02, percent-clipped=0.0 2023-10-10 13:45:26,855 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=378303.3333333333, ans=0.04949747468305833 2023-10-10 13:45:31,593 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=378350.0, ans=0.125 2023-10-10 13:45:36,397 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=378350.0, ans=0.1 2023-10-10 13:45:37,101 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=378350.0, ans=0.125 2023-10-10 13:45:47,981 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=378396.6666666667, ans=0.2 2023-10-10 13:45:52,220 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=378443.3333333333, ans=0.1 2023-10-10 13:46:14,119 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=378536.6666666667, ans=0.0 2023-10-10 13:46:17,632 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=378536.6666666667, ans=0.1 2023-10-10 13:46:18,675 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=378536.6666666667, ans=0.125 2023-10-10 13:46:22,473 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.15 vs. 
limit=15.0 2023-10-10 13:46:42,614 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=378630.0, ans=0.1 2023-10-10 13:46:57,319 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=378723.3333333333, ans=0.1 2023-10-10 13:47:13,159 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.386e+02 1.735e+02 1.991e+02 2.302e+02 3.344e+02, threshold=3.982e+02, percent-clipped=0.0 2023-10-10 13:47:15,912 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=378770.0, ans=0.0 2023-10-10 13:47:23,014 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=378816.6666666667, ans=0.2 2023-10-10 13:47:24,002 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=378816.6666666667, ans=0.125 2023-10-10 13:47:48,357 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=378910.0, ans=0.125 2023-10-10 13:47:49,374 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=378910.0, ans=0.1 2023-10-10 13:47:57,995 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=378956.6666666667, ans=0.5 2023-10-10 13:48:15,299 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=379050.0, ans=0.125 2023-10-10 13:48:30,499 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=379096.6666666667, ans=0.0 2023-10-10 13:48:58,404 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.85 vs. limit=22.5 2023-10-10 13:48:58,815 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.452e+02 1.715e+02 1.925e+02 2.276e+02 3.482e+02, threshold=3.851e+02, percent-clipped=0.0 2023-10-10 13:49:06,800 INFO [train.py:1031] (0/4) Epoch 6, batch 13000, loss[loss=0.2254, simple_loss=0.3088, pruned_loss=0.07101, over 17038.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3077, pruned_loss=0.06911, over 32779045.17 frames. ], batch size: 117, lr: 6.04e-03, grad_scale: 16.0 2023-10-10 13:49:32,142 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=379376.6666666667, ans=0.125 2023-10-10 13:49:36,670 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=379376.6666666667, ans=0.125 2023-10-10 13:49:43,161 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.43 vs. 
limit=15.0 2023-10-10 13:49:56,556 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=379470.0, ans=0.0 2023-10-10 13:49:57,481 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=379470.0, ans=0.2 2023-10-10 13:49:58,354 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=379470.0, ans=0.04949747468305833 2023-10-10 13:49:58,606 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.81 vs. limit=15.0 2023-10-10 13:50:30,491 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=379610.0, ans=0.2 2023-10-10 13:50:35,156 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=379610.0, ans=0.2 2023-10-10 13:50:48,077 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=379656.6666666667, ans=0.125 2023-10-10 13:50:57,661 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.339e+02 1.668e+02 1.952e+02 2.206e+02 2.975e+02, threshold=3.904e+02, percent-clipped=0.0 2023-10-10 13:51:08,930 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=379750.0, ans=0.0 2023-10-10 13:51:20,461 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=379796.6666666667, ans=0.125 2023-10-10 13:51:45,323 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.59 vs. 
limit=15.0 2023-10-10 13:51:53,978 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 13:52:02,403 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=379983.3333333333, ans=0.125 2023-10-10 13:52:11,883 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=380030.0, ans=0.125 2023-10-10 13:52:19,530 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=380030.0, ans=0.1 2023-10-10 13:52:24,840 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-10 13:52:45,059 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=380123.3333333333, ans=0.05 2023-10-10 13:52:46,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=380123.3333333333, ans=0.0 2023-10-10 13:52:52,022 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.320e+02 1.899e+02 2.307e+02 2.684e+02 3.932e+02, threshold=4.614e+02, percent-clipped=1.0 2023-10-10 13:53:05,714 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=380216.6666666667, ans=0.0 2023-10-10 13:53:12,997 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=380263.3333333333, ans=0.0 2023-10-10 13:53:13,105 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=380263.3333333333, ans=0.1 2023-10-10 13:53:26,912 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=380310.0, ans=0.125 2023-10-10 13:53:38,781 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=380356.6666666667, ans=0.125 2023-10-10 13:53:40,786 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=380403.3333333333, ans=0.2 2023-10-10 13:53:57,357 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.28 vs. limit=6.0 2023-10-10 13:53:59,942 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.99 vs. 
limit=10.0 2023-10-10 13:54:03,408 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=380496.6666666667, ans=0.125 2023-10-10 13:54:06,352 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=380496.6666666667, ans=0.125 2023-10-10 13:54:17,767 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=380543.3333333333, ans=0.04949747468305833 2023-10-10 13:54:20,371 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=380543.3333333333, ans=0.0 2023-10-10 13:54:39,097 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.278e+02 1.693e+02 1.907e+02 2.184e+02 2.766e+02, threshold=3.815e+02, percent-clipped=0.0 2023-10-10 13:54:39,660 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=380636.6666666667, ans=0.125 2023-10-10 13:54:46,138 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.83 vs. limit=10.0 2023-10-10 13:54:53,883 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.45 vs. limit=15.0 2023-10-10 13:54:54,490 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=380683.3333333333, ans=0.125 2023-10-10 13:55:03,646 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=380730.0, ans=0.5 2023-10-10 13:55:08,795 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=380776.6666666667, ans=0.125 2023-10-10 13:55:25,133 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=380823.3333333333, ans=0.0 2023-10-10 13:55:35,286 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.28 vs. limit=15.0 2023-10-10 13:55:55,200 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.90 vs. limit=15.0 2023-10-10 13:55:58,763 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=381010.0, ans=0.035 2023-10-10 13:56:05,711 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=381010.0, ans=0.125 2023-10-10 13:56:10,408 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=381010.0, ans=0.125 2023-10-10 13:56:14,006 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.90 vs. 
limit=22.5 2023-10-10 13:56:24,582 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.813e+02 2.019e+02 2.311e+02 3.291e+02, threshold=4.039e+02, percent-clipped=0.0 2023-10-10 13:56:44,512 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=381196.6666666667, ans=0.125 2023-10-10 13:56:58,757 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.96 vs. limit=15.0 2023-10-10 13:56:59,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=381243.3333333333, ans=0.125 2023-10-10 13:57:01,643 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=381243.3333333333, ans=0.1 2023-10-10 13:57:05,419 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=381243.3333333333, ans=10.0 2023-10-10 13:57:06,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=381290.0, ans=0.0 2023-10-10 13:57:06,358 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=381290.0, ans=0.2 2023-10-10 13:57:16,305 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=381336.6666666667, ans=0.2 2023-10-10 13:57:20,341 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=381336.6666666667, ans=0.125 2023-10-10 13:57:47,959 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.62 vs. limit=15.0 2023-10-10 13:57:51,674 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=381476.6666666667, ans=0.125 2023-10-10 13:57:58,835 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=381523.3333333333, ans=0.1 2023-10-10 13:58:14,965 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.379e+02 1.676e+02 1.841e+02 2.050e+02 2.878e+02, threshold=3.683e+02, percent-clipped=0.0 2023-10-10 13:58:19,035 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=381570.0, ans=0.125 2023-10-10 13:58:20,594 INFO [train.py:1031] (0/4) Epoch 6, batch 13500, loss[loss=0.2084, simple_loss=0.3065, pruned_loss=0.05517, over 16855.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.3069, pruned_loss=0.06877, over 32818788.30 frames. ], batch size: 98, lr: 6.02e-03, grad_scale: 16.0 2023-10-10 13:58:28,051 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=381616.6666666667, ans=0.0 2023-10-10 13:58:37,197 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.94 vs. 
limit=15.0 2023-10-10 13:59:04,886 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=381803.3333333333, ans=0.125 2023-10-10 13:59:12,684 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.99 vs. limit=22.5 2023-10-10 13:59:23,417 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.87 vs. limit=6.0 2023-10-10 13:59:27,157 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=381896.6666666667, ans=0.125 2023-10-10 13:59:35,176 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=381943.3333333333, ans=0.0 2023-10-10 13:59:36,058 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=381943.3333333333, ans=0.125 2023-10-10 13:59:36,127 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=381943.3333333333, ans=0.1 2023-10-10 13:59:42,781 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=381943.3333333333, ans=0.0 2023-10-10 13:59:56,938 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=382036.6666666667, ans=0.1 2023-10-10 13:59:59,342 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.361e+02 1.797e+02 1.967e+02 2.458e+02 3.442e+02, threshold=3.935e+02, percent-clipped=0.0 2023-10-10 13:59:59,658 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=382036.6666666667, ans=0.0 2023-10-10 14:00:11,240 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=382083.3333333333, ans=0.125 2023-10-10 14:00:18,681 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=382130.0, ans=0.0 2023-10-10 14:00:19,801 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.14 vs. limit=15.0 2023-10-10 14:00:23,791 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=382130.0, ans=0.1 2023-10-10 14:00:25,853 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.69 vs. limit=15.0 2023-10-10 14:00:29,910 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=382176.6666666667, ans=0.125 2023-10-10 14:00:45,668 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.67 vs. limit=15.0 2023-10-10 14:00:59,388 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/epoch-6.pt 2023-10-10 14:01:27,296 INFO [train.py:1031] (0/4) Epoch 7, batch 0, loss[loss=0.1955, simple_loss=0.2774, pruned_loss=0.05685, over 16660.00 frames. 
], tot_loss[loss=0.1955, simple_loss=0.2774, pruned_loss=0.05685, over 16660.00 frames. ], batch size: 61, lr: 5.51e-03, grad_scale: 32.0 2023-10-10 14:01:27,298 INFO [train.py:1054] (0/4) Computing validation loss 2023-10-10 14:01:35,185 INFO [train.py:1063] (0/4) Epoch 7, validation: loss=0.2282, simple_loss=0.3154, pruned_loss=0.07055, over 1020973.00 frames. 2023-10-10 14:01:35,185 INFO [train.py:1064] (0/4) Maximum memory allocated so far is 17165MB 2023-10-10 14:01:41,616 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.79 vs. limit=10.0 2023-10-10 14:01:49,889 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=382386.6666666667, ans=0.125 2023-10-10 14:01:54,954 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.08 vs. limit=15.0 2023-10-10 14:02:18,491 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.341e+02 1.759e+02 1.982e+02 2.350e+02 4.264e+02, threshold=3.963e+02, percent-clipped=2.0 2023-10-10 14:02:28,975 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=5.14 vs. limit=12.0 2023-10-10 14:02:33,126 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=382573.3333333333, ans=0.125 2023-10-10 14:03:13,738 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=382760.0, ans=0.2 2023-10-10 14:03:29,355 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=15.70 vs. limit=22.5 2023-10-10 14:03:49,136 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=382900.0, ans=0.0 2023-10-10 14:04:01,897 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=382946.6666666667, ans=0.2 2023-10-10 14:04:12,240 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.390e+02 1.683e+02 1.817e+02 2.092e+02 3.494e+02, threshold=3.634e+02, percent-clipped=0.0 2023-10-10 14:04:15,262 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=382993.3333333333, ans=0.1 2023-10-10 14:04:33,054 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=383086.6666666667, ans=0.125 2023-10-10 14:04:56,231 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.14 vs. 
limit=15.0 2023-10-10 14:05:59,948 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=383413.3333333333, ans=0.2 2023-10-10 14:06:05,270 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.338e+02 1.685e+02 1.925e+02 2.168e+02 2.777e+02, threshold=3.850e+02, percent-clipped=0.0 2023-10-10 14:06:05,912 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=383460.0, ans=0.2 2023-10-10 14:06:12,921 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=383460.0, ans=0.0 2023-10-10 14:06:16,491 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=383506.6666666667, ans=0.125 2023-10-10 14:06:19,678 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=383506.6666666667, ans=0.1 2023-10-10 14:06:26,406 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.67 vs. limit=6.0 2023-10-10 14:06:41,842 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.57 vs. limit=15.0 2023-10-10 14:06:49,089 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=383646.6666666667, ans=0.125 2023-10-10 14:07:01,606 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=383693.3333333333, ans=0.0 2023-10-10 14:07:08,331 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.45 vs. limit=15.0 2023-10-10 14:07:10,229 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.90 vs. limit=22.5 2023-10-10 14:07:10,922 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=383740.0, ans=0.125 2023-10-10 14:07:22,021 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=383786.6666666667, ans=0.1 2023-10-10 14:07:31,288 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=383786.6666666667, ans=0.125 2023-10-10 14:07:47,949 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=383880.0, ans=0.0 2023-10-10 14:07:57,652 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.657e+02 1.844e+02 2.059e+02 3.367e+02, threshold=3.689e+02, percent-clipped=0.0 2023-10-10 14:08:08,723 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=383973.3333333333, ans=0.125 2023-10-10 14:08:25,777 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=384020.0, ans=0.125 2023-10-10 14:08:36,685 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.56 vs. 
limit=22.5 2023-10-10 14:08:55,423 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.39 vs. limit=22.5 2023-10-10 14:09:14,428 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=384253.3333333333, ans=0.0 2023-10-10 14:09:15,487 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=384253.3333333333, ans=0.0 2023-10-10 14:09:16,338 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=384253.3333333333, ans=0.125 2023-10-10 14:09:33,114 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=384300.0, ans=0.125 2023-10-10 14:09:37,306 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.65 vs. limit=10.0 2023-10-10 14:09:40,565 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=384346.6666666667, ans=0.125 2023-10-10 14:09:45,694 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.533e+02 1.740e+02 1.924e+02 2.228e+02 3.092e+02, threshold=3.847e+02, percent-clipped=0.0 2023-10-10 14:09:51,964 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=384393.3333333333, ans=0.0 2023-10-10 14:10:09,837 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=384486.6666666667, ans=0.1 2023-10-10 14:10:15,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=384486.6666666667, ans=0.0 2023-10-10 14:10:29,216 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=384533.3333333333, ans=0.125 2023-10-10 14:10:35,797 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=384580.0, ans=0.1 2023-10-10 14:10:42,103 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=384580.0, ans=0.125 2023-10-10 14:10:43,129 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=384580.0, ans=0.0 2023-10-10 14:10:44,726 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.62 vs. limit=10.0 2023-10-10 14:10:57,835 INFO [train.py:1031] (0/4) Epoch 7, batch 500, loss[loss=0.2199, simple_loss=0.2982, pruned_loss=0.07083, over 16630.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.3059, pruned_loss=0.06851, over 7281808.64 frames. 
], batch size: 241, lr: 5.49e-03, grad_scale: 16.0 2023-10-10 14:10:58,184 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=384673.3333333333, ans=0.07 2023-10-10 14:11:11,502 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=384720.0, ans=0.05 2023-10-10 14:11:16,537 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=384720.0, ans=0.125 2023-10-10 14:11:21,450 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=384766.6666666667, ans=0.125 2023-10-10 14:11:32,749 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=384813.3333333333, ans=0.0 2023-10-10 14:11:44,574 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.693e+02 1.917e+02 2.195e+02 3.145e+02, threshold=3.834e+02, percent-clipped=0.0 2023-10-10 14:11:49,340 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=384860.0, ans=0.1 2023-10-10 14:11:59,389 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.46 vs. limit=10.0 2023-10-10 14:12:01,225 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=384906.6666666667, ans=0.1 2023-10-10 14:12:02,364 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.31 vs. limit=15.0 2023-10-10 14:12:09,097 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.54 vs. limit=15.0 2023-10-10 14:12:20,090 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=385000.0, ans=0.2 2023-10-10 14:12:28,121 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=385000.0, ans=0.2 2023-10-10 14:12:36,310 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.33 vs. limit=10.0 2023-10-10 14:13:01,162 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=385140.0, ans=0.125 2023-10-10 14:13:09,229 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=385186.6666666667, ans=0.125 2023-10-10 14:13:13,494 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=385233.3333333333, ans=0.0 2023-10-10 14:13:14,480 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=385233.3333333333, ans=0.2 2023-10-10 14:13:40,556 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.808e+02 2.070e+02 2.377e+02 3.394e+02, threshold=4.139e+02, percent-clipped=0.0 2023-10-10 14:13:41,076 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.06 vs. 
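Across these records the reported loss is consistently 0.5 * simple_loss + pruned_loss (for example, 0.5 * 0.2982 + 0.07083 = 0.2199 in the epoch 7, batch 500 record, and 0.5 * 0.3154 + 0.07055 = 0.2282 for the epoch 7 validation loss): the usual pruned-transducer combination, in which a cheap "simple" joiner loss regularizes the pruned full loss. The scalar combination, as a sketch:

    def combine_losses(simple_loss: float, pruned_loss: float,
                       simple_loss_scale: float = 0.5) -> float:
        # Matches the logged relationship: loss = 0.5 * simple_loss + pruned_loss.
        return simple_loss_scale * simple_loss + pruned_loss

    assert abs(combine_losses(0.2982, 0.07083) - 0.2199) < 1e-4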
limit=15.0 2023-10-10 14:13:55,905 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=13.44 vs. limit=15.0 2023-10-10 14:14:17,130 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=385466.6666666667, ans=0.0 2023-10-10 14:14:30,316 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.84 vs. limit=6.0 2023-10-10 14:15:27,556 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.78 vs. limit=10.0 2023-10-10 14:15:30,580 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=385746.6666666667, ans=0.125 2023-10-10 14:15:34,651 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 1.768e+02 1.936e+02 2.120e+02 2.899e+02, threshold=3.872e+02, percent-clipped=0.0 2023-10-10 14:15:54,506 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=385840.0, ans=0.0 2023-10-10 14:16:06,348 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=385933.3333333333, ans=0.0 2023-10-10 14:16:12,006 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=385933.3333333333, ans=0.1 2023-10-10 14:16:16,083 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=385933.3333333333, ans=0.2 2023-10-10 14:16:17,197 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=385980.0, ans=0.05 2023-10-10 14:16:21,549 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=385980.0, ans=0.125 2023-10-10 14:16:50,628 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=386073.3333333333, ans=0.0 2023-10-10 14:16:55,413 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=386120.0, ans=0.125 2023-10-10 14:17:07,498 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=386166.6666666667, ans=0.0 2023-10-10 14:17:21,663 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=386213.3333333333, ans=0.125 2023-10-10 14:17:33,591 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.38 vs. 
limit=10.0 2023-10-10 14:17:37,101 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 1.718e+02 1.869e+02 2.020e+02 3.098e+02, threshold=3.739e+02, percent-clipped=0.0 2023-10-10 14:17:56,276 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=386306.6666666667, ans=0.2 2023-10-10 14:17:59,779 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=386353.3333333333, ans=0.125 2023-10-10 14:18:13,055 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=386400.0, ans=0.125 2023-10-10 14:18:21,898 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=386446.6666666667, ans=0.125 2023-10-10 14:18:27,592 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=386446.6666666667, ans=0.125 2023-10-10 14:18:30,718 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=386493.3333333333, ans=0.2 2023-10-10 14:18:33,172 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.81 vs. limit=15.0 2023-10-10 14:18:36,361 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_positive, batch_count=386493.3333333333, ans=0.05 2023-10-10 14:18:40,762 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_na.min_abs, batch_count=386493.3333333333, ans=0.02 2023-10-10 14:18:47,328 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=386540.0, ans=0.1 2023-10-10 14:18:50,018 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=386540.0, ans=0.125 2023-10-10 14:18:54,412 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=386586.6666666667, ans=0.1 2023-10-10 14:18:56,697 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=386586.6666666667, ans=0.125 2023-10-10 14:18:58,846 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=386586.6666666667, ans=0.125 2023-10-10 14:19:02,407 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.54 vs. 
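The ScheduledFloat records interleaved above report the current value (`ans`) of a schedule-controlled scalar, such as a skip rate or balancer probability, at a fractional `batch_count`. A minimal sketch of that mechanism, assuming a piecewise-linear schedule keyed on `batch_count`; the class name, breakpoints, and printout below are illustrative assumptions, not the actual scaling.py implementation:

```python
# Hedged sketch: a piecewise-linear scheduled scalar keyed on batch_count.
# Class name and breakpoints are assumptions for illustration.
import bisect

class ScheduledFloatSketch:
    """Interpolates a float between (batch_count, value) breakpoints."""
    def __init__(self, *points, name="unnamed"):
        # points: e.g. (0.0, 0.2), (4000.0, 0.05) -- value is 0.2 at batch 0,
        # decays linearly to 0.05 by batch 4000, then stays constant.
        self.name = name
        self.xs = [p[0] for p in points]
        self.ys = [p[1] for p in points]

    def value(self, batch_count: float) -> float:
        if batch_count <= self.xs[0]:
            return self.ys[0]
        if batch_count >= self.xs[-1]:
            return self.ys[-1]
        i = bisect.bisect_right(self.xs, batch_count) - 1
        x0, x1 = self.xs[i], self.xs[i + 1]
        y0, y1 = self.ys[i], self.ys[i + 1]
        t = (batch_count - x0) / (x1 - x0)
        return y0 + t * (y1 - y0)

skip = ScheduledFloatSketch((0.0, 0.2), (4000.0, 0.05),
                            name="attention_skip_rate")
for bc in (0.0, 2000.0, 384253.33):
    print(f"ScheduledFloat: name={skip.name}, batch_count={bc}, "
          f"ans={skip.value(bc)}")
```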
limit=15.0 2023-10-10 14:19:05,864 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=386633.3333333333, ans=0.1 2023-10-10 14:19:27,446 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=386726.6666666667, ans=0.0 2023-10-10 14:19:30,250 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.260e+02 1.688e+02 1.839e+02 1.997e+02 3.239e+02, threshold=3.679e+02, percent-clipped=0.0 2023-10-10 14:19:45,963 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=386773.3333333333, ans=0.2 2023-10-10 14:20:02,350 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 14:20:08,062 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=386866.6666666667, ans=0.1 2023-10-10 14:20:12,882 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=386866.6666666667, ans=0.2 2023-10-10 14:20:34,231 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=386960.0, ans=0.125 2023-10-10 14:20:40,028 INFO [train.py:1031] (0/4) Epoch 7, batch 1000, loss[loss=0.2046, simple_loss=0.29, pruned_loss=0.05959, over 16469.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.3068, pruned_loss=0.06874, over 12937485.79 frames. ], batch size: 50, lr: 5.48e-03, grad_scale: 32.0 2023-10-10 14:21:03,123 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.42 vs. limit=15.0 2023-10-10 14:21:09,447 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=387100.0, ans=0.1 2023-10-10 14:21:24,141 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.648e+02 1.829e+02 2.078e+02 2.874e+02, threshold=3.658e+02, percent-clipped=0.0 2023-10-10 14:21:30,791 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=387193.3333333333, ans=0.1 2023-10-10 14:21:37,309 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.38 vs. limit=22.5 2023-10-10 14:22:02,290 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=387333.3333333333, ans=0.2 2023-10-10 14:22:33,392 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=387473.3333333333, ans=0.125 2023-10-10 14:22:37,408 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=387473.3333333333, ans=0.07 2023-10-10 14:22:44,307 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=17.07 vs. 
limit=15.0 2023-10-10 14:23:12,181 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=387613.3333333333, ans=0.125 2023-10-10 14:23:16,016 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=387613.3333333333, ans=0.0 2023-10-10 14:23:17,986 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=387613.3333333333, ans=0.125 2023-10-10 14:23:20,545 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.778e+02 1.984e+02 2.177e+02 2.907e+02, threshold=3.968e+02, percent-clipped=0.0 2023-10-10 14:23:23,234 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=387660.0, ans=0.1 2023-10-10 14:23:24,337 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=387660.0, ans=0.0 2023-10-10 14:24:05,406 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.93 vs. limit=22.5 2023-10-10 14:24:22,882 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=387893.3333333333, ans=0.0 2023-10-10 14:24:33,387 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=387940.0, ans=0.1 2023-10-10 14:24:42,655 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=387940.0, ans=0.1 2023-10-10 14:24:45,401 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=387986.6666666667, ans=0.125 2023-10-10 14:24:48,364 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=387986.6666666667, ans=0.0 2023-10-10 14:24:59,170 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=388033.3333333333, ans=0.1 2023-10-10 14:25:03,353 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=388033.3333333333, ans=0.0 2023-10-10 14:25:11,892 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=388080.0, ans=0.125 2023-10-10 14:25:19,445 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.277e+02 1.667e+02 1.871e+02 2.075e+02 3.092e+02, threshold=3.742e+02, percent-clipped=0.0 2023-10-10 14:25:24,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=388126.6666666667, ans=0.0 2023-10-10 14:25:45,391 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=388220.0, ans=0.125 2023-10-10 14:25:50,590 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.80 vs. 
limit=6.0 2023-10-10 14:25:54,549 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=388266.6666666667, ans=10.0 2023-10-10 14:26:07,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=388313.3333333333, ans=0.125 2023-10-10 14:26:10,844 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=388313.3333333333, ans=0.125 2023-10-10 14:26:11,178 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.63 vs. limit=6.0 2023-10-10 14:26:40,912 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=388453.3333333333, ans=0.07 2023-10-10 14:27:04,483 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=388546.6666666667, ans=0.125 2023-10-10 14:27:13,589 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.354e+02 1.757e+02 1.959e+02 2.242e+02 3.249e+02, threshold=3.917e+02, percent-clipped=0.0 2023-10-10 14:27:14,756 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=388593.3333333333, ans=0.1 2023-10-10 14:27:19,264 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=388593.3333333333, ans=0.125 2023-10-10 14:27:21,434 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.53 vs. limit=22.5 2023-10-10 14:27:31,761 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.17 vs. limit=15.0 2023-10-10 14:27:36,578 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.65 vs. 
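In the optim.py lines above, the five grad-norm values read as min/25%/median/75%/max over a window of recent gradient norms, and the clipping threshold is consistent with `Clipping_scale` times the median (e.g. 2.0 × 1.959e+02 ≈ 3.917e+02 on the nearest line). A minimal sketch of that bookkeeping; the window size and stand-in norms are assumptions:

```python
# Hedged sketch of the quartile/threshold bookkeeping behind the optim.py
# lines: report quantiles over a window of recent gradient norms and clip at
# clipping_scale times the running median, which matches the logged numbers.
import torch

def grad_norm_stats(recent_norms: torch.Tensor, clipping_scale: float = 2.0):
    # Quantiles at 0/25/50/75/100 percent over the window.
    q = torch.quantile(recent_norms,
                       torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * q[2]  # scale times the median
    return q, threshold

norms = 150 + 60 * torch.rand(128)  # stand-in for recorded grad norms
q, thr = grad_norm_stats(norms)
clipped = (norms > thr).float().mean() * 100
print("grad-norm quartiles",
      " ".join(f"{v.item():.3e}" for v in q),
      f"threshold={thr.item():.3e}, percent-clipped={clipped.item():.1f}")
```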
limit=12.0 2023-10-10 14:27:49,954 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=388733.3333333333, ans=0.0 2023-10-10 14:28:07,421 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=388826.6666666667, ans=0.125 2023-10-10 14:28:08,371 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=388826.6666666667, ans=0.125 2023-10-10 14:28:10,824 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=388826.6666666667, ans=0.125 2023-10-10 14:28:42,268 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=388966.6666666667, ans=0.125 2023-10-10 14:28:49,049 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=388966.6666666667, ans=0.125 2023-10-10 14:28:56,426 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=389013.3333333333, ans=0.035 2023-10-10 14:29:09,165 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=389060.0, ans=0.0 2023-10-10 14:29:10,563 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.353e+02 1.680e+02 1.864e+02 2.185e+02 3.818e+02, threshold=3.728e+02, percent-clipped=0.0 2023-10-10 14:29:10,880 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=389060.0, ans=0.125 2023-10-10 14:29:27,320 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=389106.6666666667, ans=0.125 2023-10-10 14:29:37,316 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=389153.3333333333, ans=0.125 2023-10-10 14:30:13,738 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 14:30:21,079 INFO [train.py:1031] (0/4) Epoch 7, batch 1500, loss[loss=0.1931, simple_loss=0.276, pruned_loss=0.05505, over 16026.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.3048, pruned_loss=0.06754, over 17341126.30 frames. ], batch size: 43, lr: 5.46e-03, grad_scale: 16.0 2023-10-10 14:30:27,411 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 14:30:29,456 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=389340.0, ans=0.0 2023-10-10 14:30:43,628 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.06 vs. limit=15.0 2023-10-10 14:31:12,089 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.454e+02 1.713e+02 1.914e+02 2.273e+02 3.600e+02, threshold=3.827e+02, percent-clipped=0.0 2023-10-10 14:31:15,522 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=389526.6666666667, ans=0.09899494936611666 2023-10-10 14:31:31,603 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.78 vs. 
limit=22.5 2023-10-10 14:31:43,575 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=389620.0, ans=0.09899494936611666 2023-10-10 14:32:00,189 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=389713.3333333333, ans=0.125 2023-10-10 14:32:01,174 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=389713.3333333333, ans=0.125 2023-10-10 14:32:03,097 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=389713.3333333333, ans=0.125 2023-10-10 14:32:15,894 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=389760.0, ans=0.125 2023-10-10 14:32:26,290 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=389806.6666666667, ans=0.0 2023-10-10 14:32:29,377 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.94 vs. limit=15.0 2023-10-10 14:32:30,418 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.75 vs. limit=12.0 2023-10-10 14:32:42,104 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=389900.0, ans=0.125 2023-10-10 14:33:11,379 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.366e+02 1.690e+02 1.920e+02 2.313e+02 3.796e+02, threshold=3.840e+02, percent-clipped=0.0 2023-10-10 14:33:26,950 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=390040.0, ans=0.0 2023-10-10 14:33:27,824 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=390040.0, ans=0.1 2023-10-10 14:33:33,428 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=390086.6666666667, ans=0.1 2023-10-10 14:33:36,837 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=390086.6666666667, ans=0.125 2023-10-10 14:33:39,568 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=390086.6666666667, ans=0.1 2023-10-10 14:33:43,012 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=390133.3333333333, ans=0.5 2023-10-10 14:34:15,602 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.74 vs. 
limit=10.0 2023-10-10 14:34:27,027 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=390320.0, ans=0.125 2023-10-10 14:34:42,861 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=390366.6666666667, ans=0.125 2023-10-10 14:34:46,177 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=390413.3333333333, ans=0.125 2023-10-10 14:34:57,837 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.656e+02 1.811e+02 1.971e+02 2.533e+02, threshold=3.623e+02, percent-clipped=0.0 2023-10-10 14:35:04,774 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=390460.0, ans=0.0 2023-10-10 14:35:34,874 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=390600.0, ans=0.95 2023-10-10 14:35:44,660 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.70 vs. limit=12.0 2023-10-10 14:36:01,559 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=390693.3333333333, ans=0.0 2023-10-10 14:36:05,370 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.96 vs. limit=10.0 2023-10-10 14:36:08,300 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.34 vs. limit=22.5 2023-10-10 14:36:27,750 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.66 vs. limit=10.0 2023-10-10 14:36:28,835 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.83 vs. 
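The Whitening records compare a per-module `metric` against a `limit`; a plausible reading (an assumption about what the metric measures, not a quote of scaling.py) is a whiteness statistic that equals 1.0 when the feature covariance is a multiple of the identity and grows as the eigenvalue spectrum becomes uneven. A sketch under that assumption:

```python
# Hedged sketch of a "whiteness" metric like the ones compared against
# `limit` above: 1.0 for identity-covariance (white) features, larger as the
# per-channel spectrum becomes uneven. The exact formula is an assumption.
import torch

def whitening_metric(x: torch.Tensor) -> float:
    # x: (num_frames, num_channels), assumed zero-mean for simplicity.
    cov = x.t() @ x / x.shape[0]
    mean_eig = torch.diagonal(cov).mean()           # mean eigenvalue
    mean_eig_sq = torch.diagonal(cov @ cov).mean()  # mean squared eigenvalue
    return (mean_eig_sq / (mean_eig ** 2 + 1e-20)).item()

white = torch.randn(4000, 384)                   # roughly white features
skewed = white * torch.linspace(0.1, 3.0, 384)   # uneven per-channel scale
for name, feats in [("white", white), ("skewed", skewed)]:
    print(f"Whitening: name={name}, "
          f"metric={whitening_metric(feats):.2f} vs. limit=4.04")
```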
limit=12.0 2023-10-10 14:36:39,567 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=390880.0, ans=0.125 2023-10-10 14:36:53,784 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.299e+02 1.596e+02 1.760e+02 2.051e+02 3.571e+02, threshold=3.520e+02, percent-clipped=0.0 2023-10-10 14:37:02,855 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 14:37:09,916 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=390973.3333333333, ans=0.125 2023-10-10 14:37:11,720 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=390973.3333333333, ans=0.1 2023-10-10 14:37:13,866 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=391020.0, ans=0.125 2023-10-10 14:37:14,679 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=391020.0, ans=0.0 2023-10-10 14:37:18,684 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=391020.0, ans=0.0 2023-10-10 14:37:20,748 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=391020.0, ans=0.125 2023-10-10 14:37:29,610 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=391066.6666666667, ans=0.1 2023-10-10 14:37:29,765 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=391066.6666666667, ans=0.0 2023-10-10 14:37:50,546 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.82 vs. limit=15.0 2023-10-10 14:37:51,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=391160.0, ans=0.1 2023-10-10 14:38:14,707 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=391253.3333333333, ans=0.0 2023-10-10 14:38:47,042 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.389e+02 1.764e+02 1.959e+02 2.185e+02 3.012e+02, threshold=3.919e+02, percent-clipped=0.0 2023-10-10 14:39:26,461 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=391533.3333333333, ans=0.0 2023-10-10 14:39:42,345 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=391580.0, ans=0.1 2023-10-10 14:40:04,689 INFO [train.py:1031] (0/4) Epoch 7, batch 2000, loss[loss=0.2361, simple_loss=0.3312, pruned_loss=0.07051, over 16801.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.3052, pruned_loss=0.06765, over 20736262.96 frames. 
], batch size: 175, lr: 5.44e-03, grad_scale: 32.0 2023-10-10 14:40:10,804 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=391673.3333333333, ans=0.125 2023-10-10 14:40:19,211 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.58 vs. limit=15.0 2023-10-10 14:40:28,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=391720.0, ans=0.0 2023-10-10 14:40:55,585 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=391813.3333333333, ans=0.125 2023-10-10 14:41:04,665 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 1.695e+02 1.890e+02 2.124e+02 3.384e+02, threshold=3.780e+02, percent-clipped=0.0 2023-10-10 14:41:32,583 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=391953.3333333333, ans=0.0 2023-10-10 14:41:34,517 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=391953.3333333333, ans=0.2 2023-10-10 14:41:43,285 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.16 vs. limit=15.0 2023-10-10 14:41:49,183 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=392046.6666666667, ans=0.125 2023-10-10 14:41:57,077 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=392046.6666666667, ans=0.1 2023-10-10 14:42:08,757 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=392093.3333333333, ans=0.1 2023-10-10 14:42:08,990 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=392093.3333333333, ans=0.125 2023-10-10 14:42:55,551 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=392233.3333333333, ans=0.0 2023-10-10 14:43:09,132 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=392233.3333333333, ans=0.1 2023-10-10 14:43:14,404 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=392280.0, ans=0.1 2023-10-10 14:43:19,093 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=392280.0, ans=0.125 2023-10-10 14:43:19,364 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.69 vs. 
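The per-batch summaries are internally consistent with loss = 0.5 × simple_loss + pruned_loss (at batch 2000 above, 0.5 × 0.3312 + 0.07051 ≈ 0.2361), while `tot_loss ... over N frames` behaves like a frame-weighted running average since the start of the epoch. A minimal sketch of that aggregation; the class and method names are illustrative assumptions:

```python
# Hedged sketch of the loss bookkeeping implied by the train.py summaries:
# per-batch loss combined as 0.5 * simple_loss + pruned_loss (consistent
# with the logged numbers), tot_loss as a frame-weighted running average.
class MetricsTrackerSketch:
    def __init__(self):
        self.frames = 0.0
        self.weighted = {"loss": 0.0, "simple_loss": 0.0, "pruned_loss": 0.0}

    def update(self, batch_frames, simple_loss, pruned_loss):
        loss = 0.5 * simple_loss + pruned_loss
        self.frames += batch_frames
        for k, v in zip(self.weighted, (loss, simple_loss, pruned_loss)):
            self.weighted[k] += batch_frames * v
        return loss

    def summary(self):
        avg = {k: v / self.frames for k, v in self.weighted.items()}
        return (f"tot_loss[loss={avg['loss']:.4f}, "
                f"simple_loss={avg['simple_loss']:.4f}, "
                f"pruned_loss={avg['pruned_loss']:.4f}, "
                f"over {self.frames:.2f} frames.]")

t = MetricsTrackerSketch()
t.update(16469.0, simple_loss=0.29, pruned_loss=0.05959)   # batch-1000 values
t.update(16630.0, simple_loss=0.2982, pruned_loss=0.07083) # batch-500 values
print(t.summary())
```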
limit=6.0 2023-10-10 14:43:20,082 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=392280.0, ans=0.1 2023-10-10 14:43:27,689 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=7.746e-03 2023-10-10 14:43:29,432 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.322e+02 1.598e+02 1.792e+02 2.086e+02 3.103e+02, threshold=3.585e+02, percent-clipped=0.0 2023-10-10 14:43:30,918 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.29 vs. limit=15.0 2023-10-10 14:43:56,883 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=392420.0, ans=0.0 2023-10-10 14:43:57,879 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=392420.0, ans=0.125 2023-10-10 14:43:58,008 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=392420.0, ans=0.0 2023-10-10 14:44:10,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=392466.6666666667, ans=0.0 2023-10-10 14:44:30,465 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=392513.3333333333, ans=15.0 2023-10-10 14:44:30,922 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=392513.3333333333, ans=0.0 2023-10-10 14:44:46,207 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=392560.0, ans=0.2 2023-10-10 14:44:51,989 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.09 vs. limit=15.0 2023-10-10 14:44:57,153 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=392606.6666666667, ans=0.125 2023-10-10 14:45:01,320 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=392653.3333333333, ans=0.07 2023-10-10 14:45:04,135 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=392653.3333333333, ans=0.0 2023-10-10 14:45:12,162 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=392653.3333333333, ans=0.0 2023-10-10 14:45:13,126 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=392653.3333333333, ans=0.125 2023-10-10 14:45:13,277 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=392653.3333333333, ans=0.125 2023-10-10 14:45:14,484 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.62 vs. 
limit=15.0 2023-10-10 14:45:26,880 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=392746.6666666667, ans=0.1 2023-10-10 14:45:39,674 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.329e+02 1.731e+02 1.921e+02 2.161e+02 3.047e+02, threshold=3.842e+02, percent-clipped=0.0 2023-10-10 14:45:45,137 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.00 vs. limit=15.0 2023-10-10 14:45:46,823 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=392793.3333333333, ans=0.05 2023-10-10 14:45:52,031 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=392840.0, ans=0.125 2023-10-10 14:46:21,442 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=392980.0, ans=0.95 2023-10-10 14:46:21,870 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.36 vs. limit=5.0 2023-10-10 14:46:24,705 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=392980.0, ans=0.125 2023-10-10 14:46:36,299 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=393026.6666666667, ans=0.125 2023-10-10 14:46:48,422 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=393073.3333333333, ans=0.1 2023-10-10 14:47:20,365 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=393213.3333333333, ans=0.0 2023-10-10 14:47:34,121 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=393260.0, ans=0.2 2023-10-10 14:47:37,884 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=393260.0, ans=0.0 2023-10-10 14:47:38,497 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.428e+02 1.753e+02 1.915e+02 2.104e+02 2.840e+02, threshold=3.830e+02, percent-clipped=0.0 2023-10-10 14:48:10,049 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=393400.0, ans=0.1 2023-10-10 14:48:13,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=393400.0, ans=0.1 2023-10-10 14:48:14,426 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 14:48:16,225 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.05 vs. 
limit=6.0 2023-10-10 14:48:26,362 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=393446.6666666667, ans=0.125 2023-10-10 14:48:27,287 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=393446.6666666667, ans=0.125 2023-10-10 14:48:40,373 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=393493.3333333333, ans=0.125 2023-10-10 14:48:42,428 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=393493.3333333333, ans=0.125 2023-10-10 14:48:51,109 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.29 vs. limit=22.5 2023-10-10 14:48:58,471 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=393586.6666666667, ans=0.0 2023-10-10 14:49:02,209 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=393586.6666666667, ans=0.0 2023-10-10 14:49:18,069 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=393680.0, ans=0.0 2023-10-10 14:49:19,878 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=393680.0, ans=0.0 2023-10-10 14:49:32,812 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.521e+02 1.743e+02 1.943e+02 2.270e+02 3.085e+02, threshold=3.886e+02, percent-clipped=0.0 2023-10-10 14:49:45,479 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.57 vs. limit=12.0 2023-10-10 14:49:55,600 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=393820.0, ans=0.125 2023-10-10 14:49:58,406 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=393820.0, ans=0.0 2023-10-10 14:50:05,583 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.24 vs. limit=22.5 2023-10-10 14:50:14,316 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.11 vs. limit=22.5 2023-10-10 14:50:31,481 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=393960.0, ans=0.125 2023-10-10 14:50:32,552 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=394006.6666666667, ans=0.0 2023-10-10 14:50:33,210 INFO [train.py:1031] (0/4) Epoch 7, batch 2500, loss[loss=0.2187, simple_loss=0.3037, pruned_loss=0.06688, over 16986.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.3053, pruned_loss=0.06764, over 23426828.06 frames. ], batch size: 123, lr: 5.43e-03, grad_scale: 32.0 2023-10-10 14:50:53,807 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.88 vs. 
limit=10.0 2023-10-10 14:51:03,946 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=394146.6666666667, ans=0.2 2023-10-10 14:51:11,972 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=394146.6666666667, ans=0.125 2023-10-10 14:51:16,560 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=394193.3333333333, ans=0.125 2023-10-10 14:51:17,448 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=394193.3333333333, ans=0.0 2023-10-10 14:51:19,044 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.374e+02 1.742e+02 1.915e+02 2.111e+02 3.351e+02, threshold=3.830e+02, percent-clipped=0.0 2023-10-10 14:51:41,847 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=394286.6666666667, ans=0.2 2023-10-10 14:51:41,896 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=394286.6666666667, ans=10.0 2023-10-10 14:51:57,335 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=394380.0, ans=0.0 2023-10-10 14:52:41,315 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=394520.0, ans=0.0 2023-10-10 14:52:42,618 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=394520.0, ans=0.125 2023-10-10 14:52:45,250 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=394566.6666666667, ans=0.0 2023-10-10 14:52:55,400 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=394613.3333333333, ans=0.125 2023-10-10 14:53:07,554 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=394660.0, ans=0.1 2023-10-10 14:53:09,233 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.383e+02 1.804e+02 1.956e+02 2.373e+02 3.236e+02, threshold=3.912e+02, percent-clipped=0.0 2023-10-10 14:53:20,094 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=394706.6666666667, ans=0.0 2023-10-10 14:53:20,767 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=394706.6666666667, ans=0.125 2023-10-10 14:53:21,679 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=394706.6666666667, ans=0.035 2023-10-10 14:53:25,922 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=394706.6666666667, ans=0.125 2023-10-10 14:54:10,684 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=394940.0, ans=0.07 2023-10-10 14:54:30,035 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=394986.6666666667, ans=0.0 2023-10-10 14:54:54,831 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, 
metric=5.71 vs. limit=15.0 2023-10-10 14:55:03,456 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.364e+02 1.723e+02 1.939e+02 2.192e+02 3.740e+02, threshold=3.879e+02, percent-clipped=0.0 2023-10-10 14:55:03,774 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 14:55:06,306 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=395126.6666666667, ans=0.0 2023-10-10 14:55:13,029 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=395173.3333333333, ans=0.2 2023-10-10 14:55:15,570 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=395173.3333333333, ans=0.0 2023-10-10 14:55:24,325 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=395173.3333333333, ans=15.0 2023-10-10 14:55:45,466 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.512e-03 2023-10-10 14:56:11,277 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=395360.0, ans=0.1 2023-10-10 14:56:26,466 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=395406.6666666667, ans=0.07 2023-10-10 14:56:40,872 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.76 vs. limit=10.0 2023-10-10 14:56:50,008 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=395500.0, ans=0.0 2023-10-10 14:56:51,362 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.18 vs. limit=15.0 2023-10-10 14:57:08,415 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=395593.3333333333, ans=0.0 2023-10-10 14:57:11,818 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.327e+02 1.631e+02 1.896e+02 2.111e+02 3.244e+02, threshold=3.792e+02, percent-clipped=0.0 2023-10-10 14:57:17,323 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=7.97 vs. 
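The WithLoss records attach a summed auxiliary value to attention weights, almost always 0.000e+00 with occasional small positives (2.512e-03 above). One hypothetical form such a diagnostic could take; the penalty shape and threshold here are guesses for illustration only:

```python
# Hedged, hypothetical sketch of a "WithLoss"-style diagnostic: attach a
# small auxiliary penalty to an intermediate tensor and log its summed value,
# which stays 0 while the activations remain in range (matching the frequent
# "loss-sum=0.000e+00" lines). Penalty form and limit are assumptions.
import torch

def with_loss(attn_weights: torch.Tensor, name: str, limit: float = 0.99):
    # Penalize only attention probability mass above `limit`; random softmax
    # weights over ~100 positions essentially never exceed it.
    aux = torch.relu(attn_weights - limit).sum()
    print(f"WithLoss: name={name}, loss-sum={aux.item():.3e}")
    return aux

w = torch.softmax(torch.randn(4, 8, 100, 100), dim=-1)
aux_loss = with_loss(w, "encoder.encoders.5.encoder.layers.1.self_attn_weights")
# aux_loss could be added, with a small scale, to the training loss.
```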
limit=15.0 2023-10-10 14:57:32,411 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=395686.6666666667, ans=0.0 2023-10-10 14:57:36,143 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=395686.6666666667, ans=0.1 2023-10-10 14:57:58,231 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 14:58:14,510 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=395826.6666666667, ans=0.1 2023-10-10 14:58:36,286 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=395920.0, ans=0.125 2023-10-10 14:59:00,499 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff3.min_abs, batch_count=396013.3333333333, ans=0.2 2023-10-10 14:59:15,860 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 1.654e+02 1.915e+02 2.150e+02 3.116e+02, threshold=3.831e+02, percent-clipped=0.0 2023-10-10 15:00:05,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=396246.6666666667, ans=0.125 2023-10-10 15:00:17,535 INFO [train.py:1031] (0/4) Epoch 7, batch 3000, loss[loss=0.2157, simple_loss=0.2969, pruned_loss=0.06724, over 16909.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.3044, pruned_loss=0.0673, over 25501897.90 frames. ], batch size: 138, lr: 5.41e-03, grad_scale: 16.0 2023-10-10 15:00:25,519 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=396340.0, ans=0.2 2023-10-10 15:00:44,952 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.77 vs. limit=6.0 2023-10-10 15:00:46,707 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=396433.3333333333, ans=0.05 2023-10-10 15:00:54,327 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.min_positive, batch_count=396480.0, ans=0.025 2023-10-10 15:01:09,647 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.446e+02 1.764e+02 2.003e+02 2.343e+02 3.849e+02, threshold=4.007e+02, percent-clipped=1.0 2023-10-10 15:01:14,622 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=396526.6666666667, ans=0.125 2023-10-10 15:01:16,727 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=396573.3333333333, ans=0.125 2023-10-10 15:01:24,555 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=396573.3333333333, ans=0.125 2023-10-10 15:01:24,808 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.52 vs. limit=15.0 2023-10-10 15:01:31,997 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.57 vs. 
limit=10.0 2023-10-10 15:01:59,770 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=396760.0, ans=0.125 2023-10-10 15:01:59,935 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.45 vs. limit=22.5 2023-10-10 15:02:10,461 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.22 vs. limit=22.5 2023-10-10 15:02:16,837 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=396806.6666666667, ans=0.125 2023-10-10 15:02:28,328 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.64 vs. limit=22.5 2023-10-10 15:02:33,761 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 15:02:44,734 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=396900.0, ans=0.125 2023-10-10 15:02:54,192 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 15:02:57,045 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=396946.6666666667, ans=0.2 2023-10-10 15:03:08,319 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.358e+02 1.711e+02 1.845e+02 2.157e+02 2.806e+02, threshold=3.689e+02, percent-clipped=0.0 2023-10-10 15:03:44,956 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=397133.3333333333, ans=0.09899494936611666 2023-10-10 15:04:15,377 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=397273.3333333333, ans=0.125 2023-10-10 15:04:20,889 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=397273.3333333333, ans=10.0 2023-10-10 15:04:22,686 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=397320.0, ans=0.125 2023-10-10 15:04:31,860 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=397320.0, ans=0.09899494936611666 2023-10-10 15:05:03,995 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=397460.0, ans=0.0 2023-10-10 15:05:09,573 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.47 vs. 
limit=22.5 2023-10-10 15:05:09,987 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.317e+02 1.755e+02 2.084e+02 2.367e+02 3.510e+02, threshold=4.168e+02, percent-clipped=0.0 2023-10-10 15:05:13,433 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=397460.0, ans=0.0 2023-10-10 15:05:20,863 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=397506.6666666667, ans=0.0 2023-10-10 15:05:29,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=397506.6666666667, ans=0.1 2023-10-10 15:05:49,900 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=397600.0, ans=0.1 2023-10-10 15:05:54,053 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.60 vs. limit=10.0 2023-10-10 15:06:04,485 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=397646.6666666667, ans=0.125 2023-10-10 15:06:19,451 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=397693.3333333333, ans=0.125 2023-10-10 15:06:51,003 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.82 vs. limit=15.0 2023-10-10 15:06:53,620 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.81 vs. limit=15.0 2023-10-10 15:07:05,237 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=397880.0, ans=0.125 2023-10-10 15:07:08,669 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=397880.0, ans=0.0 2023-10-10 15:07:15,325 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=397926.6666666667, ans=0.0 2023-10-10 15:07:18,672 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.695e+02 1.892e+02 2.115e+02 2.615e+02, threshold=3.784e+02, percent-clipped=0.0 2023-10-10 15:07:31,578 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=397973.3333333333, ans=0.125 2023-10-10 15:07:48,111 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=398066.6666666667, ans=0.125 2023-10-10 15:07:58,928 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 15:08:17,123 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.20 vs. limit=15.0 2023-10-10 15:08:37,476 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 15:08:53,610 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.20 vs. 
limit=15.0 2023-10-10 15:09:08,912 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=398346.6666666667, ans=0.1 2023-10-10 15:09:12,378 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=398393.3333333333, ans=0.125 2023-10-10 15:09:12,470 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.79 vs. limit=12.0 2023-10-10 15:09:19,242 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.359e+02 1.669e+02 1.829e+02 2.198e+02 3.544e+02, threshold=3.658e+02, percent-clipped=0.0 2023-10-10 15:09:22,520 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.88 vs. limit=15.0 2023-10-10 15:09:48,630 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.61 vs. limit=15.0 2023-10-10 15:09:59,384 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=398533.3333333333, ans=10.0 2023-10-10 15:10:04,810 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=398580.0, ans=0.1 2023-10-10 15:10:09,288 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=398580.0, ans=0.1 2023-10-10 15:10:13,695 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.94 vs. limit=15.0 2023-10-10 15:10:14,112 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=398626.6666666667, ans=0.1 2023-10-10 15:10:25,439 INFO [train.py:1031] (0/4) Epoch 7, batch 3500, loss[loss=0.2121, simple_loss=0.2948, pruned_loss=0.06467, over 16295.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.3044, pruned_loss=0.06752, over 27115126.68 frames. 
2023-10-10 15:10:28,434 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=398673.3333333333, ans=0.09899494936611666 2023-10-10 15:11:14,593 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.422e+02 1.691e+02 1.861e+02 2.108e+02 3.840e+02, threshold=3.722e+02, percent-clipped=1.0 2023-10-10 15:11:15,915 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=398860.0, ans=0.125 2023-10-10 15:11:32,613 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=398953.3333333333, ans=6.0 2023-10-10 15:11:59,015 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=399000.0, ans=0.2 2023-10-10 15:12:10,236 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=399046.6666666667, ans=0.125 2023-10-10 15:12:30,926 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=399140.0, ans=0.125 2023-10-10 15:12:50,103 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.67 vs. limit=15.0 2023-10-10 15:13:05,599 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.70 vs. limit=6.0 2023-10-10 15:13:25,891 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.370e+02 1.652e+02 1.838e+02 2.072e+02 2.985e+02, threshold=3.677e+02, percent-clipped=0.0 2023-10-10 15:13:26,150 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=399326.6666666667, ans=10.0 2023-10-10 15:13:29,612 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=399326.6666666667, ans=0.125 2023-10-10 15:13:30,426 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=399326.6666666667, ans=0.0 2023-10-10 15:13:37,227 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=399373.3333333333, ans=0.1 2023-10-10 15:13:51,778 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=399420.0, ans=0.0 2023-10-10 15:13:58,340 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=399466.6666666667, ans=0.09899494936611666 2023-10-10 15:14:13,787 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=399513.3333333333, ans=0.1 2023-10-10 15:14:49,122 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.75 vs.
limit=15.0 2023-10-10 15:15:29,195 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.269e+02 1.608e+02 1.726e+02 1.918e+02 2.632e+02, threshold=3.452e+02, percent-clipped=0.0 2023-10-10 15:15:37,085 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=399840.0, ans=0.0 2023-10-10 15:15:38,507 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.52 vs. limit=22.5 2023-10-10 15:15:56,702 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=399886.6666666667, ans=0.125 2023-10-10 15:16:13,052 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=399980.0, ans=0.125 2023-10-10 15:16:17,861 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=399980.0, ans=0.125 2023-10-10 15:16:39,450 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=400073.3333333333, ans=0.0 2023-10-10 15:17:25,077 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.358e+02 1.660e+02 1.879e+02 2.069e+02 3.139e+02, threshold=3.758e+02, percent-clipped=0.0 2023-10-10 15:17:36,865 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 15:17:48,886 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-10 15:17:52,042 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=400353.3333333333, ans=0.125 2023-10-10 15:18:36,487 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=400540.0, ans=0.125 2023-10-10 15:18:50,851 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.15 vs. limit=15.0 2023-10-10 15:19:11,038 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=400726.6666666667, ans=0.125 2023-10-10 15:19:15,747 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.359e+02 1.617e+02 1.788e+02 2.094e+02 2.634e+02, threshold=3.576e+02, percent-clipped=0.0 2023-10-10 15:19:26,474 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=400773.3333333333, ans=0.07 2023-10-10 15:19:32,692 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=400820.0, ans=0.2 2023-10-10 15:19:35,334 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=400820.0, ans=0.125 2023-10-10 15:19:52,501 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.57 vs. 
limit=12.0 2023-10-10 15:19:57,096 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=400866.6666666667, ans=0.125 2023-10-10 15:20:18,052 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=400960.0, ans=0.0 2023-10-10 15:20:22,447 INFO [train.py:1031] (0/4) Epoch 7, batch 4000, loss[loss=0.2272, simple_loss=0.3171, pruned_loss=0.06865, over 16945.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.304, pruned_loss=0.06747, over 28374367.73 frames. ], batch size: 156, lr: 5.38e-03, grad_scale: 32.0 2023-10-10 15:20:27,354 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=401006.6666666667, ans=0.125 2023-10-10 15:20:29,164 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.53 vs. limit=12.0 2023-10-10 15:20:44,762 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=401053.3333333333, ans=0.125 2023-10-10 15:20:44,798 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=401053.3333333333, ans=0.0 2023-10-10 15:20:54,741 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=401100.0, ans=0.1 2023-10-10 15:21:18,240 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=6.70 vs. limit=15.0 2023-10-10 15:21:18,553 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.330e+02 1.801e+02 2.071e+02 2.423e+02 4.077e+02, threshold=4.142e+02, percent-clipped=2.0 2023-10-10 15:21:25,412 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=401240.0, ans=0.04949747468305833 2023-10-10 15:21:58,552 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=401333.3333333333, ans=0.05 2023-10-10 15:22:39,694 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=401520.0, ans=0.125 2023-10-10 15:22:41,988 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.28 vs. limit=15.0 2023-10-10 15:22:54,640 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=401566.6666666667, ans=0.125 2023-10-10 15:22:54,708 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=401566.6666666667, ans=0.0 2023-10-10 15:23:10,258 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.80 vs. 
limit=10.0 2023-10-10 15:23:19,297 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=401660.0, ans=0.1 2023-10-10 15:23:21,760 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.382e+02 1.844e+02 2.114e+02 2.388e+02 3.529e+02, threshold=4.228e+02, percent-clipped=0.0 2023-10-10 15:23:23,750 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.27 vs. limit=10.0 2023-10-10 15:23:25,513 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=401660.0, ans=0.125 2023-10-10 15:23:31,681 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=401706.6666666667, ans=0.1 2023-10-10 15:23:53,026 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=401753.3333333333, ans=0.125 2023-10-10 15:24:19,018 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=401846.6666666667, ans=0.125 2023-10-10 15:24:25,290 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=401893.3333333333, ans=0.125 2023-10-10 15:24:43,119 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=401940.0, ans=0.125 2023-10-10 15:25:03,262 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.53 vs. 
limit=6.0 2023-10-10 15:25:32,713 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.322e+02 1.692e+02 1.940e+02 2.216e+02 3.151e+02, threshold=3.881e+02, percent-clipped=0.0 2023-10-10 15:26:04,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=402266.6666666667, ans=0.125 2023-10-10 15:26:15,422 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=402313.3333333333, ans=0.1 2023-10-10 15:26:17,520 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=402313.3333333333, ans=0.125 2023-10-10 15:26:23,644 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=402360.0, ans=0.0 2023-10-10 15:26:29,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=402360.0, ans=0.2 2023-10-10 15:26:36,637 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=402406.6666666667, ans=0.04949747468305833 2023-10-10 15:26:46,756 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 15:26:54,190 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=402453.3333333333, ans=0.2 2023-10-10 15:27:06,168 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=402546.6666666667, ans=0.1 2023-10-10 15:27:15,259 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=402546.6666666667, ans=0.125 2023-10-10 15:27:15,295 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=402546.6666666667, ans=0.0 2023-10-10 15:27:19,283 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.94 vs. limit=10.0 2023-10-10 15:27:20,419 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.59 vs. limit=12.0 2023-10-10 15:27:23,710 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.448e+02 1.796e+02 2.012e+02 2.328e+02 3.393e+02, threshold=4.024e+02, percent-clipped=0.0 2023-10-10 15:27:35,531 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=402640.0, ans=0.125 2023-10-10 15:27:54,925 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=402733.3333333333, ans=0.125 2023-10-10 15:28:06,778 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=402780.0, ans=0.1 2023-10-10 15:28:16,090 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=402826.6666666667, ans=0.125 2023-10-10 15:28:16,280 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.39 vs. 
limit=10.0 2023-10-10 15:28:45,668 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=402920.0, ans=0.1 2023-10-10 15:28:54,539 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=402966.6666666667, ans=0.125 2023-10-10 15:29:19,863 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.14 vs. limit=22.5 2023-10-10 15:29:29,026 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.882e+02 2.133e+02 2.523e+02 3.546e+02, threshold=4.266e+02, percent-clipped=0.0 2023-10-10 15:29:38,968 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=403106.6666666667, ans=0.1 2023-10-10 15:29:44,841 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=403153.3333333333, ans=0.125 2023-10-10 15:30:12,341 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.21 vs. limit=22.5 2023-10-10 15:30:21,422 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.70 vs. limit=15.0 2023-10-10 15:30:27,612 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 15:30:27,641 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=403293.3333333333, ans=0.1 2023-10-10 15:30:29,251 INFO [train.py:1031] (0/4) Epoch 7, batch 4500, loss[loss=0.2288, simple_loss=0.3093, pruned_loss=0.07415, over 16649.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.3043, pruned_loss=0.06731, over 29348382.74 frames. ], batch size: 241, lr: 5.36e-03, grad_scale: 32.0 2023-10-10 15:31:10,600 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=403480.0, ans=0.0 2023-10-10 15:31:19,258 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.432e+02 1.716e+02 1.912e+02 2.204e+02 3.218e+02, threshold=3.824e+02, percent-clipped=0.0 2023-10-10 15:31:25,203 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.08 vs. limit=15.0 2023-10-10 15:31:32,284 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=403573.3333333333, ans=0.125 2023-10-10 15:32:19,521 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 15:32:31,543 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=403853.3333333333, ans=0.1 2023-10-10 15:32:42,888 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=403900.0, ans=0.125 2023-10-10 15:32:44,934 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.50 vs. 
limit=15.0 2023-10-10 15:32:53,994 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=403946.6666666667, ans=0.125 2023-10-10 15:32:59,176 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=403993.3333333333, ans=0.125 2023-10-10 15:33:00,865 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=403993.3333333333, ans=0.125 2023-10-10 15:33:03,674 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=403993.3333333333, ans=0.125 2023-10-10 15:33:05,079 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.399e+02 1.731e+02 1.887e+02 2.187e+02 3.727e+02, threshold=3.774e+02, percent-clipped=0.0 2023-10-10 15:33:05,345 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=403993.3333333333, ans=0.125 2023-10-10 15:33:05,485 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=403993.3333333333, ans=0.1 2023-10-10 15:33:05,515 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 15:33:06,644 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.16 vs. limit=10.0 2023-10-10 15:33:12,833 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=404040.0, ans=0.125 2023-10-10 15:33:23,601 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=404086.6666666667, ans=0.0 2023-10-10 15:33:36,613 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.21 vs. limit=22.5 2023-10-10 15:33:43,208 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=404133.3333333333, ans=0.1 2023-10-10 15:33:47,979 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.82 vs. limit=6.0 2023-10-10 15:34:03,939 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=404226.6666666667, ans=0.125 2023-10-10 15:34:09,118 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=404273.3333333333, ans=0.0 2023-10-10 15:34:10,016 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=404273.3333333333, ans=0.1 2023-10-10 15:34:20,553 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=404320.0, ans=0.125 2023-10-10 15:34:40,955 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.85 vs. limit=12.0 2023-10-10 15:34:42,724 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.30 vs. 
limit=15.0 2023-10-10 15:34:44,379 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=404413.3333333333, ans=15.0 2023-10-10 15:34:54,851 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.410e+02 1.754e+02 1.949e+02 2.290e+02 3.545e+02, threshold=3.899e+02, percent-clipped=0.0 2023-10-10 15:34:55,167 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=404460.0, ans=0.125 2023-10-10 15:35:04,882 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=404506.6666666667, ans=0.0 2023-10-10 15:35:06,159 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.68 vs. limit=10.0 2023-10-10 15:35:24,876 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=404600.0, ans=0.05 2023-10-10 15:35:29,410 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=404646.6666666667, ans=0.125 2023-10-10 15:35:41,816 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=404693.3333333333, ans=0.0 2023-10-10 15:35:51,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=404740.0, ans=0.1 2023-10-10 15:35:54,406 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=404740.0, ans=0.0 2023-10-10 15:35:55,353 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=404740.0, ans=0.95 2023-10-10 15:35:56,309 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=404740.0, ans=0.125 2023-10-10 15:36:24,532 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.71 vs. 
limit=6.0 2023-10-10 15:36:48,378 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.361e+02 1.738e+02 1.924e+02 2.160e+02 2.956e+02, threshold=3.849e+02, percent-clipped=0.0 2023-10-10 15:37:07,830 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=405020.0, ans=0.125 2023-10-10 15:37:22,812 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=405066.6666666667, ans=0.0 2023-10-10 15:37:28,890 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=405113.3333333333, ans=0.1 2023-10-10 15:37:43,352 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=405160.0, ans=0.125 2023-10-10 15:38:01,632 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=405253.3333333333, ans=0.125 2023-10-10 15:38:18,843 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=405300.0, ans=0.0 2023-10-10 15:38:23,391 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=405346.6666666667, ans=0.125 2023-10-10 15:38:40,406 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=405393.3333333333, ans=0.04949747468305833 2023-10-10 15:38:40,497 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=405393.3333333333, ans=0.125 2023-10-10 15:38:44,446 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=405393.3333333333, ans=0.125 2023-10-10 15:38:44,946 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.406e+02 1.704e+02 1.998e+02 2.408e+02 3.552e+02, threshold=3.996e+02, percent-clipped=0.0 2023-10-10 15:38:45,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=405393.3333333333, ans=0.04949747468305833 2023-10-10 15:38:47,659 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=405393.3333333333, ans=0.125 2023-10-10 15:38:56,971 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.66 vs. limit=15.0 2023-10-10 15:39:00,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=405486.6666666667, ans=0.125 2023-10-10 15:39:04,855 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 15:39:04,993 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.86 vs. 
limit=15.0 2023-10-10 15:39:06,025 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=405486.6666666667, ans=0.125 2023-10-10 15:39:14,688 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=405533.3333333333, ans=0.0 2023-10-10 15:39:20,657 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=405533.3333333333, ans=0.1 2023-10-10 15:39:37,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=405626.6666666667, ans=0.0 2023-10-10 15:39:44,385 INFO [train.py:1031] (0/4) Epoch 7, batch 5000, loss[loss=0.2051, simple_loss=0.2936, pruned_loss=0.05826, over 15665.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.3041, pruned_loss=0.06744, over 30133384.18 frames. ], batch size: 35, lr: 5.35e-03, grad_scale: 32.0 2023-10-10 15:39:55,041 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=405673.3333333333, ans=0.0 2023-10-10 15:40:01,271 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=405720.0, ans=0.125 2023-10-10 15:40:11,476 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 15:40:27,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=405813.3333333333, ans=0.125 2023-10-10 15:40:31,776 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.47 vs. limit=12.0 2023-10-10 15:40:36,636 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.460e+02 1.739e+02 1.937e+02 2.248e+02 3.255e+02, threshold=3.875e+02, percent-clipped=0.0 2023-10-10 15:40:52,359 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.13 vs. limit=10.0
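[Editor's note: the learning rates in the train.py records decay slowly with the global step (5.40e-03 at batch 3500, 5.35e-03 at batch 5000 above, 5.30e-03 by batch 6500). This is consistent with an Eden-style schedule; the sketch below assumes base_lr=0.045, lr_batches=7500 and lr_epochs=1.0 as the run's settings, and the exact epoch/step bookkeeping is a guess. With roughly 87000 global steps done (the checkpoint-88000.pt save appears about a thousand batches later) and 6 completed epochs, the formula gives about 5.35e-03, matching the record above.]

    def eden_lr(base_lr, batch, epoch, lr_batches=7500.0, lr_epochs=1.0):
        # Close to base_lr at the start, then a power-law decay in both
        # the optimizer step count and the (completed) epoch count.
        batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
        epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
        return base_lr * batch_factor * epoch_factor

    print(f"{eden_lr(0.045, 87000, 6):.2e}")  # -> ~5.35e-03
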
2023-10-10 15:41:18,603 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=406046.6666666667, ans=0.0 2023-10-10 15:41:25,238 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=406093.3333333333, ans=0.125 2023-10-10 15:41:32,306 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=406093.3333333333, ans=0.035 2023-10-10 15:41:51,310 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=406186.6666666667, ans=0.125 2023-10-10 15:42:00,805 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=406233.3333333333, ans=0.125 2023-10-10 15:42:04,449 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=406233.3333333333, ans=0.125 2023-10-10 15:42:09,527 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=406233.3333333333, ans=0.125 2023-10-10 15:42:16,545 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=406280.0, ans=0.0 2023-10-10 15:42:19,033 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.13 vs. limit=12.0 2023-10-10 15:42:32,012 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.447e+02 1.795e+02 1.945e+02 2.585e+02 4.159e+02, threshold=3.891e+02, percent-clipped=7.0 2023-10-10 15:42:36,208 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=406373.3333333333, ans=0.2 2023-10-10 15:42:36,950 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=406373.3333333333, ans=0.125 2023-10-10 15:42:52,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=406420.0, ans=0.0 2023-10-10 15:42:56,628 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=406466.6666666667, ans=0.125 2023-10-10 15:42:58,051 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.93 vs. limit=10.0
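[Editor's note: the [scaling.py:979] Whitening records compare a per-module "whiteness" metric against a limit, as in the metric=8.93 vs. limit=10.0 entry just above. A plausible definition, sketched below, is the ratio D * mean(C*C) / mean(diag C)^2 for the D-by-D feature covariance C, computed per channel group: it equals 1.0 when the covariance is a multiple of the identity (perfectly white features) and grows as the spectrum becomes less uniform. This definition is an assumption for illustration, not a quote of icefall's scaling.py.]

    import torch

    def whitening_metric(x, num_groups=1):
        """x: (num_frames, num_channels). Returns >= 1.0; equals 1.0 when
        each group's covariance is a multiple of the identity."""
        n, c = x.shape
        d = c // num_groups
        xg = x.reshape(n, num_groups, d).transpose(0, 1)  # (groups, n, d)
        cov = xg.transpose(1, 2) @ xg / n                 # (groups, d, d)
        num = d * (cov ** 2).mean()                       # spread of entries
        den = cov.diagonal(dim1=1, dim2=2).mean() ** 2    # mean variance^2
        return (num / den).item()

    x = torch.randn(1000, 144)                    # white features, 144 channels
    print(whitening_metric(x), "vs. limit=10.0")  # metric close to 1.0
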
2023-10-10 15:43:43,542 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=406653.3333333333, ans=0.2 2023-10-10 15:44:04,725 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=406746.6666666667, ans=0.125 2023-10-10 15:44:16,789 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.397e+02 1.742e+02 1.938e+02 2.206e+02 2.945e+02, threshold=3.877e+02, percent-clipped=0.0 2023-10-10 15:44:17,159 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=406793.3333333333, ans=0.125 2023-10-10 15:44:43,376 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=406886.6666666667, ans=0.5 2023-10-10 15:44:53,471 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=406933.3333333333, ans=0.0 2023-10-10 15:44:53,520 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=406933.3333333333, ans=0.125 2023-10-10 15:44:53,606 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=406933.3333333333, ans=0.0 2023-10-10 15:45:03,955 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 15:45:19,585 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=407026.6666666667, ans=0.125 2023-10-10 15:45:30,835 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.63 vs. limit=22.5 2023-10-10 15:45:31,003 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.05 vs.
limit=15.0 2023-10-10 15:45:38,525 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=407120.0, ans=0.1 2023-10-10 15:45:58,472 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=407213.3333333333, ans=0.125 2023-10-10 15:45:58,473 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=407213.3333333333, ans=0.1 2023-10-10 15:45:58,578 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=407213.3333333333, ans=0.025 2023-10-10 15:46:10,041 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=407213.3333333333, ans=0.0 2023-10-10 15:46:11,485 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=407260.0, ans=0.07 2023-10-10 15:46:18,301 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.388e+02 1.733e+02 1.916e+02 2.188e+02 3.258e+02, threshold=3.832e+02, percent-clipped=0.0 2023-10-10 15:46:24,639 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=407306.6666666667, ans=0.125 2023-10-10 15:46:29,962 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=407306.6666666667, ans=0.09899494936611666 2023-10-10 15:47:09,415 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.59 vs. limit=15.0 2023-10-10 15:47:15,348 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=407493.3333333333, ans=0.125 2023-10-10 15:47:16,422 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=407493.3333333333, ans=0.0 2023-10-10 15:47:24,058 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=407540.0, ans=0.2 2023-10-10 15:47:30,623 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.28 vs. limit=15.0 2023-10-10 15:47:32,221 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=407586.6666666667, ans=0.125 2023-10-10 15:47:32,488 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.36 vs. limit=22.5 2023-10-10 15:47:42,056 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.75 vs. 
limit=22.5 2023-10-10 15:47:50,701 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=407680.0, ans=0.1 2023-10-10 15:47:51,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=407680.0, ans=0.125 2023-10-10 15:48:03,056 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=407726.6666666667, ans=0.125 2023-10-10 15:48:07,972 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.358e+02 1.642e+02 1.810e+02 2.061e+02 2.865e+02, threshold=3.620e+02, percent-clipped=0.0 2023-10-10 15:48:15,521 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=407773.3333333333, ans=0.025 2023-10-10 15:48:26,152 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.84 vs. limit=15.0 2023-10-10 15:48:31,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=407820.0, ans=0.2 2023-10-10 15:48:32,235 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=407820.0, ans=0.0 2023-10-10 15:48:39,669 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.13 vs. limit=10.0 2023-10-10 15:48:47,511 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=407913.3333333333, ans=0.0 2023-10-10 15:48:54,370 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=407913.3333333333, ans=0.1 2023-10-10 15:48:55,332 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=407960.0, ans=0.0 2023-10-10 15:49:02,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=407960.0, ans=0.0 2023-10-10 15:49:06,002 INFO [train.py:1031] (0/4) Epoch 7, batch 5500, loss[loss=0.2375, simple_loss=0.318, pruned_loss=0.07851, over 16576.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.3037, pruned_loss=0.06704, over 30736873.24 frames. ], batch size: 219, lr: 5.33e-03, grad_scale: 16.0 2023-10-10 15:49:19,491 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=408053.3333333333, ans=0.125 2023-10-10 15:49:22,802 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=408053.3333333333, ans=0.125 2023-10-10 15:49:55,007 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.390e+02 1.711e+02 1.889e+02 2.163e+02 3.768e+02, threshold=3.778e+02, percent-clipped=1.0 2023-10-10 15:49:55,574 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.44 vs. 
limit=15.0 2023-10-10 15:50:02,794 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=408240.0, ans=0.125 2023-10-10 15:50:17,175 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=408286.6666666667, ans=0.125 2023-10-10 15:50:19,068 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=408286.6666666667, ans=0.1 2023-10-10 15:50:20,967 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=408333.3333333333, ans=0.1 2023-10-10 15:50:28,790 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=408333.3333333333, ans=0.0 2023-10-10 15:50:38,822 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=408380.0, ans=0.2 2023-10-10 15:50:54,877 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=408473.3333333333, ans=0.0 2023-10-10 15:51:00,873 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=408473.3333333333, ans=0.0 2023-10-10 15:51:03,517 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=408520.0, ans=0.1 2023-10-10 15:51:16,298 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=408566.6666666667, ans=0.125 2023-10-10 15:51:21,672 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=408566.6666666667, ans=0.1 2023-10-10 15:51:21,770 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=408566.6666666667, ans=0.5 2023-10-10 15:51:23,935 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=408566.6666666667, ans=0.0 2023-10-10 15:51:23,958 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=408566.6666666667, ans=0.0 2023-10-10 15:51:30,869 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=408613.3333333333, ans=0.0 2023-10-10 15:51:44,254 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.648e+02 1.839e+02 2.057e+02 3.027e+02, threshold=3.678e+02, percent-clipped=0.0 2023-10-10 15:51:55,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=408706.6666666667, ans=0.2 2023-10-10 15:51:59,235 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 15:52:12,291 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.46 vs. 
limit=22.5 2023-10-10 15:52:17,333 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=408800.0, ans=0.95 2023-10-10 15:52:17,673 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.88 vs. limit=15.0 2023-10-10 15:52:18,918 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=408800.0, ans=0.0 2023-10-10 15:53:09,803 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=409033.3333333333, ans=0.09899494936611666 2023-10-10 15:53:25,342 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=409080.0, ans=0.125 2023-10-10 15:53:37,572 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.679e+02 1.861e+02 1.994e+02 2.762e+02, threshold=3.721e+02, percent-clipped=0.0 2023-10-10 15:53:45,450 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.60 vs. limit=15.0 2023-10-10 15:53:48,909 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=409173.3333333333, ans=0.2 2023-10-10 15:54:00,560 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=409220.0, ans=0.125 2023-10-10 15:54:12,405 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.17 vs. limit=22.5 2023-10-10 15:54:22,703 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=409313.3333333333, ans=0.1 2023-10-10 15:54:29,089 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=409360.0, ans=0.1 2023-10-10 15:54:49,218 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=409453.3333333333, ans=0.125 2023-10-10 15:54:59,163 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.74 vs. limit=15.0 2023-10-10 15:55:02,082 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=409500.0, ans=0.0 2023-10-10 15:55:31,903 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.421e+02 1.698e+02 1.887e+02 2.124e+02 4.294e+02, threshold=3.774e+02, percent-clipped=1.0 2023-10-10 15:55:33,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=409593.3333333333, ans=0.0 2023-10-10 15:56:00,058 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=409733.3333333333, ans=0.125 2023-10-10 15:56:07,796 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=409780.0, ans=0.1 2023-10-10 15:56:16,333 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.55 vs. 
limit=10.0 2023-10-10 15:56:43,006 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=409920.0, ans=0.1 2023-10-10 15:56:53,024 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.52 vs. limit=12.0 2023-10-10 15:56:56,541 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=409966.6666666667, ans=0.0 2023-10-10 15:57:01,489 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=409966.6666666667, ans=0.0 2023-10-10 15:57:09,192 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=410013.3333333333, ans=0.0 2023-10-10 15:57:20,701 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=410060.0, ans=0.125 2023-10-10 15:57:22,910 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.361e+02 1.749e+02 2.025e+02 2.331e+02 3.574e+02, threshold=4.049e+02, percent-clipped=0.0 2023-10-10 15:57:47,427 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=410200.0, ans=0.125 2023-10-10 15:57:47,782 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=28.76 vs. limit=22.5 2023-10-10 15:57:48,319 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=410200.0, ans=0.0 2023-10-10 15:58:07,103 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=410246.6666666667, ans=0.125 2023-10-10 15:58:20,526 INFO [train.py:1031] (0/4) Epoch 7, batch 6000, loss[loss=0.2127, simple_loss=0.2987, pruned_loss=0.06334, over 16952.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.3039, pruned_loss=0.06724, over 31180654.63 frames. ], batch size: 123, lr: 5.32e-03, grad_scale: 32.0 2023-10-10 15:58:34,326 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=410386.6666666667, ans=0.125 2023-10-10 15:58:36,615 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=410386.6666666667, ans=0.0 2023-10-10 15:58:46,566 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.63 vs. limit=15.0 2023-10-10 15:59:00,580 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=410480.0, ans=0.125 2023-10-10 15:59:15,315 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.356e+02 1.766e+02 1.906e+02 2.058e+02 2.859e+02, threshold=3.811e+02, percent-clipped=0.0 2023-10-10 15:59:19,732 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.81 vs. limit=22.5 2023-10-10 15:59:33,432 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.94 vs. limit=15.0
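[Editor's note: the [scaling.py:199] ScheduledFloat records show module hyperparameters (dropout probabilities, skip rates, balancer bounds) whose current value "ans" is a function of "batch_count"; the fractional counts (e.g. 410013.3333333333) suggest the counter is scaled, for instance by batch duration, rather than being a raw integer step. A minimal piecewise-linear sketch of such a schedule follows; the breakpoints in the usage line are invented, chosen only so the value has decayed to 0.0 by the batch counts seen here:]

    class ScheduledFloat:
        """Piecewise-linear function of batch_count (a sketch)."""

        def __init__(self, *points):
            self.points = sorted(points)  # (batch_count, value) pairs

        def __call__(self, batch_count):
            pts = self.points
            if batch_count <= pts[0][0]:
                return pts[0][1]
            if batch_count >= pts[-1][0]:
                return pts[-1][1]
            # Linear interpolation between the surrounding breakpoints.
            for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
                if x0 <= batch_count <= x1:
                    t = (batch_count - x0) / (x1 - x0)
                    return y0 + t * (y1 - y0)

    conv_skip_rate = ScheduledFloat((0.0, 0.2), (4000.0, 0.0))
    print(conv_skip_rate(410013.3333333333))  # -> 0.0, as logged above
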
2023-10-10 15:59:38,707 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-88000.pt 2023-10-10 15:59:43,651 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.84 vs. limit=6.0 2023-10-10 15:59:52,426 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.64 vs. limit=10.0 2023-10-10 15:59:57,225 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=410713.3333333333, ans=0.125 2023-10-10 16:00:02,694 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=410713.3333333333, ans=0.1 2023-10-10 16:00:06,405 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=410760.0, ans=0.2 2023-10-10 16:00:59,731 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=410993.3333333333, ans=0.125 2023-10-10 16:01:05,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=410993.3333333333, ans=0.0 2023-10-10 16:01:07,003 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.242e+02 1.683e+02 1.914e+02 2.126e+02 3.094e+02, threshold=3.828e+02, percent-clipped=0.0 2023-10-10 16:01:10,102 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.56 vs. limit=15.0 2023-10-10 16:01:12,535 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=411040.0, ans=0.0 2023-10-10 16:01:33,038 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=411133.3333333333, ans=0.1 2023-10-10 16:01:38,286 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=411133.3333333333, ans=0.125 2023-10-10 16:01:46,646 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.92 vs. limit=22.5 2023-10-10 16:01:56,662 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.36 vs. limit=10.0 2023-10-10 16:01:57,990 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.53 vs. limit=22.5
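[Editor's note: the checkpoint.py record above writes zipformer/exp_XL_bpe/checkpoint-88000.pt at a round global batch index, i.e. a batch-triggered save separate from end-of-epoch checkpoints; 88000 is a multiple of 8000, which is assumed here to be the run's save_every_n. A minimal sketch of the pattern; the function name and saved fields are illustrative, and in DDP typically only rank 0 (the "(0/4)" process in this log) performs the save:]

    from pathlib import Path
    import torch

    def maybe_save_checkpoint(model, optimizer, batch_idx,
                              exp_dir=Path("zipformer/exp_XL_bpe"),
                              save_every_n=8000, rank=0):
        # Save only on rank 0, only at non-zero multiples of save_every_n.
        if rank != 0 or batch_idx == 0 or batch_idx % save_every_n != 0:
            return
        torch.save(
            {
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "batch_idx_train": batch_idx,
            },
            exp_dir / f"checkpoint-{batch_idx}.pt",  # e.g. checkpoint-88000.pt
        )
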
2023-10-10 16:02:00,680 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=411226.6666666667, ans=0.0 2023-10-10 16:02:06,474 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=411273.3333333333, ans=0.1 2023-10-10 16:02:15,152 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=411320.0, ans=0.0 2023-10-10 16:02:56,420 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 1.839e+02 2.120e+02 2.440e+02 3.944e+02, threshold=4.241e+02, percent-clipped=1.0 2023-10-10 16:03:13,357 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=411553.3333333333, ans=0.2 2023-10-10 16:03:17,410 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=411553.3333333333, ans=0.125 2023-10-10 16:03:19,597 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=411553.3333333333, ans=0.0 2023-10-10 16:03:33,475 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=411646.6666666667, ans=0.125 2023-10-10 16:03:34,784 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.74 vs. limit=22.5 2023-10-10 16:03:41,341 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=411646.6666666667, ans=0.025 2023-10-10 16:04:06,304 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=411786.6666666667, ans=0.125 2023-10-10 16:04:14,288 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=411786.6666666667, ans=0.125 2023-10-10 16:04:17,733 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=411833.3333333333, ans=0.125 2023-10-10 16:04:21,456 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=411833.3333333333, ans=0.0 2023-10-10 16:04:32,124 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=411880.0, ans=0.0 2023-10-10 16:04:36,154 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=411880.0, ans=0.0 2023-10-10 16:04:47,495 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.59 vs. limit=15.0 2023-10-10 16:04:48,391 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.39 vs.
limit=6.0 2023-10-10 16:04:49,498 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.519e+02 1.768e+02 1.961e+02 2.265e+02 3.462e+02, threshold=3.922e+02, percent-clipped=0.0 2023-10-10 16:04:54,538 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=411973.3333333333, ans=0.125 2023-10-10 16:05:00,740 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.30 vs. limit=15.0 2023-10-10 16:05:20,395 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=412066.6666666667, ans=0.0 2023-10-10 16:05:25,122 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=412066.6666666667, ans=0.1 2023-10-10 16:05:26,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=412066.6666666667, ans=0.125 2023-10-10 16:05:34,320 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=412113.3333333333, ans=0.0 2023-10-10 16:05:42,138 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.03 vs. limit=12.0 2023-10-10 16:05:55,657 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=412206.6666666667, ans=0.125 2023-10-10 16:05:56,502 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=412206.6666666667, ans=0.0 2023-10-10 16:06:14,596 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.59 vs. limit=15.0 2023-10-10 16:06:30,518 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=412346.6666666667, ans=0.1 2023-10-10 16:06:32,487 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=412346.6666666667, ans=0.1 2023-10-10 16:06:46,293 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.332e+02 1.628e+02 1.866e+02 2.138e+02 3.487e+02, threshold=3.732e+02, percent-clipped=0.0 2023-10-10 16:07:08,313 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=412533.3333333333, ans=0.0 2023-10-10 16:07:09,049 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=412533.3333333333, ans=0.0 2023-10-10 16:07:21,006 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.70 vs. limit=15.0 2023-10-10 16:07:22,263 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=412580.0, ans=0.125 2023-10-10 16:07:27,550 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=412580.0, ans=0.025 2023-10-10 16:07:44,837 INFO [train.py:1031] (0/4) Epoch 7, batch 6500, loss[loss=0.2024, simple_loss=0.2865, pruned_loss=0.05914, over 16322.00 frames. 
2023-10-10 16:08:01,714 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.67 vs. limit=22.5 2023-10-10 16:08:06,149 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.21 vs. limit=22.5 2023-10-10 16:08:15,214 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=412766.6666666667, ans=0.2 2023-10-10 16:08:18,521 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-10 16:08:29,880 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=412813.3333333333, ans=0.0 2023-10-10 16:08:45,525 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=412860.0, ans=0.04949747468305833 2023-10-10 16:08:50,406 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.402e+02 1.810e+02 1.926e+02 2.168e+02 3.611e+02, threshold=3.852e+02, percent-clipped=0.0 2023-10-10 16:09:04,222 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=412953.3333333333, ans=0.125 2023-10-10 16:09:11,055 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=412953.3333333333, ans=0.5 2023-10-10 16:09:18,786 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=413000.0, ans=0.0 2023-10-10 16:09:18,856 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=413000.0, ans=0.0 2023-10-10 16:09:22,625 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.79 vs. limit=15.0
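In each [optim.py:471] record the reported threshold is Clipping_scale times the median of the grad-norm quartiles (for the 16:08:50 record above: 2.0 * 1.926e+02 = 3.852e+02), so clipping here adapts to a running median of recent gradient norms, and percent-clipped is the fraction of recent steps that exceeded the threshold. A minimal sketch of median-relative clipping, using a hypothetical clip_to_scaled_median helper rather than ScaledAdam's actual internals:

    import torch

    def clip_to_scaled_median(params, recent_norms, clipping_scale=2.0):
        # recent_norms: total gradient norms observed over recent steps
        threshold = clipping_scale * torch.tensor(recent_norms).median()
        total = torch.sqrt(sum((p.grad ** 2).sum()
                               for p in params if p.grad is not None))
        if total > threshold:  # scale all gradients down to the threshold
            for p in params:
                if p.grad is not None:
                    p.grad.mul_(threshold / total)
        return total, threshold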
2023-10-10 16:09:26,948 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=413046.6666666667, ans=0.2 2023-10-10 16:09:37,402 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=413093.3333333333, ans=0.2 2023-10-10 16:10:26,125 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=413280.0, ans=0.0 2023-10-10 16:10:29,778 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=413280.0, ans=0.1 2023-10-10 16:10:42,140 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.727e+02 1.947e+02 2.416e+02 4.561e+02, threshold=3.895e+02, percent-clipped=1.0 2023-10-10 16:10:52,335 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=413373.3333333333, ans=0.025 2023-10-10 16:10:59,199 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=413420.0, ans=0.1 2023-10-10 16:11:23,239 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=413513.3333333333, ans=0.1 2023-10-10 16:11:34,551 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=413560.0, ans=0.125 2023-10-10 16:11:36,328 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=413606.6666666667, ans=0.0 2023-10-10 16:11:38,845 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=413606.6666666667, ans=0.125 2023-10-10 16:11:39,499 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=413606.6666666667, ans=0.125 2023-10-10 16:12:17,508 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=413746.6666666667, ans=0.125 2023-10-10 16:12:21,447 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.70 vs. limit=6.0 2023-10-10 16:12:22,251 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=413793.3333333333, ans=0.0 2023-10-10 16:12:22,551 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=413793.3333333333, ans=10.0 2023-10-10 16:12:24,621 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=413793.3333333333, ans=0.025 2023-10-10 16:12:27,340 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=413793.3333333333, ans=0.125 2023-10-10 16:12:28,652 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=413793.3333333333, ans=0.125 2023-10-10 16:12:28,980 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.76 vs. limit=22.5
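The [scaling.py:979] Whitening records compare a per-module anisotropy metric against its (possibly scheduled) limit; when the metric drifts past the limit, the module penalizes the activations so that their covariance stays close to isotropic. The exact metric is defined in scaling.py; a plausible stand-in with the same behaviour, equal to 1.0 for perfectly white features and growing with anisotropy, is the mean squared eigenvalue of the channel covariance divided by its squared mean:

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
        # x: (num_frames, num_channels); num_channels divisible by num_groups.
        # This is a proxy for the logged "metric", not scaling.py's formula.
        metrics = []
        for g in x.chunk(num_groups, dim=1):
            g = g - g.mean(dim=0)                 # center each channel
            cov = (g.T @ g) / g.shape[0]          # channel covariance
            eigs = torch.linalg.eigvalsh(cov)     # its spectrum
            metrics.append((eigs ** 2).mean() / (eigs.mean() ** 2 + 1e-20))
        return torch.stack(metrics).mean()        # 1.0 iff isotropic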
2023-10-10 16:12:32,220 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.746e+02 1.903e+02 2.149e+02 3.646e+02, threshold=3.807e+02, percent-clipped=0.0 2023-10-10 16:12:32,723 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=413793.3333333333, ans=0.2 2023-10-10 16:12:36,832 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.65 vs. limit=22.5 2023-10-10 16:12:37,910 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=413840.0, ans=0.025 2023-10-10 16:12:37,946 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=413840.0, ans=0.125 2023-10-10 16:12:54,479 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.14 vs. limit=15.0 2023-10-10 16:13:19,812 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=413980.0, ans=0.125 2023-10-10 16:13:20,650 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=413980.0, ans=0.125 2023-10-10 16:13:33,651 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.48 vs. limit=12.0 2023-10-10 16:13:43,177 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.07 vs. limit=15.0 2023-10-10 16:13:56,694 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=414120.0, ans=0.04949747468305833 2023-10-10 16:14:20,494 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.69 vs. limit=15.0 2023-10-10 16:14:29,794 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.70 vs. limit=22.5 2023-10-10 16:14:43,531 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.333e+02 1.620e+02 1.783e+02 2.073e+02 2.507e+02, threshold=3.566e+02, percent-clipped=0.0 2023-10-10 16:14:59,783 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.56 vs. limit=15.0
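The balancer entries scattered through these records (prob, min_positive, max_positive, min_abs, max_abs) belong to activation balancers: modules that watch simple statistics of a layer's output, such as the fraction of positive values or the mean absolute value, and intervene on the gradients with the scheduled probability (prob, typically 0.125 here) when a statistic leaves its allowed range. The sketch below only illustrates the constraint being checked; the actual Balancer in scaling.py modifies gradients in the backward pass rather than adding a loss term:

    import torch

    def balancer_violation(x: torch.Tensor, min_positive=0.025, max_abs=10.0):
        # Zero while the statistics are in range; positive once out of range.
        frac_positive = (x > 0).float().mean()  # cf. the min_positive records
        mean_abs = x.abs().mean()               # cf. the min_abs/max_abs records
        return (torch.relu(min_positive - frac_positive)
                + torch.relu(mean_abs - max_abs))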
2023-10-10 16:15:02,809 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=414353.3333333333, ans=0.125 2023-10-10 16:15:11,911 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=414400.0, ans=0.125 2023-10-10 16:15:12,958 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=414400.0, ans=0.2 2023-10-10 16:15:31,136 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=414446.6666666667, ans=0.0 2023-10-10 16:15:33,558 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=414493.3333333333, ans=0.125 2023-10-10 16:15:33,639 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=414493.3333333333, ans=0.125 2023-10-10 16:15:35,021 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.95 vs. limit=6.0 2023-10-10 16:15:48,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=414540.0, ans=0.125 2023-10-10 16:15:57,064 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=414586.6666666667, ans=0.125 2023-10-10 16:16:00,306 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=414586.6666666667, ans=0.05 2023-10-10 16:16:30,686 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.17 vs. limit=15.0 2023-10-10 16:16:32,256 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=414726.6666666667, ans=0.125 2023-10-10 16:16:32,406 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=414726.6666666667, ans=0.125 2023-10-10 16:16:37,110 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.371e+02 1.708e+02 1.919e+02 2.096e+02 3.386e+02, threshold=3.838e+02, percent-clipped=0.0 2023-10-10 16:17:04,502 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=414866.6666666667, ans=0.04949747468305833 2023-10-10 16:17:07,661 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=414866.6666666667, ans=0.0 2023-10-10 16:17:31,354 INFO [train.py:1031] (0/4) Epoch 7, batch 7000, loss[loss=0.2299, simple_loss=0.3156, pruned_loss=0.07208, over 16613.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.3043, pruned_loss=0.06711, over 31777947.22 frames. ], batch size: 56, lr: 5.29e-03, grad_scale: 32.0
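grad_scale in the train.py records is the dynamic loss scale of mixed-precision training (this run has use_fp16 enabled): it halves when a step produces overflowing gradients and is ratcheted back up after a run of clean steps, which is why it moves between values such as 32.0 here and 16.0 in the batch 7500 record further down. The recipe wires this through its own training loop; the underlying pattern is the stock torch.cuda.amp one:

    import torch

    scaler = torch.cuda.amp.GradScaler()  # maintains the dynamic grad_scale

    def train_step(model, optimizer, batch, compute_loss):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():   # fp16 forward pass
            loss = compute_loss(model, batch)
        scaler.scale(loss).backward()     # backward on the scaled loss
        scaler.step(optimizer)            # skipped if gradients overflowed
        scaler.update()                   # shrinks or grows the scale
        return loss.detach()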
2023-10-10 16:17:39,382 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=415006.6666666667, ans=0.125 2023-10-10 16:17:40,674 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=415006.6666666667, ans=0.1 2023-10-10 16:17:51,918 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=415053.3333333333, ans=0.1 2023-10-10 16:17:53,151 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=415053.3333333333, ans=0.125 2023-10-10 16:18:27,062 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.393e+02 1.716e+02 1.901e+02 2.262e+02 3.962e+02, threshold=3.802e+02, percent-clipped=1.0 2023-10-10 16:18:43,704 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=415286.6666666667, ans=0.1 2023-10-10 16:18:45,448 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=1.93 vs. limit=15.0 2023-10-10 16:18:48,805 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=415286.6666666667, ans=0.125 2023-10-10 16:18:52,537 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=415333.3333333333, ans=0.125 2023-10-10 16:19:10,209 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=415380.0, ans=0.125 2023-10-10 16:19:16,170 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=415426.6666666667, ans=0.1 2023-10-10 16:19:46,857 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.60 vs. limit=15.0 2023-10-10 16:19:50,071 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=415566.6666666667, ans=0.2 2023-10-10 16:19:54,286 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=415566.6666666667, ans=0.125 2023-10-10 16:19:54,312 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=415566.6666666667, ans=0.0 2023-10-10 16:20:01,961 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=415613.3333333333, ans=15.0 2023-10-10 16:20:04,337 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.05 vs.
limit=15.0 2023-10-10 16:20:12,890 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=415660.0, ans=0.2 2023-10-10 16:20:19,364 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.402e+02 1.765e+02 2.001e+02 2.380e+02 3.440e+02, threshold=4.001e+02, percent-clipped=0.0 2023-10-10 16:20:28,443 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=415706.6666666667, ans=0.0 2023-10-10 16:20:29,554 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.92 vs. limit=15.0 2023-10-10 16:20:33,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=415753.3333333333, ans=0.2 2023-10-10 16:20:37,904 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=415753.3333333333, ans=0.1 2023-10-10 16:21:29,190 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=11.71 vs. limit=22.5 2023-10-10 16:21:36,630 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=415986.6666666667, ans=0.05 2023-10-10 16:21:43,724 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=415986.6666666667, ans=0.125 2023-10-10 16:21:53,060 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.98 vs. limit=6.0 2023-10-10 16:22:26,015 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.349e+02 1.669e+02 1.836e+02 2.091e+02 3.500e+02, threshold=3.672e+02, percent-clipped=0.0 2023-10-10 16:22:45,891 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.22 vs. limit=10.0 2023-10-10 16:22:50,604 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=416220.0, ans=0.125 2023-10-10 16:22:56,894 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=416266.6666666667, ans=0.125 2023-10-10 16:23:11,909 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=416313.3333333333, ans=0.07 2023-10-10 16:23:11,922 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=416313.3333333333, ans=0.07 2023-10-10 16:23:22,162 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=416360.0, ans=0.0 2023-10-10 16:23:58,399 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=416500.0, ans=0.125 2023-10-10 16:24:12,170 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.88 vs. 
limit=22.5 2023-10-10 16:24:13,771 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=416546.6666666667, ans=0.0 2023-10-10 16:24:24,045 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=416593.3333333333, ans=0.125 2023-10-10 16:24:31,376 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.410e+02 1.655e+02 1.843e+02 2.140e+02 2.855e+02, threshold=3.686e+02, percent-clipped=0.0 2023-10-10 16:25:06,232 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=416733.3333333333, ans=0.125 2023-10-10 16:25:11,496 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=416780.0, ans=0.125 2023-10-10 16:25:23,043 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=416826.6666666667, ans=0.125 2023-10-10 16:25:24,850 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=416826.6666666667, ans=0.125 2023-10-10 16:25:35,774 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.97 vs. limit=22.5 2023-10-10 16:25:37,239 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=416873.3333333333, ans=0.125 2023-10-10 16:25:50,270 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=416920.0, ans=0.0 2023-10-10 16:26:03,490 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=416966.6666666667, ans=0.125 2023-10-10 16:26:05,407 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=416966.6666666667, ans=0.125 2023-10-10 16:26:21,870 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=417060.0, ans=0.125 2023-10-10 16:26:29,015 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.294e+02 1.702e+02 1.874e+02 2.042e+02 3.327e+02, threshold=3.749e+02, percent-clipped=0.0 2023-10-10 16:26:29,344 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=417106.6666666667, ans=0.125 2023-10-10 16:26:33,680 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=417106.6666666667, ans=0.125 2023-10-10 16:26:41,984 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=417153.3333333333, ans=0.125 2023-10-10 16:26:42,746 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=417153.3333333333, ans=0.0 2023-10-10 16:26:54,723 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=417200.0, ans=0.0 2023-10-10 16:27:23,666 INFO [train.py:1031] (0/4) Epoch 7, batch 7500, loss[loss=0.2356, simple_loss=0.3126, pruned_loss=0.07929, over 16603.00 frames. 
], tot_loss[loss=0.2192, simple_loss=0.3041, pruned_loss=0.06717, over 31988311.79 frames. ], batch size: 219, lr: 5.27e-03, grad_scale: 16.0 2023-10-10 16:27:33,375 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=417386.6666666667, ans=0.0 2023-10-10 16:27:42,442 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.94 vs. limit=15.0 2023-10-10 16:27:53,026 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=417433.3333333333, ans=0.07 2023-10-10 16:27:53,862 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=417433.3333333333, ans=0.04949747468305833 2023-10-10 16:28:01,694 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=417480.0, ans=15.0 2023-10-10 16:28:09,742 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=417480.0, ans=0.125 2023-10-10 16:28:17,780 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=417526.6666666667, ans=0.2 2023-10-10 16:28:24,679 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.499e+02 1.759e+02 2.037e+02 2.394e+02 3.927e+02, threshold=4.075e+02, percent-clipped=1.0 2023-10-10 16:28:41,467 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.41 vs. limit=15.0 2023-10-10 16:28:52,088 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=417666.6666666667, ans=0.125 2023-10-10 16:28:52,960 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=417666.6666666667, ans=0.125 2023-10-10 16:28:58,599 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=417666.6666666667, ans=0.0 2023-10-10 16:29:31,171 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=417806.6666666667, ans=0.125 2023-10-10 16:29:31,303 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=417806.6666666667, ans=0.1 2023-10-10 16:29:42,466 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=417853.3333333333, ans=0.1 2023-10-10 16:29:48,333 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.89 vs. limit=12.0 2023-10-10 16:29:49,196 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=417900.0, ans=0.0 2023-10-10 16:29:58,583 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.67 vs. limit=10.0 2023-10-10 16:30:23,419 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.45 vs. 
limit=15.0 2023-10-10 16:30:31,075 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.707e+02 1.888e+02 2.168e+02 3.818e+02, threshold=3.775e+02, percent-clipped=0.0 2023-10-10 16:30:43,361 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=418086.6666666667, ans=0.0 2023-10-10 16:30:47,074 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=418086.6666666667, ans=0.2 2023-10-10 16:31:00,172 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=418133.3333333333, ans=0.125 2023-10-10 16:31:01,502 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.02 vs. limit=15.0 2023-10-10 16:31:11,332 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.79 vs. limit=6.0 2023-10-10 16:31:26,506 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=418226.6666666667, ans=0.0 2023-10-10 16:31:37,442 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=418273.3333333333, ans=0.0 2023-10-10 16:32:05,730 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 16:32:14,659 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=418413.3333333333, ans=0.0 2023-10-10 16:32:32,981 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.283e+02 1.738e+02 1.979e+02 2.253e+02 5.503e+02, threshold=3.958e+02, percent-clipped=2.0 2023-10-10 16:32:54,760 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=418600.0, ans=0.1 2023-10-10 16:33:02,206 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.38 vs. limit=22.5 2023-10-10 16:33:09,135 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=418646.6666666667, ans=0.125 2023-10-10 16:33:11,053 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=418646.6666666667, ans=0.2 2023-10-10 16:33:12,144 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=418646.6666666667, ans=0.125 2023-10-10 16:33:12,248 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=418646.6666666667, ans=0.125 2023-10-10 16:33:23,664 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=418693.3333333333, ans=0.125 2023-10-10 16:33:29,653 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=418740.0, ans=0.5 2023-10-10 16:33:31,793 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.63 vs. 
limit=15.0 2023-10-10 16:33:31,912 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.20 vs. limit=12.0 2023-10-10 16:33:41,654 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=418786.6666666667, ans=0.0 2023-10-10 16:33:52,634 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.69 vs. limit=10.0 2023-10-10 16:34:07,603 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.51 vs. limit=6.0 2023-10-10 16:34:19,207 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=418880.0, ans=0.2 2023-10-10 16:34:29,671 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=418926.6666666667, ans=0.0 2023-10-10 16:34:31,515 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=418926.6666666667, ans=0.125 2023-10-10 16:34:34,087 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.768e+02 1.946e+02 2.206e+02 2.921e+02, threshold=3.892e+02, percent-clipped=0.0 2023-10-10 16:34:39,748 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=418973.3333333333, ans=0.2 2023-10-10 16:35:03,593 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=419066.6666666667, ans=0.125 2023-10-10 16:35:09,595 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=419113.3333333333, ans=0.0 2023-10-10 16:35:20,221 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.34 vs. limit=15.0 2023-10-10 16:35:32,425 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.06 vs. limit=12.0 2023-10-10 16:36:24,977 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=419440.0, ans=0.125 2023-10-10 16:36:25,677 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.314e+02 1.618e+02 1.800e+02 2.097e+02 3.234e+02, threshold=3.600e+02, percent-clipped=0.0 2023-10-10 16:36:26,463 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.60 vs. 
limit=22.5 2023-10-10 16:36:31,118 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=419440.0, ans=0.125 2023-10-10 16:36:32,790 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=419440.0, ans=0.0 2023-10-10 16:37:03,461 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=419580.0, ans=0.1 2023-10-10 16:37:12,769 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=419580.0, ans=0.125 2023-10-10 16:37:21,155 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.69 vs. limit=15.0 2023-10-10 16:37:25,160 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.42 vs. limit=10.0 2023-10-10 16:37:26,153 INFO [train.py:1031] (0/4) Epoch 7, batch 8000, loss[loss=0.1777, simple_loss=0.2748, pruned_loss=0.04029, over 16925.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.3035, pruned_loss=0.06645, over 32184184.96 frames. ], batch size: 104, lr: 5.26e-03, grad_scale: 32.0 2023-10-10 16:37:34,419 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=419673.3333333333, ans=0.0 2023-10-10 16:37:36,917 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=419673.3333333333, ans=0.125 2023-10-10 16:37:46,580 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=419720.0, ans=0.1 2023-10-10 16:37:55,419 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=419766.6666666667, ans=0.125 2023-10-10 16:38:20,997 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.358e+02 1.595e+02 1.698e+02 1.908e+02 3.006e+02, threshold=3.396e+02, percent-clipped=0.0 2023-10-10 16:38:30,845 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=419906.6666666667, ans=0.5 2023-10-10 16:38:39,237 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.18 vs. 
limit=15.0 2023-10-10 16:38:46,106 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=420000.0, ans=0.1 2023-10-10 16:39:00,012 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=420046.6666666667, ans=0.125 2023-10-10 16:39:30,877 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=420186.6666666667, ans=0.0 2023-10-10 16:40:16,608 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=420326.6666666667, ans=0.0 2023-10-10 16:40:18,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=420373.3333333333, ans=0.2 2023-10-10 16:40:19,396 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.472e+02 1.775e+02 1.980e+02 2.191e+02 3.122e+02, threshold=3.959e+02, percent-clipped=0.0 2023-10-10 16:40:41,670 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=420420.0, ans=0.2 2023-10-10 16:40:58,263 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=420466.6666666667, ans=0.2 2023-10-10 16:41:11,214 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=420513.3333333333, ans=0.125 2023-10-10 16:41:32,926 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.01 vs. limit=15.0 2023-10-10 16:41:47,828 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=420653.3333333333, ans=0.025 2023-10-10 16:41:55,973 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.92 vs. 
limit=12.0 2023-10-10 16:42:33,684 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=420793.3333333333, ans=0.0 2023-10-10 16:42:38,284 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.303e+02 1.638e+02 1.821e+02 2.069e+02 3.139e+02, threshold=3.641e+02, percent-clipped=0.0 2023-10-10 16:42:43,011 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=420840.0, ans=0.1 2023-10-10 16:42:45,917 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=420840.0, ans=0.05 2023-10-10 16:42:54,923 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=420886.6666666667, ans=0.95 2023-10-10 16:42:56,574 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=420886.6666666667, ans=0.125 2023-10-10 16:43:01,774 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=420933.3333333333, ans=0.125 2023-10-10 16:43:17,260 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=420980.0, ans=0.125 2023-10-10 16:43:19,255 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=420980.0, ans=0.07 2023-10-10 16:43:27,660 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=421026.6666666667, ans=0.125 2023-10-10 16:43:34,566 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=421026.6666666667, ans=0.09899494936611666 2023-10-10 16:43:43,577 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=421073.3333333333, ans=0.1 2023-10-10 16:43:44,723 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=421073.3333333333, ans=0.125 2023-10-10 16:43:44,780 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=421073.3333333333, ans=0.125 2023-10-10 16:43:44,799 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=421073.3333333333, ans=0.2 2023-10-10 16:43:46,434 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=421073.3333333333, ans=0.1 2023-10-10 16:44:00,502 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=421120.0, ans=0.0 2023-10-10 16:44:19,029 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=421213.3333333333, ans=0.125 2023-10-10 16:44:19,062 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 16:44:43,353 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=421306.6666666667, ans=0.125 2023-10-10 16:44:43,997 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.356e+02 1.700e+02 1.904e+02 2.164e+02 3.559e+02, threshold=3.807e+02, percent-clipped=0.0
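The [scaling.py:1069] WithLoss records attach an auxiliary penalty to the self-attention weights and report its contribution per logging interval; loss-sum=0.000e+00 throughout this span means the penalty is currently contributing nothing. One general way to graft an auxiliary gradient onto an intermediate tensor without changing the forward pass is a custom autograd function; the sketch below, including the quadratic penalty, is hypothetical and not the scaling.py implementation:

    import torch

    class WithAuxLoss(torch.autograd.Function):
        # Identity in the forward pass; the backward pass adds the gradient
        # of an auxiliary penalty computed on the same tensor.
        @staticmethod
        def forward(ctx, x, aux_scale: float):
            ctx.save_for_backward(x)
            ctx.aux_scale = aux_scale
            return x.clone()

        @staticmethod
        def backward(ctx, grad_out):
            (x,) = ctx.saved_tensors
            with torch.enable_grad():
                xd = x.detach().requires_grad_(True)
                penalty = ctx.aux_scale * (xd ** 2).mean()  # illustrative penalty
                (aux_grad,) = torch.autograd.grad(penalty, xd)
            return grad_out + aux_grad, None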
2023-10-10 16:45:09,367 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=421400.0, ans=0.125 2023-10-10 16:45:28,414 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.67 vs. limit=15.0 2023-10-10 16:45:44,470 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=421540.0, ans=0.1 2023-10-10 16:45:59,220 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=421586.6666666667, ans=0.125 2023-10-10 16:46:09,634 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=421633.3333333333, ans=0.125 2023-10-10 16:46:09,691 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=421633.3333333333, ans=0.0 2023-10-10 16:46:18,351 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=421633.3333333333, ans=0.1 2023-10-10 16:46:22,821 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=421680.0, ans=0.0 2023-10-10 16:46:45,812 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.713e+02 1.887e+02 2.097e+02 3.418e+02, threshold=3.775e+02, percent-clipped=0.0 2023-10-10 16:46:49,979 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=421773.3333333333, ans=0.1 2023-10-10 16:46:57,819 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=421820.0, ans=0.09899494936611666 2023-10-10 16:47:10,235 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=421866.6666666667, ans=0.1 2023-10-10 16:47:27,906 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=421913.3333333333, ans=0.125 2023-10-10 16:47:39,992 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=421960.0, ans=0.0 2023-10-10 16:47:41,268 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=421960.0, ans=0.125 2023-10-10 16:47:42,067 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=421960.0, ans=0.0 2023-10-10 16:47:47,097 INFO [train.py:1031] (0/4) Epoch 7, batch 8500, loss[loss=0.2179, simple_loss=0.3043, pruned_loss=0.06573, over 16879.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.3039, pruned_loss=0.06651, over 32317042.90 frames.
], batch size: 130, lr: 5.25e-03, grad_scale: 32.0 2023-10-10 16:48:11,513 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=422100.0, ans=0.125 2023-10-10 16:48:22,263 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=422146.6666666667, ans=0.2 2023-10-10 16:48:23,508 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.74 vs. limit=12.0 2023-10-10 16:48:29,036 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=422146.6666666667, ans=10.0 2023-10-10 16:48:46,713 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.92 vs. limit=10.0 2023-10-10 16:48:46,994 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.289e+02 1.693e+02 1.852e+02 2.060e+02 3.508e+02, threshold=3.705e+02, percent-clipped=0.0 2023-10-10 16:48:59,566 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=422286.6666666667, ans=0.125 2023-10-10 16:49:02,236 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=422286.6666666667, ans=0.125 2023-10-10 16:49:07,697 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=422286.6666666667, ans=0.125 2023-10-10 16:49:17,041 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=422333.3333333333, ans=0.125 2023-10-10 16:49:38,783 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=422426.6666666667, ans=0.125 2023-10-10 16:49:46,855 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=422426.6666666667, ans=0.0 2023-10-10 16:49:55,262 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=422473.3333333333, ans=0.125 2023-10-10 16:50:08,627 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=422520.0, ans=0.2 2023-10-10 16:50:23,745 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.01 vs. limit=15.0 2023-10-10 16:50:28,531 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.41 vs. limit=15.0 2023-10-10 16:50:36,815 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=422613.3333333333, ans=0.125 2023-10-10 16:50:38,009 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.66 vs. limit=22.5 2023-10-10 16:50:49,313 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.07 vs. 
limit=22.5 2023-10-10 16:50:58,324 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.295e+02 1.621e+02 1.814e+02 2.057e+02 3.389e+02, threshold=3.628e+02, percent-clipped=0.0 2023-10-10 16:51:21,180 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=422753.3333333333, ans=0.2 2023-10-10 16:51:21,200 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=422753.3333333333, ans=0.0 2023-10-10 16:51:24,376 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.78 vs. limit=10.0 2023-10-10 16:51:31,949 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=422800.0, ans=0.09899494936611666 2023-10-10 16:51:48,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=422846.6666666667, ans=0.0 2023-10-10 16:52:00,258 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=422893.3333333333, ans=0.125 2023-10-10 16:52:11,445 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=422940.0, ans=0.2 2023-10-10 16:52:28,471 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=422986.6666666667, ans=0.1 2023-10-10 16:53:04,527 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=423126.6666666667, ans=0.0 2023-10-10 16:53:05,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=423126.6666666667, ans=0.0 2023-10-10 16:53:14,013 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.673e+02 1.829e+02 2.197e+02 3.613e+02, threshold=3.659e+02, percent-clipped=0.0 2023-10-10 16:53:23,534 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=423173.3333333333, ans=0.125 2023-10-10 16:53:55,151 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=423313.3333333333, ans=0.125 2023-10-10 16:53:55,412 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.71 vs. limit=15.0 2023-10-10 16:53:59,354 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=423313.3333333333, ans=0.125 2023-10-10 16:54:07,301 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.18 vs. limit=12.0 2023-10-10 16:54:19,760 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=423406.6666666667, ans=0.125 2023-10-10 16:54:20,235 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.82 vs. 
limit=15.0 2023-10-10 16:54:33,236 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=423453.3333333333, ans=0.2 2023-10-10 16:55:13,178 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.22 vs. limit=15.0 2023-10-10 16:55:18,797 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.369e+02 1.742e+02 2.092e+02 2.496e+02 3.647e+02, threshold=4.184e+02, percent-clipped=0.0 2023-10-10 16:55:42,890 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=423733.3333333333, ans=0.125 2023-10-10 16:55:43,248 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.79 vs. limit=12.0 2023-10-10 16:55:44,923 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=423733.3333333333, ans=0.125 2023-10-10 16:55:47,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=423733.3333333333, ans=0.2 2023-10-10 16:56:18,411 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=423873.3333333333, ans=0.125 2023-10-10 16:56:24,135 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=423873.3333333333, ans=0.2 2023-10-10 16:56:24,206 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=423873.3333333333, ans=0.0 2023-10-10 16:56:41,702 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=423966.6666666667, ans=0.125 2023-10-10 16:56:56,790 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=424013.3333333333, ans=0.05 2023-10-10 16:57:03,110 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=424060.0, ans=0.125 2023-10-10 16:57:10,531 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.405e+02 1.729e+02 1.875e+02 2.213e+02 2.767e+02, threshold=3.751e+02, percent-clipped=0.0 2023-10-10 16:57:30,353 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=424153.3333333333, ans=0.0 2023-10-10 16:57:44,036 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=424200.0, ans=0.1 2023-10-10 16:57:46,674 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=424246.6666666667, ans=0.125 2023-10-10 16:57:46,764 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=424246.6666666667, ans=0.0 2023-10-10 16:57:48,357 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.96 vs. limit=15.0 2023-10-10 16:57:55,857 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.11 vs. 
limit=12.0 2023-10-10 16:58:04,107 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=424293.3333333333, ans=0.1 2023-10-10 16:58:07,665 INFO [train.py:1031] (0/4) Epoch 7, batch 9000, loss[loss=0.2981, simple_loss=0.3515, pruned_loss=0.1224, over 15667.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.3032, pruned_loss=0.06623, over 32430785.00 frames. ], batch size: 350, lr: 5.23e-03, grad_scale: 32.0 2023-10-10 16:58:22,843 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 16:58:24,719 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 16:58:43,770 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=424480.0, ans=0.2 2023-10-10 16:58:54,582 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=424526.6666666667, ans=0.125 2023-10-10 16:59:04,237 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=424526.6666666667, ans=0.125 2023-10-10 16:59:06,613 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.711e+02 1.861e+02 2.076e+02 3.072e+02, threshold=3.722e+02, percent-clipped=0.0 2023-10-10 16:59:07,047 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=424573.3333333333, ans=0.2 2023-10-10 16:59:21,667 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=424620.0, ans=0.1 2023-10-10 16:59:48,009 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.07 vs. limit=15.0 2023-10-10 16:59:54,653 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=424760.0, ans=0.125 2023-10-10 17:00:00,931 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.86 vs. limit=10.0 2023-10-10 17:00:09,138 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=424806.6666666667, ans=0.0 2023-10-10 17:00:24,958 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=424853.3333333333, ans=0.125 2023-10-10 17:00:30,892 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.01 vs. limit=15.0 2023-10-10 17:00:35,077 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=424900.0, ans=0.1 2023-10-10 17:00:45,668 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=424946.6666666667, ans=0.2 2023-10-10 17:00:50,190 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.26 vs. 
limit=15.0 2023-10-10 17:01:01,903 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.707e+02 1.847e+02 2.089e+02 2.824e+02, threshold=3.693e+02, percent-clipped=0.0 2023-10-10 17:01:10,762 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 17:01:12,640 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=425086.6666666667, ans=0.125 2023-10-10 17:01:13,497 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=425086.6666666667, ans=0.04949747468305833 2023-10-10 17:01:21,571 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=425133.3333333333, ans=0.09899494936611666 2023-10-10 17:01:34,123 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=425180.0, ans=0.125 2023-10-10 17:01:40,036 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=425180.0, ans=0.0 2023-10-10 17:01:45,287 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=425226.6666666667, ans=0.0 2023-10-10 17:01:53,381 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=425226.6666666667, ans=0.0 2023-10-10 17:01:53,407 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=425226.6666666667, ans=0.125 2023-10-10 17:02:06,131 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.68 vs. limit=15.0 2023-10-10 17:02:07,221 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.13 vs. limit=15.0 2023-10-10 17:02:27,371 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=425366.6666666667, ans=0.1 2023-10-10 17:02:41,782 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=425460.0, ans=0.125 2023-10-10 17:02:51,226 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.76 vs. limit=12.0 2023-10-10 17:02:54,519 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.786e+02 1.965e+02 2.289e+02 3.111e+02, threshold=3.929e+02, percent-clipped=0.0 2023-10-10 17:03:06,342 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=425553.3333333333, ans=0.125 2023-10-10 17:03:19,887 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.54 vs. limit=15.0 2023-10-10 17:03:25,218 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=425646.6666666667, ans=0.015 2023-10-10 17:03:28,649 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.00 vs. 
limit=22.5 2023-10-10 17:03:31,736 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=425646.6666666667, ans=0.2 2023-10-10 17:03:49,485 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=425740.0, ans=0.125 2023-10-10 17:03:49,586 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=425740.0, ans=0.125 2023-10-10 17:03:50,575 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=425740.0, ans=0.1 2023-10-10 17:04:04,602 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=425786.6666666667, ans=0.0 2023-10-10 17:04:32,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=425880.0, ans=0.125 2023-10-10 17:04:52,378 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.406e+02 1.760e+02 1.964e+02 2.195e+02 3.007e+02, threshold=3.929e+02, percent-clipped=0.0 2023-10-10 17:05:22,619 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=426066.6666666667, ans=0.2 2023-10-10 17:05:32,014 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=426113.3333333333, ans=0.125 2023-10-10 17:05:44,925 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=426160.0, ans=0.0 2023-10-10 17:05:53,285 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten.whitening_limit, batch_count=426206.6666666667, ans=15.0 2023-10-10 17:05:55,130 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.61 vs. limit=15.0 2023-10-10 17:06:16,990 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=426300.0, ans=0.125 2023-10-10 17:06:18,923 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=426300.0, ans=0.125 2023-10-10 17:06:19,246 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.62 vs. limit=15.0 2023-10-10 17:06:29,302 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=426346.6666666667, ans=0.1 2023-10-10 17:06:32,007 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.16 vs. limit=15.0 2023-10-10 17:06:38,796 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=426346.6666666667, ans=0.125 2023-10-10 17:06:47,406 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.97 vs. 
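The `scaling.py:979` "Whitening" entries compare a `metric` against a `limit` for a named activation; the metric measures how far the channel covariance is from isotropic ("white"), and the module only intervenes once the metric exceeds the limit. A sketch of one such metric, under the assumption that it is the normalized trace of the squared covariance (equal to 1.0 for a perfectly white signal); the exact formula used by the recipe may differ:

```python
import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    """How far channel covariance is from isotropic; 1.0 == perfectly white.

    Assumed formula: for per-group covariance C over g channels,
    metric = g * trace(C @ C) / trace(C)**2, averaged over groups.
    Equals 1 iff C is a multiple of the identity, and grows with anisotropy.
    """
    num_frames, num_channels = x.shape
    g = num_channels // num_groups
    x = x.reshape(num_frames, num_groups, g).transpose(0, 1)  # (G, N, g)
    cov = torch.matmul(x.transpose(1, 2), x) / num_frames     # (G, g, g)
    num = g * (cov * cov).sum(dim=(1, 2))
    den = cov.diagonal(dim1=1, dim2=2).sum(dim=1) ** 2
    return (num / den).mean().item()

# White noise scores near 1.0, far below limits like 15.0 or 22.5:
print(whitening_metric(torch.randn(10000, 256)))
```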
limit=15.0 2023-10-10 17:06:53,954 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.694e+02 1.863e+02 2.135e+02 3.320e+02, threshold=3.727e+02, percent-clipped=0.0 2023-10-10 17:07:01,122 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 17:07:04,985 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=426486.6666666667, ans=0.0 2023-10-10 17:07:05,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=426486.6666666667, ans=0.125 2023-10-10 17:07:09,281 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=426486.6666666667, ans=0.125 2023-10-10 17:07:11,564 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=426486.6666666667, ans=0.0 2023-10-10 17:07:23,319 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=426533.3333333333, ans=0.125 2023-10-10 17:07:39,533 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=426580.0, ans=0.0 2023-10-10 17:07:55,218 INFO [train.py:1031] (0/4) Epoch 7, batch 9500, loss[loss=0.2185, simple_loss=0.3109, pruned_loss=0.06303, over 16834.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.3039, pruned_loss=0.06651, over 32499880.89 frames. ], batch size: 175, lr: 5.22e-03, grad_scale: 32.0 2023-10-10 17:08:13,358 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=426720.0, ans=0.125 2023-10-10 17:08:22,602 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=426766.6666666667, ans=0.0 2023-10-10 17:08:22,626 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=426766.6666666667, ans=0.1 2023-10-10 17:08:31,283 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=426813.3333333333, ans=0.0 2023-10-10 17:08:36,903 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.17 vs. 
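The `scaling.py:1069` "WithLoss" entries attach a small auxiliary penalty to a tensor (here self-attention weights) and report its accumulated value; `loss-sum=0.000e+00` means the constraint is currently satisfied, while the occasional non-zero sum (e.g. 2.541e-03 further down this log) shows the term firing. A sketch of the mechanism, assuming an identity forward pass whose backward injects the penalty's gradient; the names and the specific penalty are illustrative:

```python
import torch

class WithLoss(torch.autograd.Function):
    """Identity in the forward pass; adds d(aux_loss)/dx in the backward.

    Sketch of attaching an auxiliary penalty to a tensor such as attention
    weights without changing the values it carries forward.
    """

    @staticmethod
    def forward(ctx, x, aux_grad):
        ctx.save_for_backward(aux_grad)
        return x

    @staticmethod
    def backward(ctx, grad_output):
        (aux_grad,) = ctx.saved_tensors
        return grad_output + aux_grad, None

def penalize_values_gt(x: torch.Tensor, limit: float, scale: float):
    # Penalty (and its gradient) is nonzero only where |x| exceeds `limit`;
    # the returned sum is what a `loss-sum=...` line would report.
    excess = (x.abs() - limit).clamp(min=0.0)
    aux_grad = (scale * excess * x.sign()).detach()
    loss_sum = (0.5 * scale * excess ** 2).sum().detach()
    return WithLoss.apply(x, aux_grad), loss_sum
```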
limit=15.0 2023-10-10 17:08:40,307 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=426860.0, ans=0.125 2023-10-10 17:08:55,091 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.706e+02 1.940e+02 2.208e+02 2.982e+02, threshold=3.879e+02, percent-clipped=0.0 2023-10-10 17:09:13,010 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=426953.3333333333, ans=0.125 2023-10-10 17:09:14,907 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=426953.3333333333, ans=0.2 2023-10-10 17:09:16,850 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=427000.0, ans=0.05 2023-10-10 17:09:25,396 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=427000.0, ans=0.0 2023-10-10 17:09:55,728 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=427140.0, ans=6.0 2023-10-10 17:09:58,470 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 17:10:00,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=427140.0, ans=0.1 2023-10-10 17:10:01,914 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=427140.0, ans=15.0 2023-10-10 17:10:04,622 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.58 vs. limit=10.0 2023-10-10 17:10:20,458 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.74 vs. limit=15.0 2023-10-10 17:10:25,048 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=427233.3333333333, ans=0.1 2023-10-10 17:10:34,854 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.05 vs. 
limit=22.5 2023-10-10 17:10:37,493 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=427280.0, ans=0.0 2023-10-10 17:10:51,396 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.353e+02 1.680e+02 1.848e+02 2.237e+02 3.126e+02, threshold=3.695e+02, percent-clipped=0.0 2023-10-10 17:11:01,706 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=427420.0, ans=0.05 2023-10-10 17:11:02,794 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=427420.0, ans=0.125 2023-10-10 17:11:11,163 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=427420.0, ans=0.0 2023-10-10 17:11:22,010 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=427466.6666666667, ans=0.125 2023-10-10 17:11:25,143 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=427513.3333333333, ans=0.125 2023-10-10 17:12:06,271 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=427653.3333333333, ans=0.0 2023-10-10 17:12:44,222 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.303e+02 1.637e+02 1.752e+02 2.013e+02 2.658e+02, threshold=3.504e+02, percent-clipped=0.0 2023-10-10 17:13:02,123 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=427886.6666666667, ans=0.125 2023-10-10 17:13:18,601 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=427933.3333333333, ans=0.125 2023-10-10 17:13:20,309 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=427933.3333333333, ans=0.125 2023-10-10 17:13:35,630 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=428026.6666666667, ans=0.125 2023-10-10 17:14:33,862 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.55 vs. limit=22.5 2023-10-10 17:14:45,146 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.92 vs. 
limit=10.0 2023-10-10 17:15:00,209 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.405e+02 1.686e+02 1.836e+02 2.068e+02 3.231e+02, threshold=3.672e+02, percent-clipped=0.0 2023-10-10 17:15:20,908 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=428400.0, ans=0.0 2023-10-10 17:15:27,960 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=428400.0, ans=0.0 2023-10-10 17:15:41,759 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=428446.6666666667, ans=0.0 2023-10-10 17:15:47,072 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=428493.3333333333, ans=0.125 2023-10-10 17:16:04,442 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.48 vs. limit=15.0 2023-10-10 17:16:25,903 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=428633.3333333333, ans=0.1 2023-10-10 17:16:38,691 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=428680.0, ans=0.2 2023-10-10 17:16:42,005 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=428726.6666666667, ans=0.0 2023-10-10 17:16:53,675 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.332e+02 1.700e+02 1.849e+02 2.113e+02 3.177e+02, threshold=3.698e+02, percent-clipped=0.0 2023-10-10 17:16:55,108 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.58 vs. limit=15.0 2023-10-10 17:16:55,882 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=428773.3333333333, ans=0.1 2023-10-10 17:17:02,434 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=428820.0, ans=0.125 2023-10-10 17:17:29,650 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=428913.3333333333, ans=0.125 2023-10-10 17:17:48,489 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=428960.0, ans=0.125 2023-10-10 17:17:48,529 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=428960.0, ans=0.125 2023-10-10 17:17:49,551 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=429006.6666666667, ans=0.125 2023-10-10 17:17:50,771 INFO [train.py:1031] (0/4) Epoch 7, batch 10000, loss[loss=0.2781, simple_loss=0.3412, pruned_loss=0.1075, over 15997.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.3029, pruned_loss=0.0661, over 32548828.78 frames. 
], batch size: 296, lr: 5.20e-03, grad_scale: 32.0 2023-10-10 17:17:54,947 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=429006.6666666667, ans=0.125 2023-10-10 17:18:30,834 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=429146.6666666667, ans=0.125 2023-10-10 17:18:32,030 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=429146.6666666667, ans=0.2 2023-10-10 17:18:36,862 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=429193.3333333333, ans=0.125 2023-10-10 17:18:47,755 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=429240.0, ans=0.05 2023-10-10 17:18:48,345 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.293e+02 1.772e+02 2.004e+02 2.323e+02 3.394e+02, threshold=4.009e+02, percent-clipped=0.0 2023-10-10 17:18:59,754 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.01 vs. limit=22.5 2023-10-10 17:19:06,442 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.35 vs. limit=12.0 2023-10-10 17:19:41,838 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=429426.6666666667, ans=0.0 2023-10-10 17:19:47,394 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=429473.3333333333, ans=0.125 2023-10-10 17:19:51,585 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=429473.3333333333, ans=0.125 2023-10-10 17:19:57,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=429520.0, ans=0.125 2023-10-10 17:20:15,638 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=429566.6666666667, ans=0.0 2023-10-10 17:20:23,837 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=429613.3333333333, ans=0.0 2023-10-10 17:20:27,490 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=429613.3333333333, ans=0.2 2023-10-10 17:20:38,906 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=429706.6666666667, ans=0.125 2023-10-10 17:20:40,314 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.634e+02 1.822e+02 1.991e+02 2.580e+02, threshold=3.645e+02, percent-clipped=0.0 2023-10-10 17:20:46,890 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=429706.6666666667, ans=0.125 2023-10-10 17:20:49,486 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.57 vs. 
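The `train.py:1031` progress lines decompose the objective into `simple_loss` (from the cheap linear joiner that drives the pruning) and `pruned_loss` (the full joiner evaluated only inside the pruned lattice); `loss[...]` is the current batch and `tot_loss[...]` a running aggregate over the frame counts shown. The logged numbers are consistent with a fixed combination `loss = 0.5 * simple_loss + pruned_loss` at this stage of training (the weights are warmed up earlier in the run), which can be checked directly against the entries:

```python
def combined_loss(simple_loss: float, pruned_loss: float,
                  simple_loss_scale: float = 0.5) -> float:
    # Combination inferred from the logged values, e.g. batch 9500:
    # 0.5 * 0.3039 + 0.06651 = 0.2185, matching the `loss=` field.
    return simple_loss_scale * simple_loss + pruned_loss

assert abs(combined_loss(0.3039, 0.06651) - 0.2185) < 5e-4  # batch 9500
assert abs(combined_loss(0.3029, 0.0661) - 0.2176) < 5e-4   # batch 10000
```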
limit=15.0 2023-10-10 17:20:54,705 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=429753.3333333333, ans=0.125 2023-10-10 17:20:58,839 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=429753.3333333333, ans=0.025 2023-10-10 17:21:04,518 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.31 vs. limit=15.0 2023-10-10 17:21:09,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=429800.0, ans=0.125 2023-10-10 17:21:18,072 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=429846.6666666667, ans=0.1 2023-10-10 17:21:19,050 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=429846.6666666667, ans=0.125 2023-10-10 17:21:22,858 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.57 vs. limit=15.0 2023-10-10 17:21:23,658 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=429893.3333333333, ans=0.95 2023-10-10 17:21:24,900 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.76 vs. limit=15.0 2023-10-10 17:21:29,964 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=429893.3333333333, ans=0.0 2023-10-10 17:21:54,252 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.88 vs. 
limit=10.0 2023-10-10 17:22:09,132 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=430033.3333333333, ans=0.1 2023-10-10 17:22:35,572 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=430126.6666666667, ans=0.125 2023-10-10 17:22:37,776 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.731e+02 1.900e+02 2.200e+02 3.020e+02, threshold=3.800e+02, percent-clipped=0.0 2023-10-10 17:22:55,032 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=430220.0, ans=0.2 2023-10-10 17:22:59,897 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=430220.0, ans=0.2 2023-10-10 17:23:09,538 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=430266.6666666667, ans=0.125 2023-10-10 17:23:11,095 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=430266.6666666667, ans=0.07 2023-10-10 17:23:12,366 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=430266.6666666667, ans=0.0 2023-10-10 17:23:20,086 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.60 vs. limit=15.0 2023-10-10 17:23:37,814 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.73 vs. limit=10.0 2023-10-10 17:23:42,046 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=430406.6666666667, ans=0.95 2023-10-10 17:23:55,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=430453.3333333333, ans=10.0 2023-10-10 17:24:01,611 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=430453.3333333333, ans=0.125 2023-10-10 17:24:14,856 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.51 vs. 
limit=15.0 2023-10-10 17:24:27,348 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=430546.6666666667, ans=0.125 2023-10-10 17:24:33,792 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=430593.3333333333, ans=0.09899494936611666 2023-10-10 17:24:46,413 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 1.712e+02 1.934e+02 2.378e+02 3.327e+02, threshold=3.868e+02, percent-clipped=0.0 2023-10-10 17:24:50,804 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=430640.0, ans=0.125 2023-10-10 17:25:02,929 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=430686.6666666667, ans=0.125 2023-10-10 17:25:06,795 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.01 vs. limit=15.0 2023-10-10 17:25:18,203 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=430733.3333333333, ans=0.09899494936611666 2023-10-10 17:25:23,299 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=430780.0, ans=0.2 2023-10-10 17:25:26,183 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=430780.0, ans=0.125 2023-10-10 17:25:27,975 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=430780.0, ans=0.025 2023-10-10 17:25:34,806 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.65 vs. limit=15.0 2023-10-10 17:25:41,788 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=430826.6666666667, ans=0.1 2023-10-10 17:26:18,928 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.84 vs. limit=22.5 2023-10-10 17:26:19,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=431013.3333333333, ans=0.125 2023-10-10 17:26:22,834 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=431013.3333333333, ans=0.0 2023-10-10 17:26:43,628 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=431060.0, ans=0.0 2023-10-10 17:26:49,685 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.344e+02 1.678e+02 1.918e+02 2.276e+02 3.202e+02, threshold=3.836e+02, percent-clipped=0.0 2023-10-10 17:26:51,758 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=431106.6666666667, ans=0.0 2023-10-10 17:26:57,340 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.36 vs. 
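The fractional `batch_count` values (424293.3333..., advancing in steps of 46.666... between entries) indicate that the schedules are keyed on a duration-normalized step rather than the raw batch index. With 4 ranks, a per-step increment of about 4.667 is consistent with a normalization like world_size * max_duration / ref_duration = 4 * 700 / 600; this is an inference from the logged increments, sketched below with hypothetical parameter names:

```python
def adjusted_batch_count(batch_idx_train: int, world_size: int = 4,
                         max_duration: float = 700.0,
                         ref_duration: float = 600.0) -> float:
    """Duration-normalized step assumed to key the ScheduledFloat entries.

    Each optimizer step advances the count by
    world_size * max_duration / ref_duration ~= 4.667, matching the
    logged jumps of 46.666... per ten steps (424480.0 -> 424526.666...).
    """
    return batch_idx_train * world_size * max_duration / ref_duration
```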
limit=22.5 2023-10-10 17:26:59,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=431153.3333333333, ans=0.125 2023-10-10 17:27:03,463 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=431153.3333333333, ans=0.0 2023-10-10 17:27:19,634 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=5.16 vs. limit=12.0 2023-10-10 17:27:22,002 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=431246.6666666667, ans=0.125 2023-10-10 17:27:44,621 INFO [train.py:1031] (0/4) Epoch 7, batch 10500, loss[loss=0.2265, simple_loss=0.3097, pruned_loss=0.07171, over 16495.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.3032, pruned_loss=0.06616, over 32570193.93 frames. ], batch size: 266, lr: 5.19e-03, grad_scale: 16.0 2023-10-10 17:27:49,701 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=431340.0, ans=0.125 2023-10-10 17:27:51,073 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=11.35 vs. limit=12.0 2023-10-10 17:27:52,028 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.57 vs. limit=15.0 2023-10-10 17:28:03,302 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.01 vs. limit=15.0 2023-10-10 17:28:23,517 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=431480.0, ans=0.125 2023-10-10 17:28:25,163 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=431480.0, ans=0.125 2023-10-10 17:28:30,397 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=431480.0, ans=0.1 2023-10-10 17:28:31,134 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=431526.6666666667, ans=0.0 2023-10-10 17:28:46,317 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.71 vs. limit=15.0 2023-10-10 17:28:46,889 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=431573.3333333333, ans=0.125 2023-10-10 17:28:49,048 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.695e+02 1.853e+02 2.056e+02 3.284e+02, threshold=3.706e+02, percent-clipped=0.0 2023-10-10 17:28:53,553 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=431573.3333333333, ans=0.125 2023-10-10 17:28:58,224 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=431620.0, ans=0.5 2023-10-10 17:29:09,928 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.23 vs. 
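The `grad_scale` field in the progress lines is the dynamic fp16 loss scale, which halves when a step produces inf/nan gradients and grows back after a streak of clean steps; that is why it oscillates between 32.0 and 16.0 across batches 10000 to 12000. A sketch using the stock PyTorch scaler (the specific settings are illustrative, not the recipe's):

```python
import torch

# Halve on overflow, double back after `growth_interval` finite steps:
scaler = torch.cuda.amp.GradScaler(init_scale=32.0,
                                   growth_factor=2.0,
                                   backoff_factor=0.5,
                                   growth_interval=2000)

def training_step(model, optimizer, batch, loss_fn):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        loss = loss_fn(model, batch)
    scaler.scale(loss).backward()
    scaler.step(optimizer)   # step is skipped (and scale halved) on inf/nan
    scaler.update()
    return loss.detach(), scaler.get_scale()  # get_scale() == the logged value
```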
limit=15.0 2023-10-10 17:29:23,614 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=431666.6666666667, ans=0.2 2023-10-10 17:29:27,846 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=431713.3333333333, ans=0.0 2023-10-10 17:29:36,715 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.89 vs. limit=22.5 2023-10-10 17:30:07,451 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=431853.3333333333, ans=0.0 2023-10-10 17:30:20,998 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=431900.0, ans=0.07 2023-10-10 17:30:21,015 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=431900.0, ans=0.125 2023-10-10 17:30:45,007 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=431993.3333333333, ans=0.1 2023-10-10 17:30:45,871 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=431993.3333333333, ans=0.2 2023-10-10 17:30:51,905 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.451e+02 1.772e+02 1.967e+02 2.211e+02 3.232e+02, threshold=3.933e+02, percent-clipped=0.0 2023-10-10 17:31:05,354 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=432086.6666666667, ans=0.0 2023-10-10 17:31:07,193 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=432086.6666666667, ans=0.0 2023-10-10 17:31:10,295 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=432133.3333333333, ans=0.1 2023-10-10 17:31:10,564 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.03 vs. limit=22.5 2023-10-10 17:31:24,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=432180.0, ans=0.2 2023-10-10 17:31:46,135 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.35 vs. limit=15.0 2023-10-10 17:31:58,130 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=432273.3333333333, ans=0.0 2023-10-10 17:32:27,464 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=432413.3333333333, ans=0.2 2023-10-10 17:32:56,151 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=432506.6666666667, ans=0.1 2023-10-10 17:32:58,173 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=432506.6666666667, ans=0.1 2023-10-10 17:32:58,410 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.79 vs. 
limit=15.0 2023-10-10 17:32:58,733 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.729e+02 1.894e+02 2.136e+02 3.132e+02, threshold=3.788e+02, percent-clipped=0.0 2023-10-10 17:33:21,927 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=432600.0, ans=0.0 2023-10-10 17:33:22,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=432600.0, ans=0.125 2023-10-10 17:33:24,821 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=432600.0, ans=0.0 2023-10-10 17:33:24,833 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=432600.0, ans=0.125 2023-10-10 17:33:25,059 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.46 vs. limit=12.0 2023-10-10 17:33:33,862 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=432646.6666666667, ans=0.125 2023-10-10 17:33:33,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=432646.6666666667, ans=0.1 2023-10-10 17:33:53,139 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_ff3.min_abs, batch_count=432740.0, ans=0.2 2023-10-10 17:33:59,858 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=432786.6666666667, ans=0.125 2023-10-10 17:34:41,104 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=432926.6666666667, ans=0.125 2023-10-10 17:34:48,299 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=432973.3333333333, ans=0.1 2023-10-10 17:34:49,799 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.320e+02 1.740e+02 1.912e+02 2.264e+02 2.839e+02, threshold=3.825e+02, percent-clipped=0.0 2023-10-10 17:34:55,693 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=433020.0, ans=0.125 2023-10-10 17:35:13,306 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=433066.6666666667, ans=0.125 2023-10-10 17:35:51,822 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.64 vs. 
limit=10.0 2023-10-10 17:35:53,558 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=433253.3333333333, ans=0.0 2023-10-10 17:36:13,337 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=433300.0, ans=0.0 2023-10-10 17:36:23,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=433346.6666666667, ans=0.09899494936611666 2023-10-10 17:36:34,946 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=433393.3333333333, ans=0.0 2023-10-10 17:36:42,324 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.390e+02 1.641e+02 1.793e+02 2.027e+02 3.111e+02, threshold=3.587e+02, percent-clipped=0.0 2023-10-10 17:37:13,933 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=433533.3333333333, ans=0.0 2023-10-10 17:37:13,990 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=433533.3333333333, ans=0.2 2023-10-10 17:37:40,397 INFO [train.py:1031] (0/4) Epoch 7, batch 11000, loss[loss=0.2245, simple_loss=0.3108, pruned_loss=0.06914, over 16935.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.3032, pruned_loss=0.06608, over 32627249.90 frames. ], batch size: 77, lr: 5.17e-03, grad_scale: 32.0 2023-10-10 17:37:42,566 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.98 vs. limit=15.0 2023-10-10 17:37:48,021 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=433673.3333333333, ans=0.125 2023-10-10 17:37:56,129 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.64 vs. limit=15.0 2023-10-10 17:38:03,804 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=433766.6666666667, ans=0.1 2023-10-10 17:38:26,104 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.87 vs. limit=10.0 2023-10-10 17:38:36,241 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=433906.6666666667, ans=0.0 2023-10-10 17:38:41,068 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.417e+02 1.731e+02 1.990e+02 2.250e+02 3.777e+02, threshold=3.979e+02, percent-clipped=1.0 2023-10-10 17:39:04,543 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=434000.0, ans=0.125 2023-10-10 17:39:07,024 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=434000.0, ans=0.025 2023-10-10 17:39:08,377 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.12 vs. 
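The many `balancer...` fields in the ScheduledFloat entries (`prob`, `min_positive`/`max_positive`, `min_abs`/`max_abs`) describe constraints on per-channel activation statistics, such as the fraction of positive values or the mean absolute value, enforced by correcting gradients on a random `prob` fraction of steps; the bounds themselves are scheduled, which is why they appear in these logs. A heavily simplified sketch of the gradient-side correction (the real module is considerably more involved):

```python
import random
import torch

def balancer_hook(min_positive: float = 0.05, max_positive: float = 0.95,
                  prob: float = 0.125, scale: float = 0.01):
    """Nudge channels whose fraction-positive statistic leaves [min, max].

    Simplified sketch of a balancer with the logged fields: it fires on a
    random `prob` fraction of backward passes and biases the gradient so
    offending channels drift back into range under gradient descent.
    """
    def hook(grad: torch.Tensor, activations: torch.Tensor) -> torch.Tensor:
        if random.random() > prob:
            return grad
        frac_pos = (activations > 0).float().mean(dim=0)   # per channel
        too_neg = (frac_pos < min_positive).float()
        too_pos = (frac_pos > max_positive).float()
        # Subtracting from the gradient raises the activation after the
        # update (and vice versa), pulling each channel back into range.
        return grad - scale * grad.abs().mean() * (too_neg - too_pos)
    return hook
```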
limit=15.0 2023-10-10 17:39:27,031 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=434093.3333333333, ans=0.125 2023-10-10 17:39:31,294 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=434093.3333333333, ans=0.125 2023-10-10 17:39:44,561 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=434140.0, ans=0.125 2023-10-10 17:39:50,367 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=434186.6666666667, ans=0.0 2023-10-10 17:40:00,570 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=434186.6666666667, ans=0.1 2023-10-10 17:40:00,671 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.15 vs. limit=15.0 2023-10-10 17:40:07,659 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=434233.3333333333, ans=0.125 2023-10-10 17:40:13,374 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=434233.3333333333, ans=0.125 2023-10-10 17:40:26,816 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=434280.0, ans=0.125 2023-10-10 17:40:38,573 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=434326.6666666667, ans=0.1 2023-10-10 17:40:46,778 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.341e+02 1.601e+02 1.809e+02 2.087e+02 3.050e+02, threshold=3.619e+02, percent-clipped=0.0 2023-10-10 17:40:55,231 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=434420.0, ans=0.125 2023-10-10 17:40:57,308 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=434420.0, ans=0.2 2023-10-10 17:41:09,737 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.94 vs. limit=15.0 2023-10-10 17:41:12,932 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 17:41:19,646 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=434513.3333333333, ans=0.1 2023-10-10 17:41:20,477 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=434513.3333333333, ans=0.125 2023-10-10 17:41:24,881 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.54 vs. limit=22.5 2023-10-10 17:42:09,787 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.25 vs. 
limit=12.0 2023-10-10 17:42:11,160 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=434746.6666666667, ans=0.1 2023-10-10 17:42:11,187 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 17:42:13,452 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=434746.6666666667, ans=0.2 2023-10-10 17:42:21,694 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=434793.3333333333, ans=0.1 2023-10-10 17:42:25,911 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.17 vs. limit=15.0 2023-10-10 17:42:37,048 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.314e+02 1.680e+02 1.842e+02 2.036e+02 2.579e+02, threshold=3.683e+02, percent-clipped=0.0 2023-10-10 17:42:38,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=434840.0, ans=0.125 2023-10-10 17:42:40,779 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=434840.0, ans=0.2 2023-10-10 17:42:47,215 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=434886.6666666667, ans=0.0 2023-10-10 17:42:58,820 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=434933.3333333333, ans=0.1 2023-10-10 17:43:05,096 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=434933.3333333333, ans=0.0 2023-10-10 17:43:13,014 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.10 vs. limit=22.5 2023-10-10 17:43:16,381 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=434980.0, ans=0.05 2023-10-10 17:43:42,255 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.73 vs. 
limit=6.0 2023-10-10 17:43:51,585 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=435120.0, ans=0.0 2023-10-10 17:44:16,222 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=435213.3333333333, ans=0.035 2023-10-10 17:44:39,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=435306.6666666667, ans=0.0 2023-10-10 17:44:44,776 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.706e+02 1.831e+02 2.217e+02 3.339e+02, threshold=3.663e+02, percent-clipped=0.0 2023-10-10 17:44:58,676 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=435353.3333333333, ans=0.125 2023-10-10 17:45:14,660 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=435446.6666666667, ans=0.125 2023-10-10 17:45:20,920 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 17:45:35,920 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=435540.0, ans=0.0 2023-10-10 17:45:54,735 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=435586.6666666667, ans=0.05 2023-10-10 17:45:55,516 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=435586.6666666667, ans=0.0 2023-10-10 17:45:58,630 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=435633.3333333333, ans=0.1 2023-10-10 17:46:07,971 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=435633.3333333333, ans=0.0 2023-10-10 17:46:08,098 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=435633.3333333333, ans=0.2 2023-10-10 17:46:27,282 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.87 vs. 
limit=15.0 2023-10-10 17:46:41,096 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=435773.3333333333, ans=0.125 2023-10-10 17:46:44,455 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.827e+02 2.026e+02 2.323e+02 3.333e+02, threshold=4.051e+02, percent-clipped=0.0 2023-10-10 17:46:55,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=435820.0, ans=0.0 2023-10-10 17:47:00,642 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=435866.6666666667, ans=0.0 2023-10-10 17:47:06,860 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=435866.6666666667, ans=0.0 2023-10-10 17:47:10,426 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=435913.3333333333, ans=0.125 2023-10-10 17:47:19,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=435913.3333333333, ans=0.125 2023-10-10 17:47:25,278 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=435960.0, ans=0.0 2023-10-10 17:47:36,614 INFO [train.py:1031] (0/4) Epoch 7, batch 11500, loss[loss=0.2037, simple_loss=0.3029, pruned_loss=0.05226, over 16953.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.3028, pruned_loss=0.06581, over 32662576.61 frames. ], batch size: 93, lr: 5.16e-03, grad_scale: 16.0 2023-10-10 17:47:58,472 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=436053.3333333333, ans=0.125 2023-10-10 17:48:05,649 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=436100.0, ans=0.125 2023-10-10 17:48:08,273 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=436100.0, ans=0.1 2023-10-10 17:48:43,166 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.402e+02 1.767e+02 1.974e+02 2.161e+02 3.123e+02, threshold=3.948e+02, percent-clipped=0.0 2023-10-10 17:48:55,471 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.34 vs. 
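The learning rate in the progress lines drifts down slowly and smoothly within the epoch (5.23e-03 at batch 9000 to 5.15e-03 at batch 12000), consistent with a scheduler that decays as a power law in both the global step and the epoch, in the style of icefall's Eden scheduler. A sketch, with the formula assumed from that recipe and warm-up ignored:

```python
def eden_lr(base_lr: float, step: int, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 1.0) -> float:
    """Eden-style schedule (formula assumed, warm-up omitted):

        lr = base_lr
             * ((step**2  + lr_batches**2) / lr_batches**2) ** -0.25
             * ((epoch**2 + lr_epochs**2)  / lr_epochs**2)  ** -0.25

    Asymptotically ~ step**-0.5 * epoch**-0.5, i.e. a very gentle decay at
    this point in training, matching the 5.23e-03 -> 5.15e-03 drift above.
    """
    return (base_lr
            * ((step ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
            * ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25)
```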
limit=6.0 2023-10-10 17:49:06,696 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=436333.3333333333, ans=0.2 2023-10-10 17:49:38,117 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=436426.6666666667, ans=0.125 2023-10-10 17:49:57,387 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=436520.0, ans=0.125 2023-10-10 17:50:22,900 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=436613.3333333333, ans=0.1 2023-10-10 17:50:46,268 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.218e+02 1.611e+02 1.788e+02 2.012e+02 3.156e+02, threshold=3.576e+02, percent-clipped=0.0 2023-10-10 17:50:55,483 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.min_positive, batch_count=436753.3333333333, ans=0.025 2023-10-10 17:50:55,517 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=436753.3333333333, ans=0.125 2023-10-10 17:51:03,033 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=436800.0, ans=0.0 2023-10-10 17:51:07,894 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.541e-03 2023-10-10 17:51:08,053 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.90 vs. limit=15.0 2023-10-10 17:51:17,964 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=436846.6666666667, ans=0.1 2023-10-10 17:51:30,734 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 17:51:58,479 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=436986.6666666667, ans=0.125 2023-10-10 17:52:03,859 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=437033.3333333333, ans=0.1 2023-10-10 17:52:04,030 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.50 vs. 
limit=22.5 2023-10-10 17:52:07,230 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=437033.3333333333, ans=0.125 2023-10-10 17:52:10,088 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=437033.3333333333, ans=10.0 2023-10-10 17:52:17,005 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=437080.0, ans=0.125 2023-10-10 17:52:25,433 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=437126.6666666667, ans=0.125 2023-10-10 17:52:45,887 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.305e+02 1.685e+02 1.821e+02 2.047e+02 3.340e+02, threshold=3.643e+02, percent-clipped=0.0 2023-10-10 17:53:02,879 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=437220.0, ans=0.015 2023-10-10 17:53:05,569 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 17:53:09,846 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=437266.6666666667, ans=0.0 2023-10-10 17:53:24,198 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=437313.3333333333, ans=0.125 2023-10-10 17:53:35,520 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=437360.0, ans=0.125 2023-10-10 17:53:41,503 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=437360.0, ans=0.0 2023-10-10 17:53:51,120 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=437406.6666666667, ans=0.125 2023-10-10 17:53:58,152 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.27 vs. limit=15.0 2023-10-10 17:53:59,705 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=437406.6666666667, ans=0.2 2023-10-10 17:54:00,775 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.72 vs. limit=10.0 2023-10-10 17:54:08,197 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=437453.3333333333, ans=0.0 2023-10-10 17:54:17,510 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.22 vs. 
limit=22.5 2023-10-10 17:54:26,085 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=437546.6666666667, ans=0.125 2023-10-10 17:54:29,968 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=437546.6666666667, ans=0.0 2023-10-10 17:54:34,010 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=437546.6666666667, ans=10.0 2023-10-10 17:54:35,310 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.47 vs. limit=15.0 2023-10-10 17:54:46,318 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.47 vs. limit=15.0 2023-10-10 17:54:53,118 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 17:54:55,891 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.56 vs. limit=15.0 2023-10-10 17:54:57,506 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.350e+02 1.650e+02 1.874e+02 2.186e+02 3.784e+02, threshold=3.747e+02, percent-clipped=2.0 2023-10-10 17:55:14,046 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=437686.6666666667, ans=0.0 2023-10-10 17:55:20,545 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.33 vs. limit=15.0 2023-10-10 17:55:43,558 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=437826.6666666667, ans=0.0 2023-10-10 17:55:58,006 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.10 vs. limit=15.0 2023-10-10 17:56:00,322 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=437873.3333333333, ans=0.0 2023-10-10 17:56:06,974 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=437920.0, ans=0.125 2023-10-10 17:56:07,947 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=437920.0, ans=0.5 2023-10-10 17:56:18,315 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=437920.0, ans=0.125 2023-10-10 17:56:18,663 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.35 vs. 
limit=15.0 2023-10-10 17:56:48,906 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=438060.0, ans=0.125 2023-10-10 17:56:48,970 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=438060.0, ans=0.125 2023-10-10 17:56:50,821 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=438060.0, ans=0.125 2023-10-10 17:56:54,651 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=438060.0, ans=0.125 2023-10-10 17:56:58,011 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=438106.6666666667, ans=0.0 2023-10-10 17:56:58,957 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=438106.6666666667, ans=0.0 2023-10-10 17:56:59,945 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=438106.6666666667, ans=0.1 2023-10-10 17:57:04,617 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.386e+02 1.724e+02 1.886e+02 2.174e+02 3.195e+02, threshold=3.773e+02, percent-clipped=0.0 2023-10-10 17:57:21,459 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=438153.3333333333, ans=0.0 2023-10-10 17:57:43,918 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.62 vs. limit=15.0 2023-10-10 17:57:58,294 INFO [train.py:1031] (0/4) Epoch 7, batch 12000, loss[loss=0.2152, simple_loss=0.3061, pruned_loss=0.06211, over 16887.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.303, pruned_loss=0.0656, over 32718320.12 frames. ], batch size: 116, lr: 5.15e-03, grad_scale: 32.0 2023-10-10 17:58:05,085 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 17:58:15,127 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=438386.6666666667, ans=0.125 2023-10-10 17:58:23,309 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=438433.3333333333, ans=0.0 2023-10-10 17:58:24,491 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.34 vs. 
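
The train.py:1031 lines summarize training progress. The logged components are numerically consistent with loss ≈ 0.5 · simple_loss + pruned_loss (for the batch-12000 entry above, 0.5 · 0.3061 + 0.06211 = 0.21516 ≈ 0.2152), and tot_loss[... over N frames] is a frame-weighted aggregate over recent batches. A sketch of both; the 0.5 weighting is read off the logged numbers, while the tracker class is an illustrative assumption:

```python
def combined_loss(simple_loss: float, pruned_loss: float,
                  simple_scale: float = 0.5) -> float:
    # weighting inferred from the logged triples, e.g.
    # 0.5 * 0.3061 + 0.06211 = 0.21516 ~ 0.2152 (batch 12000)
    # 0.5 * 0.2868 + 0.05330 = 0.19670 = 0.1967 (batch 13000)
    return simple_scale * simple_loss + pruned_loss

class RunningFrameLoss:
    """Frame-weighted running aggregate, like the tot_loss[...] entries."""
    def __init__(self) -> None:
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, loss: float, frames: float) -> None:
        self.loss_sum += loss * frames
        self.frames += frames

    @property
    def value(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)

tracker = RunningFrameLoss()
tracker.update(combined_loss(0.3061, 0.06211), 16887.0)
print(f"tot_loss={tracker.value:.4f} over {tracker.frames:.2f} frames")
```
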
limit=6.0 2023-10-10 17:58:33,439 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=438480.0, ans=0.0 2023-10-10 17:58:34,386 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=438480.0, ans=0.0 2023-10-10 17:58:38,930 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=438480.0, ans=0.125 2023-10-10 17:58:41,229 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=438480.0, ans=0.2 2023-10-10 17:58:45,286 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=438526.6666666667, ans=0.2 2023-10-10 17:58:55,288 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.32 vs. limit=15.0 2023-10-10 17:58:57,914 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=438573.3333333333, ans=0.125 2023-10-10 17:59:02,933 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.696e+02 1.899e+02 2.137e+02 3.679e+02, threshold=3.798e+02, percent-clipped=0.0 2023-10-10 17:59:06,810 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=438573.3333333333, ans=0.125 2023-10-10 17:59:20,209 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=438666.6666666667, ans=0.1 2023-10-10 17:59:22,461 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=438666.6666666667, ans=0.1 2023-10-10 17:59:26,431 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=438666.6666666667, ans=0.0 2023-10-10 17:59:30,180 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=438666.6666666667, ans=0.2 2023-10-10 17:59:40,904 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=438713.3333333333, ans=0.0 2023-10-10 17:59:46,394 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=438760.0, ans=0.0 2023-10-10 17:59:56,074 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.87 vs. 
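
The many *_skip_rate and bypass.* entries (attention_skip_rate, conv_skip_rate, ff2_skip_rate, bypass.scale_min, bypass_mid.scale_min) describe stochastic depth: sub-modules are skipped with a scheduled probability during training, and a bypass connection mixes the module output into the residual stream with a bounded scale. Most skip rates have annealed to 0.0 by this point in the run. A hedged sketch of one skip-plus-bypass step; the mixing rule and default values are assumptions:

```python
# Illustrative stochastic-skip with a clamped bypass scale. scale_min=0.2
# mirrors the logged bypass.scale_min values; the interpolation rule is an
# assumption, not zipformer's exact Bypass module.
from typing import Optional
import torch

def bypass_step(x: torch.Tensor, module: torch.nn.Module, skip_rate: float,
                scale_min: float = 0.2, scale: Optional[torch.Tensor] = None,
                training: bool = True) -> torch.Tensor:
    if training and torch.rand(()).item() < skip_rate:
        return x                                  # module skipped this step
    if scale is None:
        scale = torch.full((x.shape[-1],), 0.5)   # learned per-channel in practice
    scale = scale.clamp(min=scale_min, max=1.0)
    return (1.0 - scale) * x + scale * module(x)

layer = torch.nn.Linear(256, 256)
y = bypass_step(torch.randn(10, 256), layer, skip_rate=0.0)
```
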
limit=10.0 2023-10-10 18:00:00,370 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=438806.6666666667, ans=0.0 2023-10-10 18:00:08,908 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=438853.3333333333, ans=0.125 2023-10-10 18:00:32,494 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=438946.6666666667, ans=0.125 2023-10-10 18:00:38,649 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=438946.6666666667, ans=0.2 2023-10-10 18:00:51,588 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=438993.3333333333, ans=0.05 2023-10-10 18:00:51,690 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=438993.3333333333, ans=0.125 2023-10-10 18:01:02,986 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.660e+02 1.811e+02 1.988e+02 2.780e+02, threshold=3.623e+02, percent-clipped=0.0 2023-10-10 18:01:05,341 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=439040.0, ans=0.125 2023-10-10 18:01:26,273 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=439133.3333333333, ans=0.125 2023-10-10 18:01:49,521 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=439226.6666666667, ans=0.1 2023-10-10 18:02:01,883 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=439273.3333333333, ans=10.0 2023-10-10 18:02:05,980 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=439320.0, ans=0.0 2023-10-10 18:02:14,675 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.07 vs. 
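
The optim.py:471 lines track adaptive gradient clipping. The five numbers are order statistics of recent gradient norms (min, 25%, median, 75%, max), and in every entry here the threshold equals Clipping_scale times the median: in the 18:01:02 entry above, 2.0 × 1.811e+02 ≈ 3.623e+02. percent-clipped is the share of recent steps whose norm exceeded the threshold. A sketch of that rule; the history-window size is an assumption:

```python
# Clip to clipping_scale * median of recent global grad norms, matching the
# threshold/median relationship visible in the logged quartile lines.
from collections import deque
import statistics
import torch

class MedianGradClipper:
    def __init__(self, clipping_scale: float = 2.0, window: int = 1024):
        self.scale = clipping_scale
        self.history = deque(maxlen=window)
        self.clipped = deque(maxlen=window)

    def __call__(self, parameters) -> float:
        grads = [p.grad for p in parameters if p.grad is not None]
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)).item()
        self.history.append(norm)
        threshold = self.scale * statistics.median(self.history)
        self.clipped.append(norm > threshold)
        if norm > threshold:
            for g in grads:
                g.mul_(threshold / norm)
        return threshold  # percent-clipped ~ 100 * mean(self.clipped)
```
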
limit=15.0 2023-10-10 18:02:21,829 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=439366.6666666667, ans=0.0 2023-10-10 18:02:48,376 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=439460.0, ans=0.125 2023-10-10 18:02:54,672 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=439506.6666666667, ans=0.1 2023-10-10 18:02:58,897 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=439506.6666666667, ans=0.0 2023-10-10 18:03:00,636 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.357e+02 1.749e+02 1.942e+02 2.153e+02 3.165e+02, threshold=3.884e+02, percent-clipped=0.0 2023-10-10 18:03:19,125 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=439600.0, ans=0.125 2023-10-10 18:03:37,634 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=439646.6666666667, ans=0.125 2023-10-10 18:03:45,646 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=439693.3333333333, ans=0.125 2023-10-10 18:03:54,763 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.38 vs. limit=15.0 2023-10-10 18:04:10,638 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=439786.6666666667, ans=0.125 2023-10-10 18:04:20,929 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=439833.3333333333, ans=0.1 2023-10-10 18:04:27,460 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.94 vs. limit=15.0 2023-10-10 18:04:28,130 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=439833.3333333333, ans=0.0 2023-10-10 18:04:59,627 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=439973.3333333333, ans=0.125 2023-10-10 18:05:00,237 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.438e+02 1.780e+02 2.007e+02 2.313e+02 4.273e+02, threshold=4.014e+02, percent-clipped=1.0 2023-10-10 18:05:28,611 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.65 vs. 
limit=12.0 2023-10-10 18:05:54,384 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=440160.0, ans=0.2 2023-10-10 18:05:56,951 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=440206.6666666667, ans=0.0 2023-10-10 18:06:06,635 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=440206.6666666667, ans=0.0 2023-10-10 18:06:14,398 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=440253.3333333333, ans=0.125 2023-10-10 18:06:18,231 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=440253.3333333333, ans=0.125 2023-10-10 18:06:31,756 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=440346.6666666667, ans=0.0 2023-10-10 18:06:41,460 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=440346.6666666667, ans=0.09899494936611666 2023-10-10 18:07:02,825 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.807e+02 2.088e+02 2.424e+02 3.506e+02, threshold=4.177e+02, percent-clipped=0.0 2023-10-10 18:07:04,957 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=440440.0, ans=0.125 2023-10-10 18:07:08,811 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=440486.6666666667, ans=0.125 2023-10-10 18:07:10,084 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.70 vs. limit=12.0 2023-10-10 18:07:36,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=440580.0, ans=0.125 2023-10-10 18:07:57,620 INFO [train.py:1031] (0/4) Epoch 7, batch 12500, loss[loss=0.1999, simple_loss=0.2908, pruned_loss=0.05448, over 16802.00 frames. ], tot_loss[loss=0.217, simple_loss=0.3027, pruned_loss=0.06562, over 32709848.36 frames. ], batch size: 175, lr: 5.13e-03, grad_scale: 32.0 2023-10-10 18:08:09,650 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.80 vs. limit=15.0 2023-10-10 18:08:12,586 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.32 vs. limit=15.0 2023-10-10 18:08:14,323 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=440720.0, ans=0.2 2023-10-10 18:08:16,227 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=440720.0, ans=0.0 2023-10-10 18:08:19,992 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.49 vs. 
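
The scaling.py:979 Whitening lines compare a whiteness statistic of a module's activations against a limit (metric=X vs. limit=Y). One plausible statistic, used here purely for illustration, is the mean squared eigenvalue of the feature covariance divided by the squared mean eigenvalue: exactly 1.0 for perfectly white features, larger when a few directions dominate. Whether icefall computes exactly this is an assumption:

```python
# Hedged whiteness metric: n * trace(C^2) / trace(C)^2 over the feature
# covariance C, which is 1.0 iff C is proportional to the identity.
import torch

def whitening_metric(x: torch.Tensor) -> float:
    # x: (num_frames, num_channels) activations
    x = x - x.mean(dim=0)
    cov = (x.T @ x) / x.shape[0]
    n = cov.shape[0]
    return (n * torch.trace(cov @ cov) / torch.trace(cov) ** 2).item()

feats = torch.randn(1000, 384)                      # near-white input
print(whitening_metric(feats))                      # close to 1.0
skewed = feats * torch.linspace(0.1, 3.0, 384)      # anisotropic channels
print(whitening_metric(skewed), "vs. limit=15.0")   # noticeably larger
```
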
limit=15.0 2023-10-10 18:08:34,625 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=440813.3333333333, ans=0.025 2023-10-10 18:08:47,999 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.57 vs. limit=12.0 2023-10-10 18:08:51,450 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=440860.0, ans=0.125 2023-10-10 18:08:54,940 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=440906.6666666667, ans=0.0 2023-10-10 18:08:59,649 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.683e+02 1.854e+02 2.094e+02 3.201e+02, threshold=3.708e+02, percent-clipped=0.0 2023-10-10 18:09:19,845 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 18:09:26,052 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-10 18:09:43,505 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=441093.3333333333, ans=0.125 2023-10-10 18:09:43,799 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.71 vs. limit=15.0 2023-10-10 18:09:49,785 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=441093.3333333333, ans=0.0 2023-10-10 18:10:04,476 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=441186.6666666667, ans=0.125 2023-10-10 18:10:11,438 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=441186.6666666667, ans=0.0 2023-10-10 18:10:27,457 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=441280.0, ans=0.125 2023-10-10 18:10:56,209 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.91 vs. limit=22.5 2023-10-10 18:10:56,824 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.374e+02 1.680e+02 1.860e+02 2.167e+02 3.640e+02, threshold=3.720e+02, percent-clipped=0.0 2023-10-10 18:11:15,693 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=441466.6666666667, ans=0.125 2023-10-10 18:11:31,276 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=441513.3333333333, ans=0.1 2023-10-10 18:11:46,436 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=441560.0, ans=0.1 2023-10-10 18:11:55,587 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.50 vs. 
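
The scaling.py:1069 WithLoss lines report an auxiliary loss attached to the attention weights of individual layers; a loss-sum of 0.000e+00, as in every entry in this stretch, reads as the attached penalty currently being zero. How the penalty is wired in is not shown in the log, so the autograd-function sketch below, which leaves the activations untouched and only injects a gradient plus a logged value, is an assumption:

```python
# Illustrative "attach a logged penalty to a tensor" hook. The penalty form
# (quadratic on values above 1.0) and the straight-through wiring are
# assumptions, not icefall's actual WithLoss code.
import torch

class AttachPenalty(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, weight):
        ctx.save_for_backward(x)
        ctx.weight = weight
        return x                                   # activations unchanged

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        excess = torch.relu(x - 1.0)               # penalize values above 1.0
        loss_sum = ctx.weight * (excess ** 2).mean()
        print(f"WithLoss sketch: loss-sum={loss_sum.item():.3e}")
        return grad_out + ctx.weight * 2.0 * excess / x.numel(), None

x = (torch.rand(4, 8) * 1.5).requires_grad_()
AttachPenalty.apply(x, 0.1).sum().backward()       # prints a nonzero loss-sum
```
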
limit=15.0 2023-10-10 18:12:16,508 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=441700.0, ans=0.0 2023-10-10 18:12:24,296 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.08 vs. limit=22.5 2023-10-10 18:12:31,813 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=1.87 vs. limit=15.0 2023-10-10 18:12:52,055 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.82 vs. limit=22.5 2023-10-10 18:12:57,296 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.301e+02 1.684e+02 1.935e+02 2.132e+02 3.537e+02, threshold=3.870e+02, percent-clipped=0.0 2023-10-10 18:13:07,877 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=441886.6666666667, ans=0.2 2023-10-10 18:13:11,636 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=441886.6666666667, ans=0.125 2023-10-10 18:13:37,986 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=442026.6666666667, ans=0.0 2023-10-10 18:14:10,535 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=442120.0, ans=0.125 2023-10-10 18:14:16,311 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=442166.6666666667, ans=0.0 2023-10-10 18:14:20,046 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=442166.6666666667, ans=0.125 2023-10-10 18:14:34,871 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.41 vs. 
limit=22.5 2023-10-10 18:14:58,632 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.746e+02 1.925e+02 2.149e+02 3.455e+02, threshold=3.850e+02, percent-clipped=0.0 2023-10-10 18:15:23,464 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=442400.0, ans=0.125 2023-10-10 18:15:25,900 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=442400.0, ans=0.125 2023-10-10 18:15:52,590 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=442540.0, ans=0.1 2023-10-10 18:16:00,476 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=442540.0, ans=0.0 2023-10-10 18:16:01,274 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=442540.0, ans=0.2 2023-10-10 18:16:41,169 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=442726.6666666667, ans=0.125 2023-10-10 18:16:43,528 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=442726.6666666667, ans=0.0 2023-10-10 18:17:00,448 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.385e+02 1.619e+02 1.806e+02 2.042e+02 3.289e+02, threshold=3.612e+02, percent-clipped=0.0 2023-10-10 18:17:09,369 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.22 vs. limit=15.0 2023-10-10 18:17:15,068 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 18:17:48,780 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.58 vs. limit=15.0 2023-10-10 18:17:48,794 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.96 vs. limit=22.5 2023-10-10 18:17:50,818 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=442960.0, ans=0.125 2023-10-10 18:17:52,590 INFO [train.py:1031] (0/4) Epoch 7, batch 13000, loss[loss=0.1967, simple_loss=0.2868, pruned_loss=0.0533, over 15946.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.303, pruned_loss=0.06561, over 32716450.09 frames. ], batch size: 35, lr: 5.12e-03, grad_scale: 16.0 2023-10-10 18:17:52,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=443006.6666666667, ans=0.0 2023-10-10 18:18:10,819 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.91 vs. limit=22.5 2023-10-10 18:18:33,074 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.85 vs. 
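
grad_scale in the train.py summaries is the fp16 loss-scaling factor, and it moves the way a dynamic scaler does: 32.0 at batches 12000 and 12500, halved to 16.0 by batch 13000 here, then back to 32.0 at batch 13500 below, consistent with a halve-on-overflow / grow-when-stable policy. The recipe's own scaler is not shown in the log; the standard PyTorch equivalent is sketched as a stand-in:

```python
import torch

scaler = torch.cuda.amp.GradScaler(init_scale=32.0)

# Inside a training step (model, optimizer, compute_loss assumed to exist):
#   with torch.cuda.amp.autocast():
#       loss = compute_loss(model, batch)
#   scaler.scale(loss).backward()
#   scaler.step(optimizer)   # skipped internally if grads overflowed
#   scaler.update()          # halves the scale on overflow, grows it again
#                            # after a run of stable steps
print(scaler.get_scale())    # -> 32.0 when CUDA is available
```
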
limit=15.0 2023-10-10 18:18:41,608 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=443146.6666666667, ans=0.05 2023-10-10 18:18:47,649 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=443193.3333333333, ans=0.2 2023-10-10 18:19:04,432 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.432e+02 1.719e+02 1.904e+02 2.119e+02 2.835e+02, threshold=3.807e+02, percent-clipped=0.0 2023-10-10 18:19:11,918 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.29 vs. limit=22.5 2023-10-10 18:19:19,675 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=443286.6666666667, ans=0.05 2023-10-10 18:19:44,784 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.17 vs. limit=15.0 2023-10-10 18:19:50,903 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=443426.6666666667, ans=0.1 2023-10-10 18:19:57,348 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.70 vs. limit=6.0 2023-10-10 18:20:09,415 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=443473.3333333333, ans=0.125 2023-10-10 18:20:12,967 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.37 vs. limit=15.0 2023-10-10 18:20:23,621 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=443566.6666666667, ans=0.2 2023-10-10 18:20:59,057 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.70 vs. limit=15.0 2023-10-10 18:21:05,976 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.783e+02 1.996e+02 2.275e+02 3.876e+02, threshold=3.992e+02, percent-clipped=1.0 2023-10-10 18:21:10,179 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=443706.6666666667, ans=0.0 2023-10-10 18:21:20,480 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.69 vs. 
limit=5.0 2023-10-10 18:21:25,181 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=443800.0, ans=0.2 2023-10-10 18:22:26,251 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=443986.6666666667, ans=0.0 2023-10-10 18:22:28,197 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=444033.3333333333, ans=0.125 2023-10-10 18:22:34,698 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_ff2.min_abs, batch_count=444033.3333333333, ans=0.1 2023-10-10 18:22:40,276 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=444080.0, ans=0.0 2023-10-10 18:22:42,905 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 18:22:54,347 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=444126.6666666667, ans=0.0 2023-10-10 18:23:08,555 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.218e+02 1.656e+02 1.837e+02 2.031e+02 2.706e+02, threshold=3.674e+02, percent-clipped=0.0 2023-10-10 18:23:08,773 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=444173.3333333333, ans=0.05 2023-10-10 18:23:21,858 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=444220.0, ans=0.125 2023-10-10 18:23:34,728 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=444266.6666666667, ans=10.0 2023-10-10 18:23:38,928 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=444313.3333333333, ans=0.1 2023-10-10 18:23:41,953 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=444313.3333333333, ans=0.0 2023-10-10 18:23:44,939 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=444313.3333333333, ans=0.125 2023-10-10 18:23:48,494 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=444360.0, ans=0.2 2023-10-10 18:23:50,545 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.98 vs. 
limit=15.0 2023-10-10 18:24:04,690 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=444406.6666666667, ans=0.125 2023-10-10 18:24:35,167 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=444546.6666666667, ans=0.09899494936611666 2023-10-10 18:24:40,355 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=444546.6666666667, ans=0.0 2023-10-10 18:24:43,830 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=444546.6666666667, ans=0.125 2023-10-10 18:24:44,703 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=444593.3333333333, ans=0.125 2023-10-10 18:24:47,595 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=444593.3333333333, ans=0.0 2023-10-10 18:24:47,789 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.49 vs. limit=6.0 2023-10-10 18:24:47,808 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.59 vs. limit=15.0 2023-10-10 18:24:48,364 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=444593.3333333333, ans=0.125 2023-10-10 18:24:48,397 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=444593.3333333333, ans=0.07 2023-10-10 18:24:51,501 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=444593.3333333333, ans=0.2 2023-10-10 18:25:04,688 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.753e+02 1.964e+02 2.230e+02 2.801e+02, threshold=3.928e+02, percent-clipped=0.0 2023-10-10 18:25:06,840 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=444640.0, ans=0.1 2023-10-10 18:25:19,866 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=444733.3333333333, ans=0.0 2023-10-10 18:25:23,354 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=444733.3333333333, ans=0.2 2023-10-10 18:25:33,231 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=444780.0, ans=0.125 2023-10-10 18:25:44,883 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.59 vs. limit=15.0 2023-10-10 18:25:54,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=444873.3333333333, ans=0.0 2023-10-10 18:25:59,349 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=444873.3333333333, ans=0.125 2023-10-10 18:26:23,173 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.38 vs. 
limit=22.5 2023-10-10 18:26:33,317 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.16 vs. limit=15.0 2023-10-10 18:26:34,394 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=445013.3333333333, ans=0.125 2023-10-10 18:26:45,886 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=445060.0, ans=0.035 2023-10-10 18:26:57,938 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.90 vs. limit=15.0 2023-10-10 18:27:04,128 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.746e+02 1.932e+02 2.346e+02 3.404e+02, threshold=3.864e+02, percent-clipped=0.0 2023-10-10 18:27:17,497 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.58 vs. limit=10.0 2023-10-10 18:27:19,700 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.75 vs. limit=6.0 2023-10-10 18:27:46,836 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=445293.3333333333, ans=0.0 2023-10-10 18:27:51,028 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=445293.3333333333, ans=0.1 2023-10-10 18:27:53,816 INFO [train.py:1031] (0/4) Epoch 7, batch 13500, loss[loss=0.202, simple_loss=0.2591, pruned_loss=0.07245, over 12471.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.3023, pruned_loss=0.06536, over 32723279.98 frames. 
], batch size: 440, lr: 5.11e-03, grad_scale: 32.0 2023-10-10 18:28:05,346 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=445386.6666666667, ans=0.125 2023-10-10 18:28:07,851 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=445386.6666666667, ans=0.125 2023-10-10 18:28:39,041 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=445526.6666666667, ans=0.125 2023-10-10 18:28:56,910 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.357e+02 1.709e+02 1.944e+02 2.257e+02 3.330e+02, threshold=3.888e+02, percent-clipped=0.0 2023-10-10 18:29:13,944 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=445666.6666666667, ans=0.125 2023-10-10 18:29:14,846 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=445666.6666666667, ans=0.0 2023-10-10 18:29:20,923 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 18:29:35,207 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=445760.0, ans=0.09899494936611666 2023-10-10 18:29:35,973 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=445760.0, ans=0.95 2023-10-10 18:29:40,078 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 18:29:51,281 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=445806.6666666667, ans=0.0 2023-10-10 18:30:37,002 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=445993.3333333333, ans=0.0 2023-10-10 18:30:45,859 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/epoch-7.pt 2023-10-10 18:31:21,193 INFO [train.py:1031] (0/4) Epoch 8, batch 0, loss[loss=0.2066, simple_loss=0.2977, pruned_loss=0.05775, over 16877.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.2977, pruned_loss=0.05775, over 16877.00 frames. ], batch size: 155, lr: 4.73e-03, grad_scale: 32.0 2023-10-10 18:31:21,194 INFO [train.py:1054] (0/4) Computing validation loss 2023-10-10 18:31:30,336 INFO [train.py:1063] (0/4) Epoch 8, validation: loss=0.2272, simple_loss=0.314, pruned_loss=0.07016, over 1020973.00 frames. 2023-10-10 18:31:30,336 INFO [train.py:1064] (0/4) Maximum memory allocated so far is 17165MB 2023-10-10 18:31:31,521 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.342e+02 1.738e+02 1.931e+02 2.283e+02 5.705e+02, threshold=3.861e+02, percent-clipped=1.0 2023-10-10 18:31:33,154 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.65 vs. limit=22.5 2023-10-10 18:31:45,483 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.46 vs. 
limit=22.5 2023-10-10 18:31:50,384 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=446110.0, ans=0.125 2023-10-10 18:32:26,401 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 18:32:26,479 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=446250.0, ans=0.0 2023-10-10 18:32:36,297 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=446296.6666666667, ans=0.125 2023-10-10 18:33:16,240 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=446436.6666666667, ans=0.2 2023-10-10 18:33:32,369 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.26 vs. limit=15.0 2023-10-10 18:33:32,677 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.379e+02 1.768e+02 2.068e+02 2.359e+02 4.985e+02, threshold=4.136e+02, percent-clipped=3.0 2023-10-10 18:33:39,814 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.08 vs. limit=15.0 2023-10-10 18:33:57,746 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=446623.3333333333, ans=0.0 2023-10-10 18:34:22,940 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=446716.6666666667, ans=0.125 2023-10-10 18:34:48,149 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=446810.0, ans=0.125 2023-10-10 18:34:58,199 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=446856.6666666667, ans=0.2 2023-10-10 18:35:10,140 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=446903.3333333333, ans=0.125 2023-10-10 18:35:11,034 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=446903.3333333333, ans=0.0 2023-10-10 18:35:23,058 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=446950.0, ans=0.0 2023-10-10 18:35:27,872 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.368e+02 1.782e+02 1.937e+02 2.179e+02 3.114e+02, threshold=3.875e+02, percent-clipped=0.0 2023-10-10 18:35:45,617 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=447043.3333333333, ans=0.1 2023-10-10 18:35:49,850 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=447090.0, ans=0.0 2023-10-10 18:36:28,473 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.13 vs. 
limit=22.5 2023-10-10 18:36:47,143 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=447276.6666666667, ans=0.2 2023-10-10 18:37:14,222 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=447370.0, ans=0.125 2023-10-10 18:37:24,052 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=447416.6666666667, ans=0.05 2023-10-10 18:37:26,207 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. limit=6.0 2023-10-10 18:37:26,741 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=447463.3333333333, ans=0.2 2023-10-10 18:37:28,095 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.358e+02 1.786e+02 1.930e+02 2.325e+02 3.073e+02, threshold=3.860e+02, percent-clipped=0.0 2023-10-10 18:37:31,669 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=447463.3333333333, ans=0.0 2023-10-10 18:38:03,102 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=447603.3333333333, ans=0.2 2023-10-10 18:38:06,493 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.63 vs. limit=22.5 2023-10-10 18:38:11,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=447603.3333333333, ans=0.125 2023-10-10 18:38:16,846 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=447650.0, ans=0.1 2023-10-10 18:38:23,984 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.21 vs. limit=15.0 2023-10-10 18:38:32,322 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=447696.6666666667, ans=0.125 2023-10-10 18:39:00,171 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=447836.6666666667, ans=0.125 2023-10-10 18:39:07,501 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=447836.6666666667, ans=0.0 2023-10-10 18:39:13,624 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=447883.3333333333, ans=0.125 2023-10-10 18:39:24,838 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.302e+02 1.727e+02 1.913e+02 2.087e+02 2.905e+02, threshold=3.825e+02, percent-clipped=0.0 2023-10-10 18:39:32,050 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.29 vs. 
limit=15.0 2023-10-10 18:39:32,705 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=447930.0, ans=0.2 2023-10-10 18:39:39,109 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-96000.pt 2023-10-10 18:40:07,114 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=448116.6666666667, ans=0.125 2023-10-10 18:40:28,108 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.41 vs. limit=10.0 2023-10-10 18:40:46,607 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=448256.6666666667, ans=0.0 2023-10-10 18:40:50,914 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=448256.6666666667, ans=0.0 2023-10-10 18:41:01,755 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=448303.3333333333, ans=0.09899494936611666 2023-10-10 18:41:09,251 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=448350.0, ans=0.125 2023-10-10 18:41:18,195 INFO [train.py:1031] (0/4) Epoch 8, batch 500, loss[loss=0.2039, simple_loss=0.297, pruned_loss=0.05536, over 16871.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.3014, pruned_loss=0.06465, over 7259067.17 frames. ], batch size: 104, lr: 4.72e-03, grad_scale: 16.0 2023-10-10 18:41:20,349 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.398e+02 1.702e+02 1.882e+02 2.126e+02 3.210e+02, threshold=3.763e+02, percent-clipped=0.0 2023-10-10 18:41:24,769 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.27 vs. limit=6.0 2023-10-10 18:41:27,620 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.25 vs. limit=22.5 2023-10-10 18:41:28,865 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=448443.3333333333, ans=0.125 2023-10-10 18:41:44,369 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=448490.0, ans=0.1 2023-10-10 18:42:08,760 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.13 vs. limit=15.0 2023-10-10 18:42:20,104 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=448630.0, ans=0.125 2023-10-10 18:42:27,265 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=448676.6666666667, ans=10.0 2023-10-10 18:42:47,668 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=448723.3333333333, ans=0.125 2023-10-10 18:42:57,544 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=448770.0, ans=0.125 2023-10-10 18:43:03,229 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.62 vs. 
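
Besides the per-epoch epoch-N.pt files, checkpoint.py also writes batch-count checkpoints like checkpoint-96000.pt above. The trigger is presumably a fixed save interval in batches; the interval below (8000, which divides 96000 evenly) and the helper name are assumptions for illustration:

```python
from typing import Optional

def maybe_checkpoint_path(batch_idx_train: int,
                          exp_dir: str = "zipformer/exp_XL_bpe",
                          save_every_n: int = 8000) -> Optional[str]:
    """Return a checkpoint path every `save_every_n` training batches."""
    if batch_idx_train > 0 and batch_idx_train % save_every_n == 0:
        return f"{exp_dir}/checkpoint-{batch_idx_train}.pt"
    return None

assert maybe_checkpoint_path(96000) == "zipformer/exp_XL_bpe/checkpoint-96000.pt"
assert maybe_checkpoint_path(96500) is None
```
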
limit=15.0 2023-10-10 18:43:12,840 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=448863.3333333333, ans=0.125 2023-10-10 18:43:14,468 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.331e+02 1.846e+02 2.141e+02 2.412e+02 3.601e+02, threshold=4.282e+02, percent-clipped=0.0 2023-10-10 18:43:15,787 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 18:44:52,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=449283.3333333333, ans=0.0 2023-10-10 18:45:03,825 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.805e+02 2.063e+02 2.309e+02 3.366e+02, threshold=4.127e+02, percent-clipped=0.0 2023-10-10 18:45:10,004 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=449330.0, ans=0.5 2023-10-10 18:45:37,947 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=449470.0, ans=0.125 2023-10-10 18:45:42,488 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=449470.0, ans=0.125 2023-10-10 18:46:11,660 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.26 vs. limit=15.0 2023-10-10 18:46:29,673 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-10-10 18:46:42,185 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=449703.3333333333, ans=0.125 2023-10-10 18:46:56,457 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=449750.0, ans=0.2 2023-10-10 18:47:01,083 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=449796.6666666667, ans=0.125 2023-10-10 18:47:04,262 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.402e+02 1.721e+02 1.886e+02 2.253e+02 3.352e+02, threshold=3.772e+02, percent-clipped=0.0 2023-10-10 18:47:08,890 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=449796.6666666667, ans=0.04949747468305833 2023-10-10 18:47:48,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=449936.6666666667, ans=0.0 2023-10-10 18:47:50,499 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.96 vs. limit=15.0 2023-10-10 18:48:04,293 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.33 vs. 
limit=15.0 2023-10-10 18:48:14,468 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=450030.0, ans=0.125 2023-10-10 18:48:17,898 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=450076.6666666667, ans=0.07 2023-10-10 18:48:20,748 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=450076.6666666667, ans=0.125 2023-10-10 18:48:52,822 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=450170.0, ans=0.0 2023-10-10 18:48:59,573 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=450216.6666666667, ans=0.07 2023-10-10 18:49:03,009 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 18:49:05,969 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.81 vs. limit=22.5 2023-10-10 18:49:12,286 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.258e+02 1.723e+02 1.977e+02 2.211e+02 3.266e+02, threshold=3.955e+02, percent-clipped=0.0 2023-10-10 18:49:13,410 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=450263.3333333333, ans=0.2 2023-10-10 18:49:26,051 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=450310.0, ans=0.2 2023-10-10 18:49:26,148 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=450310.0, ans=0.125 2023-10-10 18:49:28,560 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=450310.0, ans=0.0 2023-10-10 18:49:46,608 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=450403.3333333333, ans=0.0 2023-10-10 18:49:52,864 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=450403.3333333333, ans=0.125 2023-10-10 18:49:54,365 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.01 vs. limit=12.0 2023-10-10 18:50:06,057 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=450450.0, ans=10.0 2023-10-10 18:50:21,447 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=450496.6666666667, ans=0.1 2023-10-10 18:50:30,200 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=450543.3333333333, ans=0.0 2023-10-10 18:50:35,040 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=450590.0, ans=0.0 2023-10-10 18:51:04,025 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=450683.3333333333, ans=0.125 2023-10-10 18:51:10,192 INFO [train.py:1031] (0/4) Epoch 8, batch 1000, loss[loss=0.2016, simple_loss=0.2909, pruned_loss=0.05611, over 16960.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.3016, pruned_loss=0.06511, over 12893796.25 frames. 
], batch size: 72, lr: 4.71e-03, grad_scale: 32.0 2023-10-10 18:51:11,278 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=450730.0, ans=0.2 2023-10-10 18:51:12,039 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.379e+02 1.659e+02 1.827e+02 2.112e+02 3.229e+02, threshold=3.654e+02, percent-clipped=0.0 2023-10-10 18:51:30,931 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=450776.6666666667, ans=0.125 2023-10-10 18:51:31,782 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=450776.6666666667, ans=0.125 2023-10-10 18:51:40,696 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=450823.3333333333, ans=0.0 2023-10-10 18:52:03,559 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff3.min_abs, batch_count=450963.3333333333, ans=0.2 2023-10-10 18:52:03,592 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=450963.3333333333, ans=0.125 2023-10-10 18:52:08,895 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.59 vs. limit=15.0 2023-10-10 18:52:09,751 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=450963.3333333333, ans=0.07 2023-10-10 18:52:35,021 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=451056.6666666667, ans=0.035 2023-10-10 18:52:46,434 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.31 vs. limit=22.5 2023-10-10 18:52:47,961 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=451103.3333333333, ans=0.125 2023-10-10 18:52:59,540 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=451196.6666666667, ans=0.2 2023-10-10 18:53:00,975 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.351e+02 1.715e+02 2.001e+02 2.301e+02 3.343e+02, threshold=4.003e+02, percent-clipped=0.0 2023-10-10 18:53:07,541 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.35 vs. limit=15.0 2023-10-10 18:53:30,426 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=451290.0, ans=0.1 2023-10-10 18:53:34,289 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.81 vs. limit=15.0 2023-10-10 18:53:46,920 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=451336.6666666667, ans=0.09899494936611666 2023-10-10 18:53:48,879 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=451383.3333333333, ans=0.0 2023-10-10 18:53:54,668 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.15 vs. 
limit=22.5 2023-10-10 18:53:59,696 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=451383.3333333333, ans=0.0 2023-10-10 18:54:08,396 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=451430.0, ans=0.1 2023-10-10 18:54:18,822 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=451476.6666666667, ans=0.125 2023-10-10 18:54:20,151 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=451476.6666666667, ans=0.2 2023-10-10 18:54:33,545 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=451523.3333333333, ans=0.07 2023-10-10 18:54:38,788 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=451523.3333333333, ans=0.125 2023-10-10 18:54:44,455 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.64 vs. limit=22.5 2023-10-10 18:54:49,198 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=451570.0, ans=0.1 2023-10-10 18:55:07,674 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=451616.6666666667, ans=0.05 2023-10-10 18:55:11,050 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.353e+02 1.592e+02 1.748e+02 2.020e+02 3.161e+02, threshold=3.496e+02, percent-clipped=0.0 2023-10-10 18:55:39,260 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=451756.6666666667, ans=0.125 2023-10-10 18:55:54,846 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=451803.3333333333, ans=0.0 2023-10-10 18:56:07,331 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 18:56:09,451 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 18:56:19,734 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=451896.6666666667, ans=0.125 2023-10-10 18:57:10,555 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.251e+02 1.690e+02 1.889e+02 2.045e+02 2.714e+02, threshold=3.777e+02, percent-clipped=0.0 2023-10-10 18:57:13,081 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=452130.0, ans=0.0 2023-10-10 18:57:23,969 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=452176.6666666667, ans=0.125 2023-10-10 18:57:26,193 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=452176.6666666667, ans=0.125 2023-10-10 18:57:37,451 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=452223.3333333333, ans=0.0 2023-10-10 18:57:51,941 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=452270.0, 
ans=0.0 2023-10-10 18:58:06,534 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=452363.3333333333, ans=0.125 2023-10-10 18:58:33,446 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=452456.6666666667, ans=0.05 2023-10-10 18:58:56,249 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=452550.0, ans=0.2 2023-10-10 18:59:09,229 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=452596.6666666667, ans=0.125 2023-10-10 18:59:09,919 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.393e+02 1.683e+02 1.885e+02 2.096e+02 2.668e+02, threshold=3.770e+02, percent-clipped=0.0 2023-10-10 18:59:35,963 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=452690.0, ans=0.0 2023-10-10 18:59:43,701 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=452736.6666666667, ans=0.07 2023-10-10 18:59:52,157 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.83 vs. limit=6.0 2023-10-10 18:59:55,667 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=452783.3333333333, ans=0.1 2023-10-10 19:00:00,131 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 19:00:39,850 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=452923.3333333333, ans=0.0 2023-10-10 19:00:46,442 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=452970.0, ans=0.5 2023-10-10 19:00:47,255 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=452970.0, ans=0.125 2023-10-10 19:00:47,446 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.23 vs. limit=22.5 2023-10-10 19:00:52,236 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.57 vs. limit=15.0 2023-10-10 19:00:57,516 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=453016.6666666667, ans=0.125 2023-10-10 19:01:06,212 INFO [train.py:1031] (0/4) Epoch 8, batch 1500, loss[loss=0.2233, simple_loss=0.2983, pruned_loss=0.07416, over 16944.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.3002, pruned_loss=0.06409, over 17304652.72 frames. 
], batch size: 110, lr: 4.70e-03, grad_scale: 32.0 2023-10-10 19:01:07,949 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.393e+02 1.640e+02 1.833e+02 2.091e+02 2.968e+02, threshold=3.667e+02, percent-clipped=0.0 2023-10-10 19:01:16,285 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=453063.3333333333, ans=0.2 2023-10-10 19:01:30,171 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.33 vs. limit=15.0 2023-10-10 19:01:37,064 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.39 vs. limit=15.0 2023-10-10 19:01:37,880 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=453156.6666666667, ans=0.2 2023-10-10 19:01:58,167 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=453250.0, ans=0.125 2023-10-10 19:02:09,564 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=453296.6666666667, ans=0.125 2023-10-10 19:02:39,590 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=453390.0, ans=0.125 2023-10-10 19:02:42,289 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=453390.0, ans=0.125 2023-10-10 19:03:02,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=453483.3333333333, ans=0.125 2023-10-10 19:03:10,203 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.401e+02 1.760e+02 1.969e+02 2.327e+02 4.256e+02, threshold=3.939e+02, percent-clipped=4.0 2023-10-10 19:03:29,988 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=453623.3333333333, ans=0.0 2023-10-10 19:03:35,013 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.39 vs. limit=15.0 2023-10-10 19:03:47,553 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=4.05 vs. 
limit=12.0 2023-10-10 19:03:49,666 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=453670.0, ans=0.125 2023-10-10 19:03:56,691 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=453716.6666666667, ans=0.125 2023-10-10 19:04:07,477 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=453763.3333333333, ans=0.125 2023-10-10 19:04:13,707 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=453763.3333333333, ans=0.0 2023-10-10 19:04:24,877 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=453810.0, ans=0.0 2023-10-10 19:04:24,931 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=453810.0, ans=0.0 2023-10-10 19:04:30,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=453856.6666666667, ans=0.0 2023-10-10 19:04:41,116 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=453856.6666666667, ans=0.0 2023-10-10 19:04:48,611 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=453903.3333333333, ans=0.125 2023-10-10 19:04:55,614 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=453950.0, ans=0.1 2023-10-10 19:04:57,594 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.24 vs. limit=15.0 2023-10-10 19:05:01,553 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=453950.0, ans=0.2 2023-10-10 19:05:10,015 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.685e+02 1.921e+02 2.100e+02 3.249e+02, threshold=3.843e+02, percent-clipped=0.0 2023-10-10 19:05:34,365 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=454090.0, ans=0.125 2023-10-10 19:05:55,157 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=454183.3333333333, ans=0.125 2023-10-10 19:06:12,395 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=454230.0, ans=0.0 2023-10-10 19:06:12,403 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=454230.0, ans=0.125 2023-10-10 19:06:22,631 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.69 vs. limit=6.0 2023-10-10 19:06:37,226 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=454323.3333333333, ans=0.0 2023-10-10 19:07:04,254 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=1.89 vs. 
limit=15.0 2023-10-10 19:07:11,684 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.225e+02 1.617e+02 1.760e+02 1.975e+02 2.464e+02, threshold=3.520e+02, percent-clipped=0.0 2023-10-10 19:07:56,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=454603.3333333333, ans=0.125 2023-10-10 19:08:19,509 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=454696.6666666667, ans=0.0 2023-10-10 19:08:30,597 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=454743.3333333333, ans=0.07 2023-10-10 19:08:36,262 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=454790.0, ans=0.1 2023-10-10 19:08:37,888 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.69 vs. limit=15.0 2023-10-10 19:08:48,232 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.68 vs. limit=10.0 2023-10-10 19:08:51,714 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=454836.6666666667, ans=0.0 2023-10-10 19:09:00,903 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=454883.3333333333, ans=0.0 2023-10-10 19:09:04,404 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 19:09:08,096 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.379e+02 1.667e+02 1.918e+02 2.165e+02 3.283e+02, threshold=3.836e+02, percent-clipped=0.0 2023-10-10 19:09:52,264 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=455116.6666666667, ans=0.125 2023-10-10 19:09:53,790 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.89 vs. limit=22.5 2023-10-10 19:10:03,411 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=455116.6666666667, ans=0.0 2023-10-10 19:10:17,379 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=455163.3333333333, ans=0.125 2023-10-10 19:10:55,211 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=455303.3333333333, ans=0.1 2023-10-10 19:11:10,858 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=455350.0, ans=0.125 2023-10-10 19:11:15,519 INFO [train.py:1031] (0/4) Epoch 8, batch 2000, loss[loss=0.1991, simple_loss=0.2947, pruned_loss=0.0517, over 16810.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.3009, pruned_loss=0.064, over 20761177.12 frames. 
], batch size: 146, lr: 4.68e-03, grad_scale: 32.0 2023-10-10 19:11:16,059 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=455396.6666666667, ans=0.1 2023-10-10 19:11:18,666 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.344e+02 1.641e+02 1.856e+02 2.042e+02 2.982e+02, threshold=3.712e+02, percent-clipped=0.0 2023-10-10 19:11:42,758 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=455490.0, ans=0.125 2023-10-10 19:11:52,445 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=455490.0, ans=0.0 2023-10-10 19:12:19,881 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=455583.3333333333, ans=0.0 2023-10-10 19:12:21,057 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.74 vs. limit=6.0 2023-10-10 19:12:21,193 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.86 vs. limit=22.5 2023-10-10 19:12:31,317 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=455630.0, ans=0.125 2023-10-10 19:12:44,238 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.96 vs. limit=22.5 2023-10-10 19:12:49,240 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=455723.3333333333, ans=0.125 2023-10-10 19:13:30,899 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.297e+02 1.670e+02 1.803e+02 1.969e+02 2.816e+02, threshold=3.607e+02, percent-clipped=0.0 2023-10-10 19:13:55,399 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=455910.0, ans=0.2 2023-10-10 19:14:40,859 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=456050.0, ans=0.125 2023-10-10 19:14:42,205 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=456050.0, ans=0.2 2023-10-10 19:15:02,249 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=456096.6666666667, ans=0.2 2023-10-10 19:15:06,693 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.82 vs. limit=22.5 2023-10-10 19:15:15,475 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=456143.3333333333, ans=0.025 2023-10-10 19:15:20,281 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.49 vs. 
limit=10.0 2023-10-10 19:15:20,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=456190.0, ans=0.0 2023-10-10 19:15:34,226 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=456236.6666666667, ans=0.2 2023-10-10 19:15:49,957 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.23 vs. limit=22.5 2023-10-10 19:15:56,557 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.517e+02 1.952e+02 2.217e+02 2.638e+02 3.557e+02, threshold=4.434e+02, percent-clipped=0.0 2023-10-10 19:16:09,235 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=456376.6666666667, ans=0.0 2023-10-10 19:16:37,113 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=456470.0, ans=0.125 2023-10-10 19:16:42,504 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=456516.6666666667, ans=0.125 2023-10-10 19:16:43,757 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=456516.6666666667, ans=0.1 2023-10-10 19:16:58,094 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=456563.3333333333, ans=0.1 2023-10-10 19:17:03,254 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.52 vs. limit=15.0 2023-10-10 19:17:40,439 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=456750.0, ans=0.125 2023-10-10 19:17:44,956 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.31 vs. limit=15.0 2023-10-10 19:17:49,607 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.351e+02 1.777e+02 1.902e+02 2.152e+02 3.407e+02, threshold=3.804e+02, percent-clipped=0.0 2023-10-10 19:18:02,165 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=456843.3333333333, ans=0.2 2023-10-10 19:18:06,794 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=456843.3333333333, ans=0.2 2023-10-10 19:18:19,309 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=456936.6666666667, ans=0.0 2023-10-10 19:18:24,729 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.72 vs. 
limit=10.0 2023-10-10 19:18:51,560 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=457030.0, ans=0.1 2023-10-10 19:19:00,387 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=457076.6666666667, ans=0.125 2023-10-10 19:19:06,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=457076.6666666667, ans=0.125 2023-10-10 19:19:09,555 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=457123.3333333333, ans=0.0 2023-10-10 19:19:21,130 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=457170.0, ans=0.1 2023-10-10 19:19:28,115 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=457170.0, ans=0.0 2023-10-10 19:19:39,290 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=457216.6666666667, ans=0.125 2023-10-10 19:19:41,961 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 19:19:42,036 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=457263.3333333333, ans=0.0 2023-10-10 19:19:45,721 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.734e+02 1.937e+02 2.104e+02 2.901e+02, threshold=3.874e+02, percent-clipped=0.0 2023-10-10 19:20:04,641 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=457356.6666666667, ans=0.0 2023-10-10 19:20:27,010 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=457450.0, ans=0.125 2023-10-10 19:20:27,961 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=457450.0, ans=0.0 2023-10-10 19:21:05,006 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=457590.0, ans=0.125 2023-10-10 19:21:07,032 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=457636.6666666667, ans=0.125 2023-10-10 19:21:19,077 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=457683.3333333333, ans=0.125 2023-10-10 19:21:25,111 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=457683.3333333333, ans=0.125 2023-10-10 19:21:27,655 INFO [train.py:1031] (0/4) Epoch 8, batch 2500, loss[loss=0.203, simple_loss=0.2877, pruned_loss=0.05916, over 15882.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.3015, pruned_loss=0.06458, over 23452581.67 frames. 
], batch size: 43, lr: 4.67e-03, grad_scale: 16.0 2023-10-10 19:21:32,312 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.402e+02 1.734e+02 1.898e+02 2.162e+02 2.982e+02, threshold=3.797e+02, percent-clipped=0.0 2023-10-10 19:21:32,630 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=457730.0, ans=0.0 2023-10-10 19:22:07,437 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=457870.0, ans=0.1 2023-10-10 19:22:15,184 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 19:22:29,225 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.max_positive, batch_count=457963.3333333333, ans=0.95 2023-10-10 19:22:30,152 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=457963.3333333333, ans=0.125 2023-10-10 19:22:40,615 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=458010.0, ans=0.125 2023-10-10 19:22:41,342 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=458010.0, ans=0.0 2023-10-10 19:22:55,348 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.33 vs. limit=15.0 2023-10-10 19:23:00,692 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=458103.3333333333, ans=0.05 2023-10-10 19:23:11,365 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=458150.0, ans=0.1 2023-10-10 19:23:16,457 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.25 vs. limit=15.0 2023-10-10 19:23:18,439 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.26 vs. limit=22.5 2023-10-10 19:23:21,108 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.386e+02 1.690e+02 1.868e+02 2.066e+02 2.868e+02, threshold=3.736e+02, percent-clipped=0.0 2023-10-10 19:23:31,715 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=458243.3333333333, ans=0.125 2023-10-10 19:23:53,939 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.16 vs. limit=15.0 2023-10-10 19:23:58,122 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.52 vs. limit=8.0 2023-10-10 19:24:20,898 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=458430.0, ans=0.0 2023-10-10 19:24:31,796 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=458476.6666666667, ans=0.0 2023-10-10 19:24:32,222 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.87 vs. 
limit=22.5 2023-10-10 19:24:46,898 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.30 vs. limit=12.0 2023-10-10 19:24:49,165 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=458570.0, ans=0.5 2023-10-10 19:24:50,062 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=458570.0, ans=0.125 2023-10-10 19:25:18,669 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.378e+02 1.695e+02 1.814e+02 2.017e+02 2.886e+02, threshold=3.629e+02, percent-clipped=0.0 2023-10-10 19:25:45,728 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=458756.6666666667, ans=0.0 2023-10-10 19:25:45,852 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=458756.6666666667, ans=0.0 2023-10-10 19:25:47,295 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.67 vs. limit=10.0 2023-10-10 19:25:49,810 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=458803.3333333333, ans=0.1 2023-10-10 19:26:13,455 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=458896.6666666667, ans=0.0 2023-10-10 19:26:21,345 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=458896.6666666667, ans=0.09899494936611666 2023-10-10 19:26:31,131 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=458943.3333333333, ans=0.2 2023-10-10 19:26:31,221 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=458943.3333333333, ans=0.07 2023-10-10 19:26:36,027 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=458943.3333333333, ans=0.125 2023-10-10 19:26:46,908 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=458990.0, ans=0.125 2023-10-10 19:26:48,797 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=458990.0, ans=0.1 2023-10-10 19:27:14,436 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=459083.3333333333, ans=0.0 2023-10-10 19:27:14,469 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=459083.3333333333, ans=0.0 2023-10-10 19:27:14,565 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=459083.3333333333, ans=0.125 2023-10-10 19:27:18,173 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=459130.0, ans=0.125 2023-10-10 19:27:21,129 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=459130.0, ans=0.5 2023-10-10 19:27:21,643 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 
1.362e+02 1.668e+02 1.872e+02 2.019e+02 2.775e+02, threshold=3.743e+02, percent-clipped=0.0 2023-10-10 19:27:35,403 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=459176.6666666667, ans=0.125 2023-10-10 19:27:37,457 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=459176.6666666667, ans=0.1 2023-10-10 19:27:41,087 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=459176.6666666667, ans=0.125 2023-10-10 19:28:03,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=459270.0, ans=0.125 2023-10-10 19:28:10,418 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=459316.6666666667, ans=0.1 2023-10-10 19:28:23,685 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=459363.3333333333, ans=0.125 2023-10-10 19:29:09,996 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=459503.3333333333, ans=0.2 2023-10-10 19:29:18,699 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.72 vs. limit=22.5 2023-10-10 19:29:33,468 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.714e+02 1.862e+02 2.133e+02 3.067e+02, threshold=3.724e+02, percent-clipped=0.0 2023-10-10 19:29:41,464 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=459643.3333333333, ans=0.125 2023-10-10 19:29:47,784 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=459643.3333333333, ans=0.125 2023-10-10 19:29:48,934 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=459643.3333333333, ans=0.2 2023-10-10 19:29:55,981 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 19:30:26,556 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=459783.3333333333, ans=0.125 2023-10-10 19:30:29,388 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=459830.0, ans=0.125 2023-10-10 19:30:41,405 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=459876.6666666667, ans=10.0 2023-10-10 19:30:48,978 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 19:31:12,342 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=459970.0, ans=0.125 2023-10-10 19:31:25,070 INFO [train.py:1031] (0/4) Epoch 8, batch 3000, loss[loss=0.2107, simple_loss=0.2993, pruned_loss=0.06107, over 16855.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.3005, pruned_loss=0.06453, over 25499543.54 frames. 
], batch size: 175, lr: 4.66e-03, grad_scale: 32.0 2023-10-10 19:31:30,573 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=460063.3333333333, ans=0.0 2023-10-10 19:31:31,096 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.376e+02 1.620e+02 1.833e+02 2.039e+02 2.693e+02, threshold=3.667e+02, percent-clipped=0.0 2023-10-10 19:31:37,293 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 19:31:59,166 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=460203.3333333333, ans=0.2 2023-10-10 19:32:12,802 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=460250.0, ans=0.125 2023-10-10 19:32:19,340 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.57 vs. limit=15.0 2023-10-10 19:32:55,827 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=460436.6666666667, ans=0.125 2023-10-10 19:33:12,469 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.10 vs. limit=15.0 2023-10-10 19:33:14,592 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=460483.3333333333, ans=0.125 2023-10-10 19:33:28,695 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.256e+02 1.699e+02 1.981e+02 2.265e+02 3.339e+02, threshold=3.961e+02, percent-clipped=0.0 2023-10-10 19:33:42,281 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.93 vs. limit=6.0 2023-10-10 19:34:00,351 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=460623.3333333333, ans=0.0 2023-10-10 19:34:33,177 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=460763.3333333333, ans=0.125 2023-10-10 19:34:38,516 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=460810.0, ans=0.125 2023-10-10 19:34:43,154 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=460810.0, ans=0.125 2023-10-10 19:34:43,501 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=15.10 vs. limit=22.5 2023-10-10 19:35:00,993 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.26 vs. 
limit=15.0 2023-10-10 19:35:26,408 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=460996.6666666667, ans=0.125 2023-10-10 19:35:29,904 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.285e+02 1.634e+02 1.811e+02 2.162e+02 2.975e+02, threshold=3.623e+02, percent-clipped=0.0 2023-10-10 19:35:32,054 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=460996.6666666667, ans=0.1 2023-10-10 19:35:32,989 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=460996.6666666667, ans=0.0 2023-10-10 19:35:38,609 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.88 vs. limit=6.0 2023-10-10 19:36:39,490 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=461230.0, ans=0.1 2023-10-10 19:37:04,704 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.77 vs. limit=6.0 2023-10-10 19:37:39,598 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.344e+02 1.664e+02 1.777e+02 1.978e+02 2.612e+02, threshold=3.555e+02, percent-clipped=0.0 2023-10-10 19:37:42,317 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=461463.3333333333, ans=0.0 2023-10-10 19:38:16,165 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=461603.3333333333, ans=0.125 2023-10-10 19:38:21,076 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=461603.3333333333, ans=0.2 2023-10-10 19:38:25,180 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=461650.0, ans=0.125 2023-10-10 19:38:29,563 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.18 vs. limit=15.0 2023-10-10 19:38:33,620 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=461696.6666666667, ans=0.125 2023-10-10 19:38:36,130 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=461696.6666666667, ans=0.1 2023-10-10 19:38:38,857 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.82 vs. limit=15.0 2023-10-10 19:38:58,918 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=461790.0, ans=0.1 2023-10-10 19:39:03,412 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.27 vs. 
limit=15.0 2023-10-10 19:39:13,736 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=461836.6666666667, ans=0.125 2023-10-10 19:39:28,720 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=461883.3333333333, ans=0.125 2023-10-10 19:39:30,269 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=461883.3333333333, ans=0.07 2023-10-10 19:39:39,489 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.708e+02 1.969e+02 2.305e+02 3.490e+02, threshold=3.938e+02, percent-clipped=0.0 2023-10-10 19:39:45,360 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=461976.6666666667, ans=0.07 2023-10-10 19:40:01,698 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=462023.3333333333, ans=0.125 2023-10-10 19:40:06,592 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=462023.3333333333, ans=0.2 2023-10-10 19:40:08,658 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=462070.0, ans=0.0 2023-10-10 19:40:17,892 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=462070.0, ans=0.125 2023-10-10 19:40:18,929 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=462116.6666666667, ans=0.1 2023-10-10 19:40:42,153 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=462210.0, ans=0.125 2023-10-10 19:40:42,511 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.21 vs. limit=22.5 2023-10-10 19:41:28,858 INFO [train.py:1031] (0/4) Epoch 8, batch 3500, loss[loss=0.3053, simple_loss=0.3529, pruned_loss=0.1288, over 15614.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.3001, pruned_loss=0.06429, over 27107073.36 frames. ], batch size: 350, lr: 4.65e-03, grad_scale: 32.0 2023-10-10 19:41:33,659 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.365e+02 1.706e+02 1.905e+02 2.085e+02 2.469e+02, threshold=3.810e+02, percent-clipped=0.0 2023-10-10 19:41:37,392 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=462396.6666666667, ans=0.1 2023-10-10 19:41:49,146 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=462443.3333333333, ans=0.2 2023-10-10 19:42:02,563 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=462536.6666666667, ans=0.2 2023-10-10 19:42:02,973 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.05 vs. 
limit=10.0 2023-10-10 19:42:12,005 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=462536.6666666667, ans=0.09899494936611666 2023-10-10 19:42:16,353 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=462583.3333333333, ans=0.1 2023-10-10 19:42:41,884 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=462676.6666666667, ans=0.09899494936611666 2023-10-10 19:42:49,204 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=462676.6666666667, ans=0.0 2023-10-10 19:42:56,025 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=462723.3333333333, ans=0.125 2023-10-10 19:43:04,507 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=462723.3333333333, ans=0.125 2023-10-10 19:43:23,448 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.64 vs. limit=6.0 2023-10-10 19:43:36,626 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.335e+02 1.692e+02 1.904e+02 2.331e+02 3.234e+02, threshold=3.809e+02, percent-clipped=0.0 2023-10-10 19:43:51,431 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=462910.0, ans=0.0 2023-10-10 19:44:23,430 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=463050.0, ans=0.0 2023-10-10 19:44:50,639 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=463143.3333333333, ans=0.1 2023-10-10 19:45:20,480 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.20 vs. limit=15.0 2023-10-10 19:45:22,019 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=463283.3333333333, ans=0.125 2023-10-10 19:45:39,371 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.309e+02 1.582e+02 1.768e+02 1.979e+02 2.564e+02, threshold=3.535e+02, percent-clipped=0.0 2023-10-10 19:45:53,610 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=463376.6666666667, ans=0.0 2023-10-10 19:46:06,523 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.93 vs. 
limit=12.0 2023-10-10 19:46:14,477 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=463470.0, ans=0.125 2023-10-10 19:46:40,022 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=463563.3333333333, ans=0.2 2023-10-10 19:46:53,931 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=463610.0, ans=0.125 2023-10-10 19:47:27,278 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=463750.0, ans=0.125 2023-10-10 19:47:29,131 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=463750.0, ans=0.125 2023-10-10 19:47:31,621 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=463750.0, ans=0.0 2023-10-10 19:47:39,660 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=463796.6666666667, ans=0.125 2023-10-10 19:47:42,979 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.372e+02 1.639e+02 1.900e+02 2.227e+02 3.211e+02, threshold=3.801e+02, percent-clipped=0.0 2023-10-10 19:47:55,747 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=463843.3333333333, ans=0.125 2023-10-10 19:48:00,434 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=463890.0, ans=0.2 2023-10-10 19:48:01,760 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=463890.0, ans=0.125 2023-10-10 19:48:02,093 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.82 vs. 
limit=15.0 2023-10-10 19:48:07,260 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=463890.0, ans=0.0 2023-10-10 19:48:28,895 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=463983.3333333333, ans=0.125 2023-10-10 19:48:29,886 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=1.270e-02 2023-10-10 19:48:43,500 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=464030.0, ans=0.125 2023-10-10 19:48:52,541 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=464076.6666666667, ans=0.125 2023-10-10 19:48:53,533 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=464076.6666666667, ans=0.125 2023-10-10 19:49:13,842 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=464170.0, ans=0.125 2023-10-10 19:49:17,894 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=464170.0, ans=0.125 2023-10-10 19:49:34,431 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=464263.3333333333, ans=0.125 2023-10-10 19:49:37,140 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.331e+02 1.681e+02 1.880e+02 2.265e+02 3.157e+02, threshold=3.760e+02, percent-clipped=0.0 2023-10-10 19:49:51,724 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=464310.0, ans=0.07 2023-10-10 19:50:51,435 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=464543.3333333333, ans=0.2 2023-10-10 19:51:02,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=464590.0, ans=0.125 2023-10-10 19:51:10,656 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=464636.6666666667, ans=0.125 2023-10-10 19:51:18,094 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=464683.3333333333, ans=0.1 2023-10-10 19:51:27,484 INFO [train.py:1031] (0/4) Epoch 8, batch 4000, loss[loss=0.2117, simple_loss=0.2656, pruned_loss=0.07892, over 12201.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2995, pruned_loss=0.06412, over 28352915.08 frames. ], batch size: 440, lr: 4.64e-03, grad_scale: 32.0 2023-10-10 19:51:34,404 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.298e+02 1.704e+02 1.889e+02 2.167e+02 2.855e+02, threshold=3.779e+02, percent-clipped=0.0 2023-10-10 19:51:36,155 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.35 vs. 
limit=15.0 2023-10-10 19:51:44,888 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=464776.6666666667, ans=0.0 2023-10-10 19:51:48,139 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.21 vs. limit=6.0 2023-10-10 19:52:00,744 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=464823.3333333333, ans=0.2 2023-10-10 19:52:06,486 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.63 vs. limit=6.0 2023-10-10 19:52:28,585 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=464963.3333333333, ans=0.125 2023-10-10 19:52:48,525 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=465010.0, ans=0.0 2023-10-10 19:52:48,930 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.51 vs. limit=6.0 2023-10-10 19:53:08,805 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.82 vs. limit=15.0 2023-10-10 19:53:11,665 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.70 vs. limit=22.5 2023-10-10 19:53:28,581 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.22 vs. 
limit=6.0 2023-10-10 19:53:32,142 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=465196.6666666667, ans=0.125 2023-10-10 19:53:32,624 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.464e+02 1.789e+02 1.948e+02 2.277e+02 3.455e+02, threshold=3.896e+02, percent-clipped=0.0 2023-10-10 19:53:46,409 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=465243.3333333333, ans=0.0 2023-10-10 19:53:48,330 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=465243.3333333333, ans=0.125 2023-10-10 19:53:49,727 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=465290.0, ans=0.125 2023-10-10 19:54:01,542 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=465336.6666666667, ans=0.2 2023-10-10 19:54:18,054 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=465383.3333333333, ans=0.125 2023-10-10 19:54:24,449 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=465383.3333333333, ans=0.0 2023-10-10 19:55:05,238 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=465523.3333333333, ans=0.1 2023-10-10 19:55:35,149 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=465616.6666666667, ans=0.125 2023-10-10 19:55:37,180 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=465616.6666666667, ans=0.1 2023-10-10 19:55:46,224 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.458e+02 1.624e+02 1.791e+02 1.975e+02 2.839e+02, threshold=3.581e+02, percent-clipped=0.0 2023-10-10 19:55:49,912 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=465663.3333333333, ans=0.125 2023-10-10 19:55:53,838 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=465710.0, ans=0.05 2023-10-10 19:56:09,396 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=465756.6666666667, ans=0.0 2023-10-10 19:56:27,876 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.49 vs. limit=15.0 2023-10-10 19:56:58,993 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.42 vs. limit=12.0 2023-10-10 19:57:21,736 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=466036.6666666667, ans=0.125 2023-10-10 19:57:32,636 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=466083.3333333333, ans=0.125 2023-10-10 19:57:35,953 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.74 vs. 
limit=15.0 2023-10-10 19:57:44,210 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.753e+02 1.904e+02 2.130e+02 3.789e+02, threshold=3.808e+02, percent-clipped=1.0 2023-10-10 19:58:32,283 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=466316.6666666667, ans=0.0 2023-10-10 19:58:53,269 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 19:58:59,223 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=466410.0, ans=0.1 2023-10-10 19:59:13,226 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.71 vs. limit=6.0 2023-10-10 19:59:17,286 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=466503.3333333333, ans=0.0 2023-10-10 19:59:18,599 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=466503.3333333333, ans=0.04949747468305833 2023-10-10 19:59:41,640 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.496e+02 1.749e+02 1.937e+02 2.197e+02 3.077e+02, threshold=3.875e+02, percent-clipped=0.0 2023-10-10 19:59:45,699 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.29 vs. limit=10.0 2023-10-10 19:59:49,142 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=466643.3333333333, ans=0.125 2023-10-10 19:59:57,632 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=466643.3333333333, ans=0.125 2023-10-10 20:00:09,268 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=466690.0, ans=0.125 2023-10-10 20:00:32,598 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=466783.3333333333, ans=0.07 2023-10-10 20:00:58,931 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=466876.6666666667, ans=0.0 2023-10-10 20:01:16,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=466923.3333333333, ans=0.125 2023-10-10 20:01:23,861 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.07 vs. limit=22.5 2023-10-10 20:01:31,705 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.10 vs. limit=15.0 2023-10-10 20:01:45,057 INFO [train.py:1031] (0/4) Epoch 8, batch 4500, loss[loss=0.2177, simple_loss=0.3088, pruned_loss=0.06327, over 16825.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2998, pruned_loss=0.06397, over 29327321.60 frames. 
], batch size: 146, lr: 4.63e-03, grad_scale: 32.0 2023-10-10 20:01:47,506 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=467063.3333333333, ans=0.0 2023-10-10 20:01:47,535 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=467063.3333333333, ans=0.2 2023-10-10 20:01:51,223 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.379e+02 1.700e+02 1.957e+02 2.302e+02 3.915e+02, threshold=3.913e+02, percent-clipped=1.0 2023-10-10 20:02:11,142 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 20:02:20,704 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.93 vs. limit=22.5 2023-10-10 20:02:29,090 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=467203.3333333333, ans=0.1 2023-10-10 20:02:47,769 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=467296.6666666667, ans=0.09899494936611666 2023-10-10 20:03:01,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=467343.3333333333, ans=0.2 2023-10-10 20:03:12,431 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=467390.0, ans=10.0 2023-10-10 20:03:32,116 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=467483.3333333333, ans=0.025 2023-10-10 20:03:46,955 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.385e+02 1.721e+02 1.879e+02 2.167e+02 3.100e+02, threshold=3.759e+02, percent-clipped=0.0 2023-10-10 20:03:49,671 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.31 vs. limit=15.0 2023-10-10 20:04:14,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=467670.0, ans=0.0 2023-10-10 20:04:14,174 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=467670.0, ans=0.125 2023-10-10 20:04:19,354 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=467670.0, ans=0.0 2023-10-10 20:04:21,592 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.43 vs. limit=15.0 2023-10-10 20:04:24,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=467716.6666666667, ans=0.0 2023-10-10 20:04:32,665 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=16.91 vs. 
limit=15.0 2023-10-10 20:04:39,251 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=467763.3333333333, ans=0.0 2023-10-10 20:04:40,221 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=467763.3333333333, ans=0.05 2023-10-10 20:04:52,338 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.19 vs. limit=22.5 2023-10-10 20:05:07,018 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=467856.6666666667, ans=10.0 2023-10-10 20:05:08,724 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=467856.6666666667, ans=0.125 2023-10-10 20:05:09,981 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=467856.6666666667, ans=0.2 2023-10-10 20:05:11,876 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=467903.3333333333, ans=0.125 2023-10-10 20:05:14,475 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=467903.3333333333, ans=0.125 2023-10-10 20:05:31,546 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=467950.0, ans=0.125 2023-10-10 20:05:33,190 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=467996.6666666667, ans=0.1 2023-10-10 20:05:38,838 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.325e+02 1.753e+02 1.992e+02 2.241e+02 3.342e+02, threshold=3.984e+02, percent-clipped=0.0 2023-10-10 20:06:18,549 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=468136.6666666667, ans=0.1 2023-10-10 20:06:30,478 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=468230.0, ans=0.0 2023-10-10 20:06:47,086 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=468276.6666666667, ans=0.0 2023-10-10 20:07:00,804 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=468370.0, ans=0.125 2023-10-10 20:07:11,892 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=468416.6666666667, ans=0.1 2023-10-10 20:07:21,727 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=468463.3333333333, ans=0.125 2023-10-10 20:07:28,217 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.375e+02 1.693e+02 1.861e+02 2.106e+02 3.160e+02, threshold=3.721e+02, percent-clipped=0.0 2023-10-10 20:07:29,284 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=468463.3333333333, ans=0.0 2023-10-10 20:07:41,156 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=468510.0, ans=0.0 2023-10-10 20:07:52,204 INFO [scaling.py:979] (0/4) Whitening: 
name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.38 vs. limit=15.0 2023-10-10 20:07:54,289 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=468556.6666666667, ans=0.0 2023-10-10 20:07:59,482 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=468556.6666666667, ans=0.07 2023-10-10 20:08:00,399 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=468603.3333333333, ans=0.0 2023-10-10 20:08:03,142 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=468603.3333333333, ans=0.125 2023-10-10 20:08:09,564 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.42 vs. limit=22.5 2023-10-10 20:08:15,415 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.08 vs. limit=15.0 2023-10-10 20:08:19,379 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.07 vs. limit=15.0 2023-10-10 20:08:34,531 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=468696.6666666667, ans=0.09899494936611666 2023-10-10 20:08:55,582 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=468790.0, ans=0.125 2023-10-10 20:08:59,175 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=468836.6666666667, ans=0.0 2023-10-10 20:09:26,476 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.432e+02 1.708e+02 1.895e+02 2.121e+02 3.347e+02, threshold=3.789e+02, percent-clipped=0.0 2023-10-10 20:09:36,513 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=468976.6666666667, ans=0.125 2023-10-10 20:09:47,091 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.63 vs. limit=22.5 2023-10-10 20:10:24,435 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=469116.6666666667, ans=0.0 2023-10-10 20:10:33,152 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=469163.3333333333, ans=0.1 2023-10-10 20:10:45,260 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=469210.0, ans=0.09899494936611666 2023-10-10 20:10:58,331 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.22 vs. limit=15.0 2023-10-10 20:10:59,963 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=469303.3333333333, ans=0.125 2023-10-10 20:11:21,558 INFO [train.py:1031] (0/4) Epoch 8, batch 5000, loss[loss=0.2082, simple_loss=0.3011, pruned_loss=0.05768, over 16847.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2996, pruned_loss=0.06396, over 30118689.61 frames. 
], batch size: 87, lr: 4.61e-03, grad_scale: 32.0 2023-10-10 20:11:26,624 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 1.737e+02 1.940e+02 2.252e+02 3.066e+02, threshold=3.881e+02, percent-clipped=0.0 2023-10-10 20:11:56,558 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=469536.6666666667, ans=0.125 2023-10-10 20:12:02,563 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=469536.6666666667, ans=0.0 2023-10-10 20:12:05,028 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=469536.6666666667, ans=0.125 2023-10-10 20:12:07,808 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=469583.3333333333, ans=0.2 2023-10-10 20:12:08,743 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=469583.3333333333, ans=0.125 2023-10-10 20:12:20,386 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=469630.0, ans=0.0 2023-10-10 20:12:24,660 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=469630.0, ans=0.2 2023-10-10 20:13:25,902 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.86 vs. limit=15.0 2023-10-10 20:13:26,070 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.496e+02 1.711e+02 1.855e+02 2.065e+02 2.861e+02, threshold=3.711e+02, percent-clipped=0.0 2023-10-10 20:13:34,697 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_abs, batch_count=469910.0, ans=0.5 2023-10-10 20:14:10,472 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=470050.0, ans=0.125 2023-10-10 20:14:10,538 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=470050.0, ans=0.125 2023-10-10 20:14:15,970 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=470050.0, ans=0.125 2023-10-10 20:14:24,813 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 20:14:31,321 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.55 vs. limit=22.5 2023-10-10 20:15:05,384 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.71 vs. limit=6.0 2023-10-10 20:15:13,140 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.19 vs. 
limit=15.0 2023-10-10 20:15:18,248 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.336e+02 1.724e+02 1.969e+02 2.190e+02 3.408e+02, threshold=3.937e+02, percent-clipped=0.0 2023-10-10 20:15:24,388 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=470376.6666666667, ans=0.125 2023-10-10 20:15:33,576 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=470376.6666666667, ans=0.125 2023-10-10 20:15:42,080 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.89 vs. limit=22.5 2023-10-10 20:16:07,462 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=470516.6666666667, ans=0.0 2023-10-10 20:16:09,346 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=470516.6666666667, ans=0.125 2023-10-10 20:16:15,564 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=470563.3333333333, ans=0.125 2023-10-10 20:16:25,589 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=470610.0, ans=0.95 2023-10-10 20:16:31,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=470610.0, ans=0.2 2023-10-10 20:17:02,300 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=470750.0, ans=0.0 2023-10-10 20:17:17,677 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.365e+02 1.659e+02 1.846e+02 2.005e+02 3.082e+02, threshold=3.691e+02, percent-clipped=0.0 2023-10-10 20:17:21,248 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.29 vs. limit=10.0 2023-10-10 20:17:38,268 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=470890.0, ans=0.2 2023-10-10 20:18:13,073 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.72 vs. limit=10.0 2023-10-10 20:18:20,077 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=471030.0, ans=0.1 2023-10-10 20:18:44,679 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=471123.3333333333, ans=0.0 2023-10-10 20:18:52,640 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=471170.0, ans=0.2 2023-10-10 20:19:06,660 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=471216.6666666667, ans=0.125 2023-10-10 20:19:10,827 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=471263.3333333333, ans=0.05 2023-10-10 20:19:11,080 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.42 vs. 
limit=15.0 2023-10-10 20:19:14,021 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.99 vs. limit=22.5 2023-10-10 20:19:15,216 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.285e+02 1.623e+02 1.777e+02 2.124e+02 2.881e+02, threshold=3.553e+02, percent-clipped=0.0 2023-10-10 20:19:20,998 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=471310.0, ans=0.1 2023-10-10 20:19:24,617 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=471310.0, ans=0.125 2023-10-10 20:20:06,702 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.29 vs. limit=10.0 2023-10-10 20:20:09,720 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=471496.6666666667, ans=0.1 2023-10-10 20:20:21,795 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=471543.3333333333, ans=10.0 2023-10-10 20:20:57,683 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=471683.3333333333, ans=0.0 2023-10-10 20:21:03,670 INFO [train.py:1031] (0/4) Epoch 8, batch 5500, loss[loss=0.2059, simple_loss=0.2955, pruned_loss=0.05819, over 16855.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2995, pruned_loss=0.06372, over 30725623.51 frames. ], batch size: 146, lr: 4.60e-03, grad_scale: 16.0 2023-10-10 20:21:10,764 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.383e+02 1.747e+02 1.995e+02 2.420e+02 4.219e+02, threshold=3.990e+02, percent-clipped=2.0 2023-10-10 20:21:17,130 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=471776.6666666667, ans=0.125 2023-10-10 20:21:29,493 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=471823.3333333333, ans=0.1 2023-10-10 20:21:31,819 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=471823.3333333333, ans=0.125 2023-10-10 20:21:47,734 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 20:21:53,005 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=471916.6666666667, ans=0.125 2023-10-10 20:22:00,769 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=471963.3333333333, ans=0.125 2023-10-10 20:22:09,197 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=471963.3333333333, ans=0.0 2023-10-10 20:22:10,807 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.87 vs. limit=15.0 2023-10-10 20:22:23,453 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.03 vs. 
limit=15.0 2023-10-10 20:22:23,533 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.63 vs. limit=10.0 2023-10-10 20:22:38,782 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.45 vs. limit=15.0 2023-10-10 20:22:39,792 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=472103.3333333333, ans=0.015 2023-10-10 20:22:49,229 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=472150.0, ans=0.125 2023-10-10 20:22:52,205 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=472150.0, ans=0.125 2023-10-10 20:23:04,724 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.322e+02 1.689e+02 1.951e+02 2.231e+02 3.232e+02, threshold=3.902e+02, percent-clipped=0.0 2023-10-10 20:23:09,319 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.26 vs. limit=15.0 2023-10-10 20:23:15,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=472243.3333333333, ans=0.125 2023-10-10 20:23:35,679 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.42 vs. limit=22.5 2023-10-10 20:23:44,063 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=472383.3333333333, ans=0.2 2023-10-10 20:23:47,217 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=472383.3333333333, ans=10.0 2023-10-10 20:23:52,230 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=472383.3333333333, ans=0.0 2023-10-10 20:23:58,351 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=472430.0, ans=0.2 2023-10-10 20:24:02,925 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=472430.0, ans=0.0 2023-10-10 20:24:10,522 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=472476.6666666667, ans=0.07 2023-10-10 20:24:13,770 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.52 vs. limit=15.0 2023-10-10 20:24:32,711 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=472570.0, ans=0.0 2023-10-10 20:25:01,320 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.334e+02 1.765e+02 2.000e+02 2.318e+02 3.494e+02, threshold=4.001e+02, percent-clipped=0.0 2023-10-10 20:25:07,415 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2.whitening_limit, batch_count=472710.0, ans=15.0 2023-10-10 20:25:26,119 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.14 vs. 
limit=15.0 2023-10-10 20:25:39,115 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=472803.3333333333, ans=0.125 2023-10-10 20:26:10,581 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=472943.3333333333, ans=0.1 2023-10-10 20:26:27,920 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=472990.0, ans=0.125 2023-10-10 20:26:52,000 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=473083.3333333333, ans=0.2 2023-10-10 20:26:58,444 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=473130.0, ans=0.125 2023-10-10 20:26:58,804 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.62 vs. limit=15.0 2023-10-10 20:27:05,300 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 1.713e+02 1.881e+02 2.225e+02 3.408e+02, threshold=3.762e+02, percent-clipped=0.0 2023-10-10 20:27:20,263 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten.whitening_limit, batch_count=473176.6666666667, ans=22.5 2023-10-10 20:27:30,088 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=473223.3333333333, ans=0.125 2023-10-10 20:27:52,447 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=473316.6666666667, ans=0.2 2023-10-10 20:28:02,034 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=473363.3333333333, ans=0.125 2023-10-10 20:28:06,126 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=473363.3333333333, ans=0.0 2023-10-10 20:28:06,842 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=473363.3333333333, ans=0.125 2023-10-10 20:28:20,333 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=473456.6666666667, ans=0.125 2023-10-10 20:28:22,916 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=473456.6666666667, ans=0.125 2023-10-10 20:28:41,702 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=473503.3333333333, ans=0.025 2023-10-10 20:29:01,721 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=473596.6666666667, ans=0.2 2023-10-10 20:29:03,496 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=473596.6666666667, ans=0.125 2023-10-10 20:29:06,415 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.324e+02 1.645e+02 1.858e+02 2.035e+02 3.365e+02, threshold=3.715e+02, percent-clipped=0.0 2023-10-10 20:29:42,919 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=473736.6666666667, ans=0.125 2023-10-10 20:29:45,915 INFO 
[scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=473736.6666666667, ans=0.125 2023-10-10 20:29:46,913 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=473783.3333333333, ans=0.09899494936611666 2023-10-10 20:30:03,415 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.80 vs. limit=15.0 2023-10-10 20:30:25,959 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=473923.3333333333, ans=0.125 2023-10-10 20:30:30,009 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=473923.3333333333, ans=0.125 2023-10-10 20:30:37,710 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=473970.0, ans=0.125 2023-10-10 20:30:46,572 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=474016.6666666667, ans=0.2 2023-10-10 20:30:56,179 INFO [train.py:1031] (0/4) Epoch 8, batch 6000, loss[loss=0.2095, simple_loss=0.2735, pruned_loss=0.07275, over 12354.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2998, pruned_loss=0.06395, over 31192958.61 frames. ], batch size: 440, lr: 4.59e-03, grad_scale: 32.0 2023-10-10 20:31:03,554 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=474063.3333333333, ans=0.125 2023-10-10 20:31:04,223 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.383e+02 1.711e+02 1.821e+02 1.995e+02 2.939e+02, threshold=3.642e+02, percent-clipped=0.0 2023-10-10 20:31:19,487 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=474156.6666666667, ans=0.05 2023-10-10 20:31:41,609 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.30 vs. limit=22.5 2023-10-10 20:31:57,549 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=474296.6666666667, ans=0.125 2023-10-10 20:31:59,072 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 20:32:14,232 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.23 vs. 
limit=15.0 2023-10-10 20:32:15,174 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=474343.3333333333, ans=0.125 2023-10-10 20:32:16,035 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=474343.3333333333, ans=0.125 2023-10-10 20:32:20,735 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=474390.0, ans=0.125 2023-10-10 20:32:22,552 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=474390.0, ans=0.09899494936611666 2023-10-10 20:32:32,402 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.10 vs. limit=15.0 2023-10-10 20:33:03,944 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.363e+02 1.669e+02 1.914e+02 2.181e+02 3.600e+02, threshold=3.828e+02, percent-clipped=0.0 2023-10-10 20:33:14,845 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=474576.6666666667, ans=0.125 2023-10-10 20:33:18,911 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=474623.3333333333, ans=0.0 2023-10-10 20:33:29,619 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=474623.3333333333, ans=0.1 2023-10-10 20:33:34,379 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=474670.0, ans=0.2 2023-10-10 20:33:48,623 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=474716.6666666667, ans=0.0 2023-10-10 20:34:18,392 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=474810.0, ans=0.125 2023-10-10 20:34:27,824 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=474856.6666666667, ans=0.125 2023-10-10 20:34:38,242 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.81 vs. limit=22.5 2023-10-10 20:34:50,284 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=474950.0, ans=0.0 2023-10-10 20:35:02,790 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.799e+02 2.007e+02 2.466e+02 3.282e+02, threshold=4.013e+02, percent-clipped=0.0 2023-10-10 20:35:09,645 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 20:35:26,510 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.26 vs. 
limit=15.0 2023-10-10 20:35:33,993 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten.whitening_limit, batch_count=475136.6666666667, ans=15.0 2023-10-10 20:35:39,218 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=475136.6666666667, ans=0.0 2023-10-10 20:36:14,057 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.45 vs. limit=22.5 2023-10-10 20:36:25,551 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.63 vs. limit=22.5 2023-10-10 20:36:28,678 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=475323.3333333333, ans=0.125 2023-10-10 20:36:40,989 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.60 vs. limit=10.0 2023-10-10 20:36:51,807 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.33 vs. limit=15.0 2023-10-10 20:36:57,092 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=475416.6666666667, ans=0.2 2023-10-10 20:37:08,146 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.758e+02 1.947e+02 2.235e+02 3.118e+02, threshold=3.893e+02, percent-clipped=0.0 2023-10-10 20:37:19,690 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=475510.0, ans=0.125 2023-10-10 20:37:23,295 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.46 vs. limit=5.0 2023-10-10 20:37:24,886 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=475556.6666666667, ans=0.1 2023-10-10 20:37:39,909 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=475603.3333333333, ans=0.125 2023-10-10 20:37:42,834 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.11 vs. 
limit=15.0 2023-10-10 20:38:00,000 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=475650.0, ans=0.125 2023-10-10 20:38:10,641 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=475696.6666666667, ans=0.2 2023-10-10 20:38:16,694 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=475743.3333333333, ans=0.125 2023-10-10 20:38:21,132 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=475743.3333333333, ans=10.0 2023-10-10 20:38:32,926 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-10 20:38:32,948 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=475790.0, ans=0.125 2023-10-10 20:38:48,648 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.00 vs. limit=15.0 2023-10-10 20:38:50,590 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.41 vs. limit=22.5 2023-10-10 20:39:03,249 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=475883.3333333333, ans=0.0 2023-10-10 20:39:20,198 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.422e+02 1.653e+02 1.864e+02 2.131e+02 2.951e+02, threshold=3.727e+02, percent-clipped=0.0 2023-10-10 20:39:31,755 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=475976.6666666667, ans=0.0 2023-10-10 20:40:04,117 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=476116.6666666667, ans=0.2 2023-10-10 20:40:16,106 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 20:40:31,135 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=476256.6666666667, ans=0.2 2023-10-10 20:40:41,234 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=476256.6666666667, ans=6.0 2023-10-10 20:40:44,353 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=476256.6666666667, ans=0.95 2023-10-10 20:40:51,443 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=476303.3333333333, ans=10.0 2023-10-10 20:41:02,631 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=476350.0, ans=0.125 2023-10-10 20:41:06,606 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=476350.0, ans=0.125 2023-10-10 20:41:08,573 INFO [train.py:1031] (0/4) Epoch 8, batch 6500, loss[loss=0.225, simple_loss=0.3144, pruned_loss=0.06782, over 16967.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.3002, pruned_loss=0.06421, over 31528450.52 frames. 
], batch size: 77, lr: 4.58e-03, grad_scale: 32.0 2023-10-10 20:41:11,131 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.19 vs. limit=15.0 2023-10-10 20:41:18,368 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.393e+02 1.746e+02 1.963e+02 2.272e+02 3.578e+02, threshold=3.925e+02, percent-clipped=0.0 2023-10-10 20:41:29,452 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=476443.3333333333, ans=0.125 2023-10-10 20:41:32,707 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=476443.3333333333, ans=0.125 2023-10-10 20:41:33,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.max_abs, batch_count=476490.0, ans=10.0 2023-10-10 20:41:50,924 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=476536.6666666667, ans=0.05 2023-10-10 20:41:53,863 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=476536.6666666667, ans=0.1 2023-10-10 20:42:00,966 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=476536.6666666667, ans=0.0 2023-10-10 20:42:16,332 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 20:42:17,773 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.16 vs. limit=10.0 2023-10-10 20:42:22,283 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.95 vs. limit=6.0 2023-10-10 20:42:31,616 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=476676.6666666667, ans=0.0 2023-10-10 20:42:39,563 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=476723.3333333333, ans=0.125 2023-10-10 20:43:01,794 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=476770.0, ans=0.125 2023-10-10 20:43:02,819 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=476770.0, ans=0.0 2023-10-10 20:43:03,931 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.65 vs. limit=10.0 2023-10-10 20:43:09,188 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=476816.6666666667, ans=0.0 2023-10-10 20:43:26,254 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.460e+02 1.794e+02 2.019e+02 2.201e+02 3.279e+02, threshold=4.037e+02, percent-clipped=0.0 2023-10-10 20:43:37,648 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.68 vs. 
limit=6.0 2023-10-10 20:43:49,623 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=476956.6666666667, ans=0.125 2023-10-10 20:43:59,912 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=477003.3333333333, ans=0.1 2023-10-10 20:44:01,643 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=477003.3333333333, ans=0.5 2023-10-10 20:44:01,665 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=477003.3333333333, ans=0.125 2023-10-10 20:44:19,277 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=477096.6666666667, ans=0.0 2023-10-10 20:44:27,786 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=477096.6666666667, ans=0.1 2023-10-10 20:44:36,518 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=477143.3333333333, ans=0.2 2023-10-10 20:44:37,859 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=477143.3333333333, ans=0.125 2023-10-10 20:45:16,550 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=477283.3333333333, ans=0.125 2023-10-10 20:45:25,119 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.604e+02 1.805e+02 2.018e+02 2.481e+02, threshold=3.609e+02, percent-clipped=0.0 2023-10-10 20:45:32,207 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=477376.6666666667, ans=0.2 2023-10-10 20:45:50,815 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=477470.0, ans=0.0 2023-10-10 20:46:04,329 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=477516.6666666667, ans=0.125 2023-10-10 20:46:23,928 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=477563.3333333333, ans=0.125 2023-10-10 20:46:38,240 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=477610.0, ans=0.125 2023-10-10 20:46:53,164 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=477703.3333333333, ans=0.125 2023-10-10 20:47:00,280 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=477703.3333333333, ans=0.125 2023-10-10 20:47:14,907 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=477750.0, ans=0.125 2023-10-10 20:47:30,631 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.663e+02 1.840e+02 2.017e+02 2.985e+02, threshold=3.680e+02, percent-clipped=0.0 2023-10-10 20:47:43,081 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=477843.3333333333, ans=0.125 2023-10-10 20:47:49,285 INFO 
[scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=477843.3333333333, ans=0.0 2023-10-10 20:48:05,798 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=477936.6666666667, ans=0.1 2023-10-10 20:48:22,718 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=477983.3333333333, ans=0.125 2023-10-10 20:48:36,410 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=478030.0, ans=0.125 2023-10-10 20:49:04,958 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=478123.3333333333, ans=0.0 2023-10-10 20:49:06,028 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=478123.3333333333, ans=0.0 2023-10-10 20:49:16,953 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=478170.0, ans=0.125 2023-10-10 20:49:21,580 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=478170.0, ans=0.0 2023-10-10 20:49:29,449 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=478216.6666666667, ans=0.05 2023-10-10 20:49:29,596 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.77 vs. limit=15.0 2023-10-10 20:49:30,758 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=478216.6666666667, ans=0.0 2023-10-10 20:49:33,647 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=478216.6666666667, ans=0.2 2023-10-10 20:49:41,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=478263.3333333333, ans=0.125 2023-10-10 20:49:45,684 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=6.98 vs. 
limit=15.0 2023-10-10 20:49:47,067 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.357e+02 1.642e+02 1.812e+02 2.106e+02 3.113e+02, threshold=3.625e+02, percent-clipped=0.0 2023-10-10 20:49:47,299 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=478263.3333333333, ans=0.035 2023-10-10 20:50:03,356 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=478356.6666666667, ans=0.125 2023-10-10 20:50:07,245 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=478356.6666666667, ans=0.125 2023-10-10 20:50:07,329 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=478356.6666666667, ans=0.0 2023-10-10 20:50:07,372 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=478356.6666666667, ans=0.07 2023-10-10 20:50:21,849 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=478450.0, ans=0.0 2023-10-10 20:50:29,503 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.37 vs. limit=15.0 2023-10-10 20:51:06,306 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=478636.6666666667, ans=0.1 2023-10-10 20:51:19,478 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=478683.3333333333, ans=0.125 2023-10-10 20:51:25,143 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=478683.3333333333, ans=0.125 2023-10-10 20:51:26,675 INFO [train.py:1031] (0/4) Epoch 8, batch 7000, loss[loss=0.2102, simple_loss=0.2965, pruned_loss=0.0619, over 16356.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.3007, pruned_loss=0.06401, over 31842718.41 frames. 
], batch size: 50, lr: 4.57e-03, grad_scale: 16.0 2023-10-10 20:51:31,389 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=478730.0, ans=0.1 2023-10-10 20:51:38,624 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.754e+02 1.865e+02 2.126e+02 2.843e+02, threshold=3.731e+02, percent-clipped=0.0 2023-10-10 20:51:39,482 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=478730.0, ans=0.125 2023-10-10 20:52:28,962 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=478916.6666666667, ans=0.125 2023-10-10 20:52:34,306 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=478963.3333333333, ans=0.125 2023-10-10 20:52:40,713 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 20:52:58,116 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=479056.6666666667, ans=0.125 2023-10-10 20:53:11,209 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=479103.3333333333, ans=0.125 2023-10-10 20:53:18,011 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=479150.0, ans=0.1 2023-10-10 20:53:18,069 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=479150.0, ans=0.125 2023-10-10 20:53:29,313 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=479196.6666666667, ans=0.0 2023-10-10 20:53:37,224 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.428e+02 1.730e+02 1.929e+02 2.124e+02 2.702e+02, threshold=3.858e+02, percent-clipped=0.0 2023-10-10 20:53:38,651 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=479243.3333333333, ans=0.0 2023-10-10 20:53:39,862 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=10.51 vs. limit=15.0 2023-10-10 20:53:47,286 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.89 vs. limit=15.0 2023-10-10 20:54:11,887 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=479336.6666666667, ans=0.125 2023-10-10 20:54:12,832 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=479336.6666666667, ans=0.2 2023-10-10 20:54:14,150 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.11 vs. limit=22.5 2023-10-10 20:54:19,940 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.26 vs. 
limit=10.0 2023-10-10 20:54:28,219 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=479430.0, ans=0.1 2023-10-10 20:54:29,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=479430.0, ans=0.125 2023-10-10 20:54:40,115 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=479476.6666666667, ans=0.2 2023-10-10 20:54:43,420 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.04 vs. limit=15.0 2023-10-10 20:54:49,440 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.58 vs. limit=12.0 2023-10-10 20:55:03,718 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=479570.0, ans=0.125 2023-10-10 20:55:04,567 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=479570.0, ans=0.0 2023-10-10 20:55:06,562 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=479570.0, ans=0.125 2023-10-10 20:55:21,159 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=479663.3333333333, ans=0.125 2023-10-10 20:55:38,425 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.725e+02 1.978e+02 2.285e+02 3.639e+02, threshold=3.956e+02, percent-clipped=0.0 2023-10-10 20:55:50,908 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=479710.0, ans=0.125 2023-10-10 20:55:54,310 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.38 vs. limit=22.5 2023-10-10 20:56:02,323 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=479756.6666666667, ans=0.125 2023-10-10 20:56:04,980 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=479756.6666666667, ans=0.125 2023-10-10 20:56:05,515 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.23 vs. limit=15.0 2023-10-10 20:56:29,620 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=479850.0, ans=0.125 2023-10-10 20:56:31,631 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.57 vs. 
limit=10.0 2023-10-10 20:56:32,425 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=479850.0, ans=0.125 2023-10-10 20:56:35,999 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=479896.6666666667, ans=0.2 2023-10-10 20:56:45,575 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 20:57:17,672 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=480036.6666666667, ans=0.0 2023-10-10 20:57:38,584 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.90 vs. limit=15.0 2023-10-10 20:57:46,363 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=480130.0, ans=0.125 2023-10-10 20:57:49,518 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=480130.0, ans=0.0 2023-10-10 20:57:52,720 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.358e+02 1.627e+02 1.808e+02 2.015e+02 2.618e+02, threshold=3.616e+02, percent-clipped=0.0 2023-10-10 20:58:02,414 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=12.0 2023-10-10 20:58:37,253 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.66 vs. limit=22.5 2023-10-10 20:58:38,650 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=480316.6666666667, ans=0.125 2023-10-10 20:59:13,801 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=480456.6666666667, ans=0.0 2023-10-10 20:59:55,454 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.683e+02 1.914e+02 2.576e+02 4.052e+02, threshold=3.828e+02, percent-clipped=3.0 2023-10-10 20:59:57,420 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=480643.3333333333, ans=0.1 2023-10-10 20:59:58,564 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=480643.3333333333, ans=0.125 2023-10-10 21:00:00,345 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=480643.3333333333, ans=0.125 2023-10-10 21:00:13,724 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=480690.0, ans=0.0 2023-10-10 21:00:16,696 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=480690.0, ans=0.125 2023-10-10 21:00:19,715 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.85 vs. 
limit=22.5 2023-10-10 21:00:39,200 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=480783.3333333333, ans=0.035 2023-10-10 21:00:51,871 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=480876.6666666667, ans=0.125 2023-10-10 21:00:56,193 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 21:01:14,271 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=480923.3333333333, ans=0.125 2023-10-10 21:01:31,362 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=481016.6666666667, ans=0.2 2023-10-10 21:01:38,874 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.10 vs. limit=15.0 2023-10-10 21:01:40,232 INFO [train.py:1031] (0/4) Epoch 8, batch 7500, loss[loss=0.2142, simple_loss=0.2934, pruned_loss=0.06745, over 16566.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.3005, pruned_loss=0.06395, over 32065070.28 frames. ], batch size: 56, lr: 4.56e-03, grad_scale: 16.0 2023-10-10 21:01:42,913 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.78 vs. limit=22.5 2023-10-10 21:01:45,920 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.49 vs. limit=12.0 2023-10-10 21:01:50,900 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.370e+02 1.703e+02 1.901e+02 2.198e+02 3.016e+02, threshold=3.802e+02, percent-clipped=0.0 2023-10-10 21:02:19,529 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=481203.3333333333, ans=0.95 2023-10-10 21:02:28,004 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=481250.0, ans=0.125 2023-10-10 21:02:39,866 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=481296.6666666667, ans=0.125 2023-10-10 21:03:14,622 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=481390.0, ans=0.125 2023-10-10 21:03:17,598 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=481436.6666666667, ans=0.1 2023-10-10 21:03:21,092 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=481436.6666666667, ans=0.1 2023-10-10 21:03:31,539 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=481483.3333333333, ans=0.2 2023-10-10 21:03:41,622 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=481530.0, ans=0.0 2023-10-10 21:03:45,349 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=481530.0, ans=0.0 2023-10-10 21:03:48,376 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.309e+02 1.641e+02 1.846e+02 2.076e+02 2.938e+02, threshold=3.693e+02, percent-clipped=0.0 
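The recurring [optim.py:471] lines summarize the optimizer's gradient-clipping statistics: five quantiles (min, 25%, median, 75%, max) of recently observed gradient norms, the clipping threshold, and the share of recent updates that were clipped. In every entry the threshold equals Clipping_scale times the logged median, up to rounding; e.g. in the 20:57:52 entry above, 2.0 * 1.808e+02 = 3.616e+02. Below is a minimal sketch of that bookkeeping over a window of per-batch norms; clipping_stats, recent_norms, and clipping_scale are illustrative names for this sketch, not icefall's actual internals, and the exact window the optimizer averages over is internal to it.

import torch

def clipping_stats(recent_norms: torch.Tensor, clipping_scale: float = 2.0):
    # Quantiles at 0/25/50/75/100%, matching the five logged values.
    q = torch.quantile(recent_norms,
                       torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * q[2].item()  # Clipping_scale * median
    # Fraction of recent updates whose norm exceeded the threshold.
    percent_clipped = 100.0 * (recent_norms > threshold).float().mean().item()
    return q.tolist(), threshold, percent_clipped

# Checking against the 20:57:52 entry above:
norms = torch.tensor([135.8, 162.7, 180.8, 201.5, 261.8])
quartiles, threshold, pct = clipping_stats(norms)
print(threshold, pct)  # ~361.6, 0.0 -- matches threshold=3.616e+02, percent-clipped=0.0

The grad_scale values on the train.py loss lines (16.0, 32.0, ...) are presumably the separate AMP loss-scale (the run has use_fp16 enabled) and are unrelated to this clipping threshold.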
2023-10-10 21:03:53,661 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=481576.6666666667, ans=0.0 2023-10-10 21:04:08,120 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=481623.3333333333, ans=0.125 2023-10-10 21:04:15,720 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=481623.3333333333, ans=0.1 2023-10-10 21:04:26,777 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=481670.0, ans=0.0 2023-10-10 21:04:37,256 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=481716.6666666667, ans=0.0 2023-10-10 21:04:41,637 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=481716.6666666667, ans=0.0 2023-10-10 21:04:41,857 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.74 vs. limit=12.0 2023-10-10 21:04:58,115 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=481763.3333333333, ans=0.025 2023-10-10 21:05:18,706 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.49 vs. limit=22.5 2023-10-10 21:05:20,340 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=481856.6666666667, ans=0.0 2023-10-10 21:05:25,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=481903.3333333333, ans=0.125 2023-10-10 21:05:27,368 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=481903.3333333333, ans=0.1 2023-10-10 21:05:27,482 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=481903.3333333333, ans=0.125 2023-10-10 21:05:34,168 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=481903.3333333333, ans=0.1 2023-10-10 21:05:39,497 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=481950.0, ans=0.125 2023-10-10 21:05:44,929 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=481950.0, ans=0.125 2023-10-10 21:05:48,089 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.24 vs. 
limit=15.0 2023-10-10 21:06:00,774 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.301e+02 1.738e+02 1.959e+02 2.216e+02 3.155e+02, threshold=3.918e+02, percent-clipped=0.0 2023-10-10 21:06:17,178 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=482090.0, ans=0.1 2023-10-10 21:06:17,396 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=482090.0, ans=15.0 2023-10-10 21:06:25,402 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.80 vs. limit=6.0 2023-10-10 21:06:39,813 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=482183.3333333333, ans=0.2 2023-10-10 21:06:51,583 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=482230.0, ans=0.125 2023-10-10 21:06:53,939 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=5.88 vs. limit=15.0 2023-10-10 21:06:54,571 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=482230.0, ans=0.1 2023-10-10 21:07:20,289 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=482323.3333333333, ans=0.125 2023-10-10 21:07:21,629 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.93 vs. limit=10.0 2023-10-10 21:07:39,950 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=482416.6666666667, ans=0.0 2023-10-10 21:07:57,516 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.288e+02 1.783e+02 2.049e+02 2.311e+02 3.170e+02, threshold=4.097e+02, percent-clipped=0.0 2023-10-10 21:08:06,164 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=482510.0, ans=0.0 2023-10-10 21:08:06,223 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=482510.0, ans=0.0 2023-10-10 21:08:07,411 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=482510.0, ans=0.1 2023-10-10 21:08:18,748 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.06 vs. limit=15.0 2023-10-10 21:08:42,940 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=482650.0, ans=0.125 2023-10-10 21:08:53,759 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.25 vs. 
limit=15.0 2023-10-10 21:09:09,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=482743.3333333333, ans=0.1 2023-10-10 21:09:11,851 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=482790.0, ans=0.125 2023-10-10 21:09:58,481 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.406e+02 1.688e+02 1.853e+02 2.144e+02 3.308e+02, threshold=3.705e+02, percent-clipped=0.0 2023-10-10 21:10:23,899 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.24 vs. limit=6.0 2023-10-10 21:10:48,909 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=483163.3333333333, ans=0.125 2023-10-10 21:10:55,797 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=483163.3333333333, ans=0.1 2023-10-10 21:11:00,489 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=483210.0, ans=0.0 2023-10-10 21:11:33,747 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-10 21:11:36,505 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.81 vs. limit=6.0 2023-10-10 21:11:37,569 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.64 vs. limit=15.0 2023-10-10 21:11:40,304 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.07 vs. limit=15.0 2023-10-10 21:11:44,075 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 21:11:46,536 INFO [train.py:1031] (0/4) Epoch 8, batch 8000, loss[loss=0.1982, simple_loss=0.2896, pruned_loss=0.05343, over 16646.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2997, pruned_loss=0.06328, over 32219780.76 frames. ], batch size: 61, lr: 4.55e-03, grad_scale: 32.0 2023-10-10 21:11:57,433 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.313e+02 1.553e+02 1.723e+02 1.912e+02 2.996e+02, threshold=3.446e+02, percent-clipped=0.0 2023-10-10 21:12:06,991 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=483443.3333333333, ans=0.0 2023-10-10 21:12:12,517 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.29 vs. 
limit=15.0 2023-10-10 21:12:52,036 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=483630.0, ans=0.125 2023-10-10 21:12:53,745 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=483676.6666666667, ans=0.125 2023-10-10 21:12:54,712 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=483676.6666666667, ans=0.125 2023-10-10 21:12:55,629 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=483676.6666666667, ans=0.125 2023-10-10 21:12:59,481 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=483676.6666666667, ans=0.125 2023-10-10 21:13:27,237 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=483816.6666666667, ans=0.0 2023-10-10 21:13:33,855 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=483816.6666666667, ans=0.125 2023-10-10 21:13:41,961 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=483863.3333333333, ans=0.0 2023-10-10 21:13:47,720 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.268e+02 1.855e+02 2.088e+02 2.446e+02 4.292e+02, threshold=4.177e+02, percent-clipped=2.0 2023-10-10 21:13:47,942 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=483910.0, ans=0.125 2023-10-10 21:14:04,979 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=483956.6666666667, ans=0.125 2023-10-10 21:14:09,236 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=484003.3333333333, ans=0.125 2023-10-10 21:14:44,611 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=484096.6666666667, ans=0.0 2023-10-10 21:15:06,212 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=484143.3333333333, ans=0.125 2023-10-10 21:15:26,832 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=484236.6666666667, ans=0.125 2023-10-10 21:15:29,437 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.84 vs. 
limit=22.5 2023-10-10 21:15:34,973 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=484283.3333333333, ans=0.015 2023-10-10 21:15:54,635 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=484330.0, ans=0.125 2023-10-10 21:16:01,422 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.722e+02 1.890e+02 2.150e+02 3.149e+02, threshold=3.780e+02, percent-clipped=0.0 2023-10-10 21:16:04,526 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=484376.6666666667, ans=0.125 2023-10-10 21:16:11,888 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=484376.6666666667, ans=0.2 2023-10-10 21:16:14,176 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=484423.3333333333, ans=0.125 2023-10-10 21:16:15,230 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=8.629e-02 2023-10-10 21:16:39,020 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=484516.6666666667, ans=0.0 2023-10-10 21:16:43,198 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=484516.6666666667, ans=0.125 2023-10-10 21:17:03,409 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.18 vs. limit=15.0 2023-10-10 21:17:19,661 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=484656.6666666667, ans=0.0 2023-10-10 21:17:30,092 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.57 vs. limit=15.0 2023-10-10 21:17:55,119 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=484796.6666666667, ans=0.125 2023-10-10 21:17:55,149 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=484796.6666666667, ans=0.2 2023-10-10 21:17:57,048 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=484796.6666666667, ans=0.2 2023-10-10 21:17:58,943 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.432e+02 1.629e+02 1.751e+02 1.999e+02 2.601e+02, threshold=3.502e+02, percent-clipped=0.0 2023-10-10 21:18:35,270 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=484983.3333333333, ans=0.125 2023-10-10 21:18:48,639 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=485030.0, ans=0.125 2023-10-10 21:18:48,905 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.27 vs. 
limit=15.0 2023-10-10 21:18:52,318 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=485076.6666666667, ans=0.0 2023-10-10 21:18:53,234 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=485076.6666666667, ans=0.0 2023-10-10 21:19:02,019 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=485123.3333333333, ans=0.1 2023-10-10 21:19:15,117 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=485170.0, ans=0.125 2023-10-10 21:19:15,915 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=485170.0, ans=0.0 2023-10-10 21:19:16,918 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=485170.0, ans=0.2 2023-10-10 21:19:26,739 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.52 vs. limit=5.0 2023-10-10 21:19:31,053 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=485216.6666666667, ans=0.125 2023-10-10 21:19:33,994 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=485216.6666666667, ans=0.0 2023-10-10 21:19:36,052 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=485216.6666666667, ans=0.125 2023-10-10 21:19:51,533 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.384e+02 1.685e+02 1.865e+02 2.061e+02 2.637e+02, threshold=3.731e+02, percent-clipped=0.0 2023-10-10 21:19:55,054 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=485310.0, ans=0.0 2023-10-10 21:19:57,605 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-104000.pt 2023-10-10 21:20:12,935 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=485356.6666666667, ans=0.2 2023-10-10 21:20:23,895 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=485403.3333333333, ans=0.2 2023-10-10 21:20:26,541 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=485403.3333333333, ans=0.125 2023-10-10 21:20:49,735 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=485496.6666666667, ans=0.2 2023-10-10 21:21:16,167 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.61 vs. limit=15.0 2023-10-10 21:21:21,347 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=485636.6666666667, ans=0.125 2023-10-10 21:21:30,705 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=485683.3333333333, ans=0.0 2023-10-10 21:21:44,000 INFO [train.py:1031] (0/4) Epoch 8, batch 8500, loss[loss=0.2189, simple_loss=0.3021, pruned_loss=0.06786, over 16832.00 frames. 
], tot_loss[loss=0.2131, simple_loss=0.2998, pruned_loss=0.06316, over 32348812.84 frames. ], batch size: 188, lr: 4.54e-03, grad_scale: 32.0 2023-10-10 21:21:54,855 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.448e+02 1.773e+02 2.025e+02 2.320e+02 3.386e+02, threshold=4.049e+02, percent-clipped=0.0 2023-10-10 21:22:11,570 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=485823.3333333333, ans=0.125 2023-10-10 21:22:12,551 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=485823.3333333333, ans=0.125 2023-10-10 21:22:17,940 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=485870.0, ans=0.1 2023-10-10 21:22:30,917 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=485916.6666666667, ans=0.1 2023-10-10 21:22:38,317 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=485916.6666666667, ans=0.125 2023-10-10 21:22:39,439 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=485916.6666666667, ans=0.2 2023-10-10 21:23:08,510 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.92 vs. limit=15.0 2023-10-10 21:23:59,005 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.91 vs. limit=15.0 2023-10-10 21:24:02,172 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.348e+02 1.722e+02 2.039e+02 2.371e+02 3.182e+02, threshold=4.077e+02, percent-clipped=0.0 2023-10-10 21:24:21,398 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=486290.0, ans=0.0 2023-10-10 21:24:39,585 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=486383.3333333333, ans=0.0 2023-10-10 21:25:15,550 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=486523.3333333333, ans=0.125 2023-10-10 21:25:24,253 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=486523.3333333333, ans=0.2 2023-10-10 21:25:46,374 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=486616.6666666667, ans=0.0 2023-10-10 21:25:48,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=486616.6666666667, ans=0.125 2023-10-10 21:25:52,925 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=486663.3333333333, ans=0.125 2023-10-10 21:26:04,177 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.227e+02 1.600e+02 1.762e+02 1.915e+02 2.781e+02, threshold=3.523e+02, percent-clipped=0.0 2023-10-10 21:26:27,263 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.42 vs. 
limit=15.0 2023-10-10 21:26:41,492 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=4.24 vs. limit=12.0 2023-10-10 21:26:47,574 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.58 vs. limit=6.0 2023-10-10 21:26:48,004 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=486850.0, ans=0.015 2023-10-10 21:26:53,186 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=486896.6666666667, ans=0.125 2023-10-10 21:27:03,238 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.35 vs. limit=22.5 2023-10-10 21:27:18,848 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=486990.0, ans=0.125 2023-10-10 21:27:20,678 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=486990.0, ans=0.1 2023-10-10 21:27:48,952 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=487083.3333333333, ans=0.125 2023-10-10 21:27:53,944 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=487130.0, ans=0.125 2023-10-10 21:28:01,460 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=487130.0, ans=0.0 2023-10-10 21:28:07,816 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.300e+02 1.597e+02 1.815e+02 2.037e+02 3.283e+02, threshold=3.630e+02, percent-clipped=0.0 2023-10-10 21:28:08,167 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=487176.6666666667, ans=0.0 2023-10-10 21:28:34,806 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=487270.0, ans=0.125 2023-10-10 21:28:39,054 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.14 vs. limit=15.0 2023-10-10 21:29:03,929 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=487410.0, ans=0.1 2023-10-10 21:29:07,300 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=487410.0, ans=0.1 2023-10-10 21:29:16,419 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.64 vs. 
limit=22.5 2023-10-10 21:29:23,174 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=487503.3333333333, ans=0.0 2023-10-10 21:29:31,497 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=487503.3333333333, ans=0.1 2023-10-10 21:29:50,729 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=487596.6666666667, ans=0.0 2023-10-10 21:29:59,901 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.291e+02 1.748e+02 1.941e+02 2.174e+02 3.709e+02, threshold=3.882e+02, percent-clipped=1.0 2023-10-10 21:30:13,086 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=487690.0, ans=0.125 2023-10-10 21:30:23,919 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=487736.6666666667, ans=0.2 2023-10-10 21:30:26,925 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=487736.6666666667, ans=0.2 2023-10-10 21:30:36,575 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 21:30:38,782 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=487783.3333333333, ans=0.0 2023-10-10 21:30:42,099 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=487830.0, ans=0.05 2023-10-10 21:30:55,882 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 21:31:03,086 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=487876.6666666667, ans=0.125 2023-10-10 21:31:05,179 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=487923.3333333333, ans=0.0 2023-10-10 21:31:24,763 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=487970.0, ans=0.0 2023-10-10 21:31:30,580 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=488016.6666666667, ans=0.2 2023-10-10 21:31:36,147 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=488016.6666666667, ans=0.0 2023-10-10 21:31:37,793 INFO [train.py:1031] (0/4) Epoch 8, batch 9000, loss[loss=0.2297, simple_loss=0.3174, pruned_loss=0.07103, over 16554.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2994, pruned_loss=0.06305, over 32460023.48 frames. 
], batch size: 66, lr: 4.53e-03, grad_scale: 32.0 2023-10-10 21:31:38,969 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=488063.3333333333, ans=0.0 2023-10-10 21:31:49,638 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.332e+02 1.802e+02 1.982e+02 2.288e+02 3.379e+02, threshold=3.964e+02, percent-clipped=0.0 2023-10-10 21:32:03,040 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=488156.6666666667, ans=0.125 2023-10-10 21:32:12,130 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=488203.3333333333, ans=0.125 2023-10-10 21:32:13,063 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=488203.3333333333, ans=0.0 2023-10-10 21:32:32,153 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=7.43 vs. limit=15.0 2023-10-10 21:32:43,479 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=488343.3333333333, ans=0.1 2023-10-10 21:32:47,580 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.67 vs. limit=6.0 2023-10-10 21:32:48,273 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=488343.3333333333, ans=0.125 2023-10-10 21:32:58,189 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 21:33:06,958 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=488436.6666666667, ans=0.95 2023-10-10 21:33:08,658 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=488436.6666666667, ans=0.125 2023-10-10 21:33:11,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=488436.6666666667, ans=0.2 2023-10-10 21:33:27,141 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=488530.0, ans=0.125 2023-10-10 21:33:29,148 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=488530.0, ans=0.0 2023-10-10 21:33:30,049 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=488530.0, ans=0.0 2023-10-10 21:33:36,507 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.280e+02 1.683e+02 1.925e+02 2.213e+02 3.576e+02, threshold=3.851e+02, percent-clipped=0.0 2023-10-10 21:33:38,729 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=488576.6666666667, ans=0.0 2023-10-10 21:34:08,928 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=488716.6666666667, ans=0.2 2023-10-10 21:34:19,939 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=488763.3333333333, ans=0.0 2023-10-10 21:34:34,164 INFO [scaling.py:979] (0/4) 
Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.55 vs. limit=6.0 2023-10-10 21:34:42,331 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=488856.6666666667, ans=0.125 2023-10-10 21:34:48,944 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=488903.3333333333, ans=0.0 2023-10-10 21:34:54,749 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=488903.3333333333, ans=0.2 2023-10-10 21:34:54,856 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=488903.3333333333, ans=0.125 2023-10-10 21:34:56,846 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.95 vs. limit=6.0 2023-10-10 21:35:13,035 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.72 vs. limit=15.0 2023-10-10 21:35:22,586 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 1.814e+02 1.967e+02 2.239e+02 3.286e+02, threshold=3.934e+02, percent-clipped=0.0 2023-10-10 21:35:24,454 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=489043.3333333333, ans=0.1 2023-10-10 21:35:38,628 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=489090.0, ans=0.125 2023-10-10 21:35:49,647 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=489136.6666666667, ans=0.07 2023-10-10 21:35:52,439 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=489183.3333333333, ans=0.125 2023-10-10 21:35:56,280 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=489183.3333333333, ans=0.125 2023-10-10 21:36:00,652 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=489183.3333333333, ans=0.1 2023-10-10 21:36:00,720 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=489183.3333333333, ans=0.0 2023-10-10 21:36:33,875 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.44 vs. 
limit=22.5 2023-10-10 21:37:07,436 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.812e+02 1.975e+02 2.235e+02 3.214e+02, threshold=3.950e+02, percent-clipped=0.0 2023-10-10 21:37:15,113 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=489510.0, ans=0.125 2023-10-10 21:37:15,118 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=489510.0, ans=0.1 2023-10-10 21:37:20,597 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=489556.6666666667, ans=0.125 2023-10-10 21:37:29,757 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=489603.3333333333, ans=0.1 2023-10-10 21:37:42,634 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=489650.0, ans=0.0 2023-10-10 21:38:43,401 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=489836.6666666667, ans=0.125 2023-10-10 21:38:51,766 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=489883.3333333333, ans=0.125 2023-10-10 21:39:12,386 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.728e+02 1.912e+02 2.225e+02 3.045e+02, threshold=3.825e+02, percent-clipped=0.0 2023-10-10 21:39:18,167 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.86 vs. limit=15.0 2023-10-10 21:39:27,626 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=490023.3333333333, ans=0.2 2023-10-10 21:39:32,722 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=490070.0, ans=0.0 2023-10-10 21:39:33,787 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=490070.0, ans=0.2 2023-10-10 21:40:08,991 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=490210.0, ans=0.0 2023-10-10 21:40:14,629 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.62 vs. limit=15.0 2023-10-10 21:40:17,729 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.03 vs. limit=15.0 2023-10-10 21:40:28,919 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=490256.6666666667, ans=0.125 2023-10-10 21:40:33,778 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=490303.3333333333, ans=0.0 2023-10-10 21:40:58,654 INFO [train.py:1031] (0/4) Epoch 8, batch 9500, loss[loss=0.2696, simple_loss=0.3317, pruned_loss=0.1038, over 15625.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.3, pruned_loss=0.06327, over 32547139.61 frames. 
], batch size: 350, lr: 4.52e-03, grad_scale: 16.0 2023-10-10 21:41:11,912 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.458e+02 1.762e+02 2.058e+02 2.509e+02 4.568e+02, threshold=4.117e+02, percent-clipped=7.0 2023-10-10 21:41:43,505 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=490583.3333333333, ans=0.0 2023-10-10 21:41:44,804 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=490583.3333333333, ans=0.1 2023-10-10 21:41:56,759 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.97 vs. limit=15.0 2023-10-10 21:41:59,361 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=490630.0, ans=0.07 2023-10-10 21:42:08,926 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.60 vs. limit=10.0 2023-10-10 21:42:16,189 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.87 vs. limit=15.0 2023-10-10 21:42:36,199 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=490770.0, ans=0.0 2023-10-10 21:43:00,435 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=490863.3333333333, ans=0.125 2023-10-10 21:43:04,875 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.388e+02 1.705e+02 1.852e+02 2.033e+02 2.655e+02, threshold=3.705e+02, percent-clipped=0.0 2023-10-10 21:43:57,791 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=491096.6666666667, ans=0.125 2023-10-10 21:43:59,203 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=491096.6666666667, ans=0.0 2023-10-10 21:43:59,537 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.43 vs. limit=15.0 2023-10-10 21:44:07,588 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=491143.3333333333, ans=0.125 2023-10-10 21:44:32,064 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=491236.6666666667, ans=0.125 2023-10-10 21:44:39,328 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=491283.3333333333, ans=0.1 2023-10-10 21:44:40,522 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.19 vs. 
limit=10.0 2023-10-10 21:44:48,372 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, batch_count=491330.0, ans=0.05 2023-10-10 21:44:57,395 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.354e+02 1.689e+02 1.867e+02 2.130e+02 2.848e+02, threshold=3.734e+02, percent-clipped=0.0 2023-10-10 21:45:05,421 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=491376.6666666667, ans=0.125 2023-10-10 21:45:11,846 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=491423.3333333333, ans=0.125 2023-10-10 21:45:35,336 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=491516.6666666667, ans=0.0 2023-10-10 21:46:01,642 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=491656.6666666667, ans=0.2 2023-10-10 21:46:21,021 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=491703.3333333333, ans=0.0 2023-10-10 21:46:29,078 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.86 vs. limit=15.0 2023-10-10 21:46:54,250 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.334e+02 1.654e+02 1.829e+02 2.208e+02 3.121e+02, threshold=3.658e+02, percent-clipped=0.0 2023-10-10 21:47:00,461 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=491843.3333333333, ans=0.125 2023-10-10 21:47:14,566 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=491936.6666666667, ans=0.0 2023-10-10 21:47:17,521 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=491936.6666666667, ans=0.0 2023-10-10 21:47:17,773 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.09 vs. limit=22.5 2023-10-10 21:47:23,927 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=491936.6666666667, ans=0.125 2023-10-10 21:47:52,829 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=492076.6666666667, ans=0.09899494936611666 2023-10-10 21:47:58,104 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=492076.6666666667, ans=0.125 2023-10-10 21:48:06,579 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.35 vs. 
limit=15.0 2023-10-10 21:48:25,142 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=492216.6666666667, ans=0.125 2023-10-10 21:48:25,981 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=492216.6666666667, ans=0.125 2023-10-10 21:48:39,879 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=492263.3333333333, ans=0.1 2023-10-10 21:48:48,822 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.765e+02 1.966e+02 2.337e+02 3.374e+02, threshold=3.933e+02, percent-clipped=0.0 2023-10-10 21:48:49,119 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=492310.0, ans=0.125 2023-10-10 21:48:56,724 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=492356.6666666667, ans=0.125 2023-10-10 21:49:20,372 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=492450.0, ans=0.125 2023-10-10 21:49:22,455 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=492450.0, ans=0.2 2023-10-10 21:49:36,601 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=492496.6666666667, ans=0.0 2023-10-10 21:49:38,586 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 21:49:55,413 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.58 vs. limit=15.0 2023-10-10 21:50:04,116 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=492636.6666666667, ans=0.125 2023-10-10 21:50:05,754 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=492636.6666666667, ans=0.125 2023-10-10 21:50:17,148 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=492683.3333333333, ans=0.05 2023-10-10 21:50:18,305 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=492683.3333333333, ans=0.125 2023-10-10 21:50:22,838 INFO [train.py:1031] (0/4) Epoch 8, batch 10000, loss[loss=0.2133, simple_loss=0.3024, pruned_loss=0.0621, over 16917.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2991, pruned_loss=0.06283, over 32603366.83 frames. 
], batch size: 123, lr: 4.50e-03, grad_scale: 32.0 2023-10-10 21:50:35,639 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.405e+02 1.656e+02 1.863e+02 2.129e+02 3.621e+02, threshold=3.726e+02, percent-clipped=0.0 2023-10-10 21:50:48,082 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 21:50:50,082 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=492823.3333333333, ans=0.1 2023-10-10 21:50:52,870 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=492870.0, ans=0.2 2023-10-10 21:51:04,395 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=492916.6666666667, ans=0.125 2023-10-10 21:51:15,936 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.16 vs. limit=15.0 2023-10-10 21:51:21,349 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=492963.3333333333, ans=0.2 2023-10-10 21:51:37,032 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=493056.6666666667, ans=0.125 2023-10-10 21:51:58,986 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=493103.3333333333, ans=0.07 2023-10-10 21:52:21,781 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=493196.6666666667, ans=0.125 2023-10-10 21:52:30,847 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.822e+02 2.037e+02 2.406e+02 3.425e+02, threshold=4.074e+02, percent-clipped=0.0 2023-10-10 21:52:44,792 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.81 vs. limit=22.5 2023-10-10 21:52:49,954 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=493290.0, ans=0.05 2023-10-10 21:52:56,042 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=493336.6666666667, ans=0.125 2023-10-10 21:52:57,290 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.78 vs. limit=10.0 2023-10-10 21:53:15,223 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=493383.3333333333, ans=0.0 2023-10-10 21:53:19,061 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=493430.0, ans=0.0 2023-10-10 21:53:44,066 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=493523.3333333333, ans=0.0 2023-10-10 21:53:56,617 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.80 vs. 
limit=6.0 2023-10-10 21:53:59,545 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 21:54:29,159 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.302e+02 1.664e+02 1.870e+02 2.090e+02 3.364e+02, threshold=3.741e+02, percent-clipped=0.0 2023-10-10 21:54:32,597 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=493710.0, ans=0.125 2023-10-10 21:54:41,808 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=493756.6666666667, ans=0.125 2023-10-10 21:54:47,897 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=493756.6666666667, ans=0.125 2023-10-10 21:55:04,933 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=493850.0, ans=0.125 2023-10-10 21:55:05,034 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=493850.0, ans=0.0 2023-10-10 21:55:18,949 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 21:55:22,928 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=493943.3333333333, ans=0.125 2023-10-10 21:55:25,184 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.69 vs. limit=22.5 2023-10-10 21:55:35,950 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=493990.0, ans=0.1 2023-10-10 21:55:38,990 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=493990.0, ans=0.1 2023-10-10 21:55:41,944 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=493990.0, ans=0.125 2023-10-10 21:55:44,373 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=493990.0, ans=0.125 2023-10-10 21:56:01,364 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=494083.3333333333, ans=0.125 2023-10-10 21:56:06,189 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=494083.3333333333, ans=0.2 2023-10-10 21:56:14,293 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=494130.0, ans=0.2 2023-10-10 21:56:26,139 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.254e+02 1.723e+02 1.942e+02 2.179e+02 3.073e+02, threshold=3.883e+02, percent-clipped=0.0 2023-10-10 21:56:38,695 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=494223.3333333333, ans=0.125 2023-10-10 21:56:52,390 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=494270.0, ans=0.0 2023-10-10 21:57:11,533 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=494363.3333333333, ans=0.125 
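The [scaling.py:199] ScheduledFloat lines each report the current value (ans=...) of one scheduled hyperparameter at the given batch_count: dropout rates on out_proj modules, attention/conv/ff skip rates, balancer probabilities, bypass scale minima, and so on. By batch_count ~4.9e5 most have annealed to their final values (skip rates at 0.0, balancer probs at 0.125, bypass scale_min at 0.2). The [scaling.py:979] Whitening lines similarly report a module's measured whitening metric against its (sometimes itself scheduled) limit. The sketch below models only the ans= values, assuming piecewise-linear interpolation between (batch_count, value) breakpoints; that interpolation rule is an assumption about icefall's ScheduledFloat, and the breakpoints used here are illustrative, not the recipe's actual numbers.

import bisect

class PiecewiseLinearSchedule:
    """A ScheduledFloat-style value: piecewise-linear in batch_count."""

    def __init__(self, *points):
        # points: (batch_count, value) pairs, sorted by batch_count.
        self.xs = [p[0] for p in points]
        self.ys = [p[1] for p in points]

    def __call__(self, batch_count: float) -> float:
        if batch_count <= self.xs[0]:
            return self.ys[0]
        if batch_count >= self.xs[-1]:
            return self.ys[-1]
        i = bisect.bisect_right(self.xs, batch_count) - 1
        x0, x1 = self.xs[i], self.xs[i + 1]
        y0, y1 = self.ys[i], self.ys[i + 1]
        return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

# A skip rate decaying from 0.2 to 0.0 over the first 20k batches would log
# ans=0.0 at the batch_counts above, like the attention_skip_rate entries.
skip_rate = PiecewiseLinearSchedule((0.0, 0.2), (20000.0, 0.0))
print(skip_rate(494363.0))  # -> 0.0
print(skip_rate(10000.0))   # -> 0.1 (mid-schedule)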
2023-10-10 21:57:19,866 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=494363.3333333333, ans=0.0 2023-10-10 21:57:22,080 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.64 vs. limit=6.0 2023-10-10 21:57:23,802 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=494410.0, ans=0.07 2023-10-10 21:57:30,944 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.31 vs. limit=15.0 2023-10-10 21:58:06,222 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=494550.0, ans=0.0 2023-10-10 21:58:17,007 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.10 vs. limit=22.5 2023-10-10 21:58:21,086 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.85 vs. limit=22.5 2023-10-10 21:58:22,404 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.349e+02 1.608e+02 1.751e+02 1.898e+02 2.611e+02, threshold=3.502e+02, percent-clipped=0.0 2023-10-10 21:58:43,435 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=494736.6666666667, ans=0.125 2023-10-10 21:58:44,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=494736.6666666667, ans=0.125 2023-10-10 21:59:08,775 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=494830.0, ans=0.0 2023-10-10 21:59:18,105 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.87 vs. limit=15.0 2023-10-10 21:59:21,551 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=494876.6666666667, ans=0.1 2023-10-10 22:00:02,720 INFO [train.py:1031] (0/4) Epoch 8, batch 10500, loss[loss=0.2202, simple_loss=0.3035, pruned_loss=0.06845, over 17017.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2995, pruned_loss=0.0629, over 32650572.87 frames. 
], batch size: 117, lr: 4.49e-03, grad_scale: 16.0 2023-10-10 22:00:17,877 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.294e+02 1.669e+02 1.876e+02 2.117e+02 3.190e+02, threshold=3.751e+02, percent-clipped=0.0 2023-10-10 22:00:35,482 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=495203.3333333333, ans=0.125 2023-10-10 22:00:38,383 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=495203.3333333333, ans=0.125 2023-10-10 22:00:42,538 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=495203.3333333333, ans=0.125 2023-10-10 22:00:45,486 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=495250.0, ans=0.125 2023-10-10 22:01:18,260 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=495343.3333333333, ans=0.125 2023-10-10 22:01:29,729 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=495390.0, ans=0.0 2023-10-10 22:01:40,900 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=11.98 vs. limit=22.5 2023-10-10 22:01:45,799 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=495436.6666666667, ans=0.2 2023-10-10 22:02:13,513 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=495576.6666666667, ans=0.1 2023-10-10 22:02:20,341 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.383e+02 1.661e+02 1.844e+02 2.014e+02 2.750e+02, threshold=3.689e+02, percent-clipped=0.0 2023-10-10 22:02:27,785 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=495623.3333333333, ans=0.125 2023-10-10 22:02:29,910 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.03 vs. 
limit=15.0 2023-10-10 22:02:34,075 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=495623.3333333333, ans=0.125 2023-10-10 22:02:54,318 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=495716.6666666667, ans=0.0 2023-10-10 22:02:55,333 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=495716.6666666667, ans=0.0 2023-10-10 22:03:03,391 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=495763.3333333333, ans=0.125 2023-10-10 22:03:04,476 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=495763.3333333333, ans=0.2 2023-10-10 22:03:08,402 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=495763.3333333333, ans=0.0 2023-10-10 22:03:15,426 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=495810.0, ans=0.2 2023-10-10 22:03:20,367 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=495856.6666666667, ans=0.1 2023-10-10 22:03:29,659 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=495856.6666666667, ans=0.1 2023-10-10 22:03:34,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=495903.3333333333, ans=0.125 2023-10-10 22:03:40,942 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.55 vs. limit=15.0 2023-10-10 22:03:55,574 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=495950.0, ans=0.1 2023-10-10 22:03:57,846 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=495950.0, ans=0.125 2023-10-10 22:04:04,884 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=495996.6666666667, ans=0.0 2023-10-10 22:04:06,812 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=495996.6666666667, ans=0.5 2023-10-10 22:04:14,765 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.414e+02 1.588e+02 1.733e+02 2.027e+02 2.793e+02, threshold=3.466e+02, percent-clipped=0.0 2023-10-10 22:04:30,728 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=496090.0, ans=0.0 2023-10-10 22:04:33,075 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.50 vs. 
limit=22.5 2023-10-10 22:04:35,413 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=496136.6666666667, ans=0.125 2023-10-10 22:05:53,593 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=496463.3333333333, ans=0.1 2023-10-10 22:05:56,642 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=496463.3333333333, ans=0.0 2023-10-10 22:06:04,127 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=496510.0, ans=0.2 2023-10-10 22:06:06,510 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.478e+02 1.889e+02 2.123e+02 2.345e+02 3.364e+02, threshold=4.245e+02, percent-clipped=0.0 2023-10-10 22:06:22,662 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.64 vs. limit=12.0 2023-10-10 22:06:31,874 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.14 vs. limit=15.0 2023-10-10 22:06:39,866 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=496650.0, ans=0.125 2023-10-10 22:06:41,763 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=496650.0, ans=0.1 2023-10-10 22:06:50,953 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=496696.6666666667, ans=0.125 2023-10-10 22:07:02,635 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=496743.3333333333, ans=0.125 2023-10-10 22:07:14,028 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.09 vs. limit=10.0 2023-10-10 22:07:32,802 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.60 vs. 
limit=10.0 2023-10-10 22:07:44,298 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=496930.0, ans=0.0 2023-10-10 22:07:58,465 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.315e+02 1.646e+02 1.893e+02 2.254e+02 3.198e+02, threshold=3.786e+02, percent-clipped=0.0 2023-10-10 22:08:03,282 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=496976.6666666667, ans=0.125 2023-10-10 22:08:10,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=497023.3333333333, ans=0.0 2023-10-10 22:08:12,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=497023.3333333333, ans=0.2 2023-10-10 22:08:29,320 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=497116.6666666667, ans=0.0 2023-10-10 22:08:35,084 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=497116.6666666667, ans=0.125 2023-10-10 22:08:58,892 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.00 vs. limit=15.0 2023-10-10 22:09:08,376 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=497256.6666666667, ans=0.125 2023-10-10 22:09:37,061 INFO [train.py:1031] (0/4) Epoch 8, batch 11000, loss[loss=0.2389, simple_loss=0.3192, pruned_loss=0.07929, over 16446.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2993, pruned_loss=0.06289, over 32678271.24 frames. ], batch size: 266, lr: 4.48e-03, grad_scale: 32.0 2023-10-10 22:09:50,731 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.57 vs. limit=22.5 2023-10-10 22:09:51,387 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=497443.3333333333, ans=0.0 2023-10-10 22:09:52,469 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.414e+02 1.938e+02 2.212e+02 2.515e+02 3.317e+02, threshold=4.424e+02, percent-clipped=0.0 2023-10-10 22:09:55,856 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=497443.3333333333, ans=0.1 2023-10-10 22:09:55,967 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=497443.3333333333, ans=0.125 2023-10-10 22:10:03,205 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=497490.0, ans=0.0 2023-10-10 22:10:25,045 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.84 vs. 
limit=6.0 2023-10-10 22:10:28,114 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=497583.3333333333, ans=0.2 2023-10-10 22:10:29,348 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=497583.3333333333, ans=0.0 2023-10-10 22:10:30,397 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=497583.3333333333, ans=0.0 2023-10-10 22:10:36,813 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 22:10:45,238 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=497676.6666666667, ans=0.0 2023-10-10 22:10:58,373 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=497723.3333333333, ans=0.125 2023-10-10 22:11:00,995 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=497723.3333333333, ans=0.125 2023-10-10 22:11:02,096 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=497723.3333333333, ans=0.025 2023-10-10 22:11:02,155 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=497723.3333333333, ans=0.0 2023-10-10 22:11:20,125 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=497816.6666666667, ans=0.1 2023-10-10 22:11:47,944 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=497910.0, ans=0.0 2023-10-10 22:11:53,600 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.374e+02 1.734e+02 1.925e+02 2.226e+02 3.161e+02, threshold=3.850e+02, percent-clipped=0.0 2023-10-10 22:12:33,364 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.28 vs. limit=22.5 2023-10-10 22:13:20,180 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.92 vs. 
limit=15.0 2023-10-10 22:13:44,443 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=498376.6666666667, ans=0.125 2023-10-10 22:13:45,997 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.whiten.whitening_limit, batch_count=498376.6666666667, ans=12.0 2023-10-10 22:13:46,402 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.241e+02 1.580e+02 1.786e+02 2.051e+02 2.993e+02, threshold=3.573e+02, percent-clipped=0.0 2023-10-10 22:13:50,252 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=498376.6666666667, ans=0.0 2023-10-10 22:14:00,383 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=498423.3333333333, ans=0.05 2023-10-10 22:14:05,497 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=498470.0, ans=0.125 2023-10-10 22:14:17,529 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=498516.6666666667, ans=0.125 2023-10-10 22:14:18,178 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=498516.6666666667, ans=0.125 2023-10-10 22:14:31,792 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 22:14:32,122 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.13 vs. limit=10.0 2023-10-10 22:14:38,469 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=498563.3333333333, ans=0.1 2023-10-10 22:14:38,522 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=498563.3333333333, ans=0.95 2023-10-10 22:15:00,795 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 22:15:07,376 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=498703.3333333333, ans=0.125 2023-10-10 22:15:25,278 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=498750.0, ans=0.0 2023-10-10 22:15:27,486 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.33 vs. 
limit=10.0 2023-10-10 22:15:34,089 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=498796.6666666667, ans=0.2 2023-10-10 22:15:49,493 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.780e+02 2.123e+02 2.745e+02 4.022e+02, threshold=4.246e+02, percent-clipped=3.0 2023-10-10 22:15:53,416 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=498843.3333333333, ans=0.1 2023-10-10 22:15:55,398 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=498890.0, ans=22.5 2023-10-10 22:16:08,788 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=498936.6666666667, ans=0.0 2023-10-10 22:16:14,011 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.68 vs. limit=22.5 2023-10-10 22:16:37,540 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=499030.0, ans=0.07 2023-10-10 22:16:37,561 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=499030.0, ans=0.0 2023-10-10 22:16:54,722 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=499123.3333333333, ans=0.125 2023-10-10 22:16:58,666 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_ff2.min_abs, batch_count=499123.3333333333, ans=0.1 2023-10-10 22:17:10,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=499170.0, ans=0.2 2023-10-10 22:17:12,913 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.80 vs. limit=12.0 2023-10-10 22:17:13,647 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=499216.6666666667, ans=0.0 2023-10-10 22:17:15,583 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=499216.6666666667, ans=0.2 2023-10-10 22:17:22,526 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=499216.6666666667, ans=0.04949747468305833 2023-10-10 22:17:25,899 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=499263.3333333333, ans=0.1 2023-10-10 22:17:36,312 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=499310.0, ans=0.125 2023-10-10 22:17:41,145 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.353e+02 1.708e+02 1.901e+02 2.151e+02 3.057e+02, threshold=3.803e+02, percent-clipped=0.0 2023-10-10 22:18:32,557 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=499496.6666666667, ans=0.125 2023-10-10 22:19:12,010 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.32 vs. 
limit=15.0 2023-10-10 22:19:26,945 INFO [train.py:1031] (0/4) Epoch 8, batch 11500, loss[loss=0.2057, simple_loss=0.2952, pruned_loss=0.05813, over 16839.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2988, pruned_loss=0.06257, over 32697317.39 frames. ], batch size: 67, lr: 4.47e-03, grad_scale: 32.0 2023-10-10 22:19:39,364 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.76 vs. limit=15.0 2023-10-10 22:19:43,582 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.440e+02 1.764e+02 1.968e+02 2.369e+02 3.431e+02, threshold=3.936e+02, percent-clipped=0.0 2023-10-10 22:19:51,322 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=499823.3333333333, ans=0.0 2023-10-10 22:19:56,292 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.59 vs. limit=15.0 2023-10-10 22:20:31,530 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=499963.3333333333, ans=0.2 2023-10-10 22:20:43,884 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=500010.0, ans=0.0 2023-10-10 22:21:00,714 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.91 vs. limit=15.0 2023-10-10 22:21:06,609 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.66 vs. limit=15.0 2023-10-10 22:21:10,850 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.81 vs. limit=15.0 2023-10-10 22:21:22,814 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 22:21:38,065 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=500243.3333333333, ans=0.125 2023-10-10 22:21:38,396 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.44 vs. limit=15.0 2023-10-10 22:21:41,870 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.318e+02 1.624e+02 1.765e+02 1.938e+02 2.401e+02, threshold=3.531e+02, percent-clipped=0.0 2023-10-10 22:21:42,989 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=500243.3333333333, ans=0.125 2023-10-10 22:21:45,419 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=500243.3333333333, ans=0.2 2023-10-10 22:21:50,432 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=500290.0, ans=0.125 2023-10-10 22:21:54,516 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.85 vs. 
limit=15.0 2023-10-10 22:22:01,088 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=500336.6666666667, ans=0.1 2023-10-10 22:22:01,128 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=500336.6666666667, ans=0.95 2023-10-10 22:22:01,471 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.04 vs. limit=10.0 2023-10-10 22:22:04,916 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 22:22:13,110 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=500383.3333333333, ans=0.125 2023-10-10 22:22:31,941 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=500476.6666666667, ans=0.2 2023-10-10 22:22:34,985 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=500476.6666666667, ans=0.1 2023-10-10 22:22:58,212 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, batch_count=500570.0, ans=0.05 2023-10-10 22:22:59,180 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=500570.0, ans=0.125 2023-10-10 22:23:12,174 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.57 vs. limit=15.0 2023-10-10 22:23:14,007 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.33 vs. limit=6.0 2023-10-10 22:23:20,199 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=500663.3333333333, ans=0.125 2023-10-10 22:23:24,109 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.83 vs. limit=10.0 2023-10-10 22:23:29,988 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.320e+02 1.709e+02 1.881e+02 2.039e+02 2.801e+02, threshold=3.762e+02, percent-clipped=0.0 2023-10-10 22:23:30,394 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=500710.0, ans=0.07 2023-10-10 22:23:49,403 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.514e-03 2023-10-10 22:24:04,564 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.17 vs. limit=15.0 2023-10-10 22:24:30,293 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.48 vs. limit=15.0 2023-10-10 22:24:43,355 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.39 vs. 
limit=15.0 2023-10-10 22:24:49,684 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.81 vs. limit=6.0 2023-10-10 22:25:05,619 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=501083.3333333333, ans=0.125 2023-10-10 22:25:07,800 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.84 vs. limit=22.5 2023-10-10 22:25:15,254 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.24 vs. limit=10.0 2023-10-10 22:25:27,571 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=501130.0, ans=0.1 2023-10-10 22:25:37,800 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.414e+02 1.775e+02 2.121e+02 2.570e+02 3.401e+02, threshold=4.242e+02, percent-clipped=0.0 2023-10-10 22:25:59,625 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=4.14 vs. limit=12.0 2023-10-10 22:26:06,373 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=501316.6666666667, ans=0.125 2023-10-10 22:26:12,986 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=501316.6666666667, ans=0.125 2023-10-10 22:26:20,083 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.15 vs. limit=22.5 2023-10-10 22:26:31,083 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=501410.0, ans=0.0 2023-10-10 22:26:36,652 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=501410.0, ans=0.0 2023-10-10 22:26:37,637 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=501410.0, ans=0.0 2023-10-10 22:26:58,854 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=501503.3333333333, ans=0.05 2023-10-10 22:27:06,951 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=501550.0, ans=0.125 2023-10-10 22:27:15,181 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.98 vs. 
limit=22.5 2023-10-10 22:27:24,495 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=501596.6666666667, ans=0.05 2023-10-10 22:27:28,634 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=501643.3333333333, ans=0.0 2023-10-10 22:27:34,280 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.410e+02 1.695e+02 1.850e+02 2.006e+02 3.432e+02, threshold=3.701e+02, percent-clipped=0.0 2023-10-10 22:27:50,446 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.02 vs. limit=15.0 2023-10-10 22:27:57,847 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.62 vs. limit=15.0 2023-10-10 22:28:05,605 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=501783.3333333333, ans=0.0 2023-10-10 22:28:38,081 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=501923.3333333333, ans=0.0 2023-10-10 22:28:46,720 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=501923.3333333333, ans=0.0 2023-10-10 22:28:55,798 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.07 vs. limit=15.0 2023-10-10 22:29:11,929 INFO [train.py:1031] (0/4) Epoch 8, batch 12000, loss[loss=0.2004, simple_loss=0.2921, pruned_loss=0.05432, over 16633.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2988, pruned_loss=0.06232, over 32728529.13 frames. ], batch size: 56, lr: 4.46e-03, grad_scale: 32.0 2023-10-10 22:29:29,056 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.348e+02 1.648e+02 1.857e+02 2.122e+02 3.242e+02, threshold=3.715e+02, percent-clipped=0.0 2023-10-10 22:29:29,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=502110.0, ans=0.0 2023-10-10 22:29:39,551 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=502156.6666666667, ans=0.0 2023-10-10 22:30:21,156 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=502343.3333333333, ans=0.125 2023-10-10 22:30:35,128 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=5.18 vs. limit=12.0 2023-10-10 22:30:57,407 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=502483.3333333333, ans=0.125 2023-10-10 22:31:11,043 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=502530.0, ans=0.0 2023-10-10 22:31:11,389 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.16 vs. 
limit=22.5 2023-10-10 22:31:19,753 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.333e+02 1.652e+02 1.829e+02 2.185e+02 3.535e+02, threshold=3.659e+02, percent-clipped=0.0 2023-10-10 22:31:24,462 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=502623.3333333333, ans=0.125 2023-10-10 22:31:26,747 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 22:31:26,983 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.29 vs. limit=15.0 2023-10-10 22:31:48,993 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.43 vs. limit=12.0 2023-10-10 22:31:52,931 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=502716.6666666667, ans=0.125 2023-10-10 22:31:55,750 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff3.min_abs, batch_count=502763.3333333333, ans=0.2 2023-10-10 22:32:01,478 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=502763.3333333333, ans=0.1 2023-10-10 22:32:10,739 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=502810.0, ans=0.2 2023-10-10 22:32:40,154 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=502950.0, ans=0.125 2023-10-10 22:33:08,178 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.398e+02 1.769e+02 1.987e+02 2.395e+02 3.208e+02, threshold=3.974e+02, percent-clipped=0.0 2023-10-10 22:33:10,186 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.60 vs. limit=10.0 2023-10-10 22:33:13,672 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=503090.0, ans=0.0 2023-10-10 22:33:21,249 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=503090.0, ans=0.1 2023-10-10 22:33:29,314 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=503136.6666666667, ans=0.125 2023-10-10 22:33:43,952 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=503230.0, ans=0.0 2023-10-10 22:33:52,764 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.20 vs. limit=10.0 2023-10-10 22:34:00,384 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.45 vs. limit=6.0 2023-10-10 22:34:07,892 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=503323.3333333333, ans=0.125 2023-10-10 22:34:10,767 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.78 vs. 
limit=15.0 2023-10-10 22:34:11,520 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=503323.3333333333, ans=0.125 2023-10-10 22:34:16,383 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=503323.3333333333, ans=0.2 2023-10-10 22:34:37,932 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=503416.6666666667, ans=0.125 2023-10-10 22:34:41,130 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.19 vs. limit=10.0 2023-10-10 22:34:49,187 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=503463.3333333333, ans=0.0 2023-10-10 22:34:55,037 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=503510.0, ans=0.0 2023-10-10 22:34:58,719 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.476e+02 1.734e+02 1.985e+02 2.261e+02 4.632e+02, threshold=3.971e+02, percent-clipped=1.0 2023-10-10 22:34:59,901 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=503510.0, ans=0.125 2023-10-10 22:35:36,914 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=503696.6666666667, ans=0.1 2023-10-10 22:35:36,938 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=503696.6666666667, ans=0.95 2023-10-10 22:35:39,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=503696.6666666667, ans=0.1 2023-10-10 22:35:42,501 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=503696.6666666667, ans=0.0 2023-10-10 22:36:18,054 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=503836.6666666667, ans=0.015 2023-10-10 22:36:30,928 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=503883.3333333333, ans=0.0 2023-10-10 22:36:45,907 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.72 vs. limit=15.0 2023-10-10 22:36:53,949 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.766e+02 1.945e+02 2.216e+02 2.957e+02, threshold=3.889e+02, percent-clipped=0.0 2023-10-10 22:37:00,371 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=504023.3333333333, ans=0.125 2023-10-10 22:37:26,989 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.88 vs. 
limit=10.0 2023-10-10 22:37:27,638 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=504116.6666666667, ans=0.0 2023-10-10 22:37:27,668 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=504116.6666666667, ans=0.0 2023-10-10 22:37:37,552 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=504163.3333333333, ans=0.2 2023-10-10 22:37:43,863 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=504163.3333333333, ans=0.07 2023-10-10 22:37:50,003 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=504210.0, ans=0.125 2023-10-10 22:37:55,012 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=504210.0, ans=0.0 2023-10-10 22:37:58,659 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=504256.6666666667, ans=15.0 2023-10-10 22:38:00,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=504256.6666666667, ans=0.0 2023-10-10 22:38:09,072 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=504303.3333333333, ans=0.125 2023-10-10 22:38:14,990 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=504303.3333333333, ans=0.125 2023-10-10 22:38:26,608 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=504350.0, ans=0.125 2023-10-10 22:38:30,872 INFO [train.py:1031] (0/4) Epoch 8, batch 12500, loss[loss=0.1896, simple_loss=0.2915, pruned_loss=0.04386, over 16851.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2987, pruned_loss=0.06244, over 32764154.12 frames. ], batch size: 98, lr: 4.45e-03, grad_scale: 32.0 2023-10-10 22:38:33,180 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=504396.6666666667, ans=0.1 2023-10-10 22:38:45,640 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=504443.3333333333, ans=0.0 2023-10-10 22:38:47,876 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=504443.3333333333, ans=0.1 2023-10-10 22:38:48,672 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.630e+02 1.848e+02 2.192e+02 3.388e+02, threshold=3.696e+02, percent-clipped=0.0 2023-10-10 22:38:49,119 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=504443.3333333333, ans=0.125 2023-10-10 22:38:53,436 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=504490.0, ans=0.125 2023-10-10 22:39:03,904 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=504536.6666666667, ans=0.125 2023-10-10 22:39:08,354 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.54 vs. 
limit=12.0 2023-10-10 22:39:11,134 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=504536.6666666667, ans=0.0 2023-10-10 22:40:06,503 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=504770.0, ans=0.125 2023-10-10 22:40:10,648 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.96 vs. limit=22.5 2023-10-10 22:40:17,923 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=504816.6666666667, ans=0.125 2023-10-10 22:40:28,980 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=504863.3333333333, ans=0.125 2023-10-10 22:40:30,688 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=504910.0, ans=0.05 2023-10-10 22:40:33,374 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=504910.0, ans=0.125 2023-10-10 22:40:36,701 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=504910.0, ans=0.125 2023-10-10 22:40:38,324 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.346e+02 1.706e+02 1.964e+02 2.198e+02 3.199e+02, threshold=3.928e+02, percent-clipped=0.0 2023-10-10 22:40:57,434 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.46 vs. 
limit=6.0 2023-10-10 22:41:03,887 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=505003.3333333333, ans=0.125 2023-10-10 22:41:04,782 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=505003.3333333333, ans=0.125 2023-10-10 22:41:39,030 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=505143.3333333333, ans=0.2 2023-10-10 22:41:45,080 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=505190.0, ans=0.0 2023-10-10 22:41:48,414 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 22:41:56,187 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=505236.6666666667, ans=0.125 2023-10-10 22:41:57,042 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=505236.6666666667, ans=0.1 2023-10-10 22:41:57,823 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=505236.6666666667, ans=0.2 2023-10-10 22:42:06,758 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=505283.3333333333, ans=0.125 2023-10-10 22:42:18,418 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=505330.0, ans=0.125 2023-10-10 22:42:25,424 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.38 vs. limit=6.0 2023-10-10 22:42:26,007 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=505376.6666666667, ans=0.1 2023-10-10 22:42:29,410 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.273e+02 1.647e+02 1.780e+02 2.047e+02 2.520e+02, threshold=3.560e+02, percent-clipped=0.0 2023-10-10 22:42:36,942 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=505423.3333333333, ans=0.125 2023-10-10 22:42:42,229 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=505423.3333333333, ans=0.2 2023-10-10 22:43:01,579 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=505516.6666666667, ans=0.125 2023-10-10 22:43:02,798 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.12 vs. limit=15.0 2023-10-10 22:43:22,803 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 22:43:27,463 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.33 vs. limit=15.0 2023-10-10 22:43:27,634 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.02 vs. 
limit=15.0 2023-10-10 22:43:34,024 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=505656.6666666667, ans=0.125 2023-10-10 22:43:36,114 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=505656.6666666667, ans=0.1 2023-10-10 22:43:36,897 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=505656.6666666667, ans=0.0 2023-10-10 22:43:37,705 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=505703.3333333333, ans=0.0 2023-10-10 22:43:42,986 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.51 vs. limit=12.0 2023-10-10 22:43:51,137 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=505750.0, ans=0.125 2023-10-10 22:43:53,933 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=505750.0, ans=0.125 2023-10-10 22:43:54,012 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.94 vs. limit=15.0 2023-10-10 22:43:54,831 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=505750.0, ans=0.0 2023-10-10 22:43:58,061 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=505796.6666666667, ans=0.0 2023-10-10 22:44:02,077 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=505796.6666666667, ans=10.0 2023-10-10 22:44:03,031 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=505796.6666666667, ans=0.1 2023-10-10 22:44:17,055 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.731e+02 1.943e+02 2.160e+02 2.946e+02, threshold=3.886e+02, percent-clipped=0.0 2023-10-10 22:44:20,087 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=505890.0, ans=0.125 2023-10-10 22:44:25,974 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=505890.0, ans=0.125 2023-10-10 22:44:45,697 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=505983.3333333333, ans=0.5 2023-10-10 22:45:25,049 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=506123.3333333333, ans=0.5 2023-10-10 22:45:38,397 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=506170.0, ans=0.05 2023-10-10 22:46:03,896 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.53 vs. 
limit=15.0 2023-10-10 22:46:12,690 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.403e+02 1.695e+02 1.869e+02 2.112e+02 2.731e+02, threshold=3.738e+02, percent-clipped=0.0 2023-10-10 22:46:16,536 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=17.47 vs. limit=15.0 2023-10-10 22:46:21,788 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=506356.6666666667, ans=0.0 2023-10-10 22:46:25,834 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=506403.3333333333, ans=0.125 2023-10-10 22:47:22,503 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn1.whiten.whitening_limit, batch_count=506636.6666666667, ans=22.5 2023-10-10 22:47:35,221 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=506683.3333333333, ans=0.0 2023-10-10 22:47:41,470 INFO [train.py:1031] (0/4) Epoch 8, batch 13000, loss[loss=0.1958, simple_loss=0.2872, pruned_loss=0.05224, over 16894.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2995, pruned_loss=0.06258, over 32806340.40 frames. ], batch size: 87, lr: 4.44e-03, grad_scale: 32.0 2023-10-10 22:47:51,600 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=506776.6666666667, ans=0.2 2023-10-10 22:48:00,004 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.276e+02 1.651e+02 1.807e+02 2.072e+02 3.078e+02, threshold=3.614e+02, percent-clipped=0.0 2023-10-10 22:48:05,925 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=506823.3333333333, ans=0.125 2023-10-10 22:48:12,685 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=506823.3333333333, ans=0.125 2023-10-10 22:48:15,709 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=506823.3333333333, ans=0.025 2023-10-10 22:48:39,420 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=506916.6666666667, ans=0.09899494936611666 2023-10-10 22:48:40,285 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=506916.6666666667, ans=0.125 2023-10-10 22:48:44,746 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=506963.3333333333, ans=0.2 2023-10-10 22:49:23,150 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=507103.3333333333, ans=0.2 2023-10-10 22:49:29,586 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=507103.3333333333, ans=0.2 2023-10-10 22:49:30,567 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=507103.3333333333, ans=0.125 2023-10-10 22:49:45,183 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.76 vs. 
limit=15.0 2023-10-10 22:50:00,776 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=507243.3333333333, ans=0.1 2023-10-10 22:50:02,321 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.347e+02 1.687e+02 1.929e+02 2.219e+02 3.912e+02, threshold=3.857e+02, percent-clipped=1.0 2023-10-10 22:50:12,036 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.87 vs. limit=6.0 2023-10-10 22:50:13,215 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=507290.0, ans=0.0 2023-10-10 22:50:27,871 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=507336.6666666667, ans=0.1 2023-10-10 22:50:36,877 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=507383.3333333333, ans=0.0 2023-10-10 22:50:44,311 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=507430.0, ans=0.0 2023-10-10 22:51:00,123 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.49 vs. limit=15.0 2023-10-10 22:51:01,237 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=507476.6666666667, ans=0.125 2023-10-10 22:51:15,497 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=507570.0, ans=0.125 2023-10-10 22:51:17,387 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=507570.0, ans=0.125 2023-10-10 22:51:42,805 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=507663.3333333333, ans=0.0 2023-10-10 22:51:51,294 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=507710.0, ans=0.1 2023-10-10 22:51:54,653 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=507710.0, ans=0.125 2023-10-10 22:51:59,700 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.312e+02 1.675e+02 1.899e+02 2.135e+02 3.177e+02, threshold=3.797e+02, percent-clipped=0.0 2023-10-10 22:52:26,581 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=507850.0, ans=0.0 2023-10-10 22:52:43,472 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.71 vs. limit=15.0 2023-10-10 22:52:51,438 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=507943.3333333333, ans=0.0 2023-10-10 22:53:09,044 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.69 vs. 
limit=5.0 2023-10-10 22:53:11,094 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=508036.6666666667, ans=0.125 2023-10-10 22:53:13,189 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=508036.6666666667, ans=0.0 2023-10-10 22:53:13,321 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=508036.6666666667, ans=0.125 2023-10-10 22:53:36,136 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.98 vs. limit=15.0 2023-10-10 22:53:39,093 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn1.whiten.whitening_limit, batch_count=508130.0, ans=22.5 2023-10-10 22:53:39,762 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=508130.0, ans=0.2 2023-10-10 22:53:39,784 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=508130.0, ans=0.125 2023-10-10 22:53:44,072 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=508176.6666666667, ans=0.125 2023-10-10 22:53:52,155 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.386e+02 1.695e+02 1.901e+02 2.207e+02 3.570e+02, threshold=3.803e+02, percent-clipped=0.0 2023-10-10 22:53:59,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_na.min_abs, batch_count=508223.3333333333, ans=0.02 2023-10-10 22:53:59,214 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=508223.3333333333, ans=0.1 2023-10-10 22:53:59,445 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.86 vs. 
limit=12.0 2023-10-10 22:54:18,388 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 22:54:29,206 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=508363.3333333333, ans=0.0 2023-10-10 22:54:42,655 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 22:54:58,264 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=508456.6666666667, ans=0.125 2023-10-10 22:55:00,208 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=508456.6666666667, ans=0.125 2023-10-10 22:55:25,202 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=508596.6666666667, ans=0.125 2023-10-10 22:55:37,220 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=508643.3333333333, ans=0.125 2023-10-10 22:55:44,937 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.709e+02 1.911e+02 2.190e+02 3.083e+02, threshold=3.821e+02, percent-clipped=0.0 2023-10-10 22:55:52,838 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.25 vs. limit=22.5 2023-10-10 22:56:16,703 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=508783.3333333333, ans=0.05 2023-10-10 22:56:29,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=508830.0, ans=6.0 2023-10-10 22:56:39,572 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.61 vs. limit=15.0 2023-10-10 22:56:42,719 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=508923.3333333333, ans=0.1 2023-10-10 22:57:16,890 INFO [train.py:1031] (0/4) Epoch 8, batch 13500, loss[loss=0.2068, simple_loss=0.2913, pruned_loss=0.06112, over 16899.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2986, pruned_loss=0.06232, over 32782817.71 frames. ], batch size: 72, lr: 4.43e-03, grad_scale: 32.0 2023-10-10 22:57:28,439 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=509110.0, ans=0.125 2023-10-10 22:57:36,406 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.364e+02 1.597e+02 1.795e+02 2.069e+02 3.001e+02, threshold=3.590e+02, percent-clipped=0.0 2023-10-10 22:57:36,870 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=509110.0, ans=0.1 2023-10-10 22:57:52,450 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.26 vs. 
limit=15.0 2023-10-10 22:58:08,218 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=509250.0, ans=0.1 2023-10-10 22:58:32,970 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=509343.3333333333, ans=0.125 2023-10-10 22:58:48,101 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=509436.6666666667, ans=0.125 2023-10-10 22:59:11,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=509530.0, ans=0.1 2023-10-10 22:59:14,762 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=509530.0, ans=0.125 2023-10-10 22:59:18,341 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=509576.6666666667, ans=0.1 2023-10-10 22:59:24,434 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=509576.6666666667, ans=10.0 2023-10-10 22:59:26,410 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.454e+02 1.703e+02 1.895e+02 2.157e+02 2.727e+02, threshold=3.789e+02, percent-clipped=0.0 2023-10-10 22:59:36,508 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.52 vs. limit=12.0 2023-10-10 22:59:51,421 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=509716.6666666667, ans=0.0 2023-10-10 22:59:57,886 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=509763.3333333333, ans=0.04949747468305833 2023-10-10 23:00:02,404 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/epoch-8.pt 2023-10-10 23:00:33,930 INFO [train.py:1031] (0/4) Epoch 9, batch 0, loss[loss=0.1856, simple_loss=0.277, pruned_loss=0.04712, over 16793.00 frames. ], tot_loss[loss=0.1856, simple_loss=0.277, pruned_loss=0.04712, over 16793.00 frames. ], batch size: 188, lr: 4.15e-03, grad_scale: 32.0 2023-10-10 23:00:33,931 INFO [train.py:1054] (0/4) Computing validation loss 2023-10-10 23:00:42,270 INFO [train.py:1063] (0/4) Epoch 9, validation: loss=0.2237, simple_loss=0.3102, pruned_loss=0.06853, over 1020973.00 frames. 
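The train.py entries above report each batch's loss[...] alongside tot_loss[...], which (judging from the "over N frames" counts, e.g. 16793.00 frames at epoch 9 batch 0 growing to 7286880.94 frames by batch 500) is a frame-weighted running average that resets at the start of each epoch; the validation entry aggregates over the full dev set (1020973.00 frames every time). A minimal sketch of that bookkeeping follows; the class and method names are illustrative assumptions, not icefall's actual MetricsTracker API.

from collections import defaultdict

class LossTracker:
    """Frame-weighted running average, reset at the start of each epoch."""

    def __init__(self) -> None:
        self.sums = defaultdict(float)  # metric name -> frame-weighted sum
        self.frames = 0.0               # total frames aggregated so far

    def update(self, metrics: dict, num_frames: float) -> None:
        # Each batch contributes its per-frame losses weighted by its frame
        # count, so larger batches pull tot_loss more than small ones.
        for name, value in metrics.items():
            self.sums[name] += value * num_frames
        self.frames += num_frames

    def average(self) -> dict:
        # What the log prints as tot_loss[..., over <frames> frames].
        return {name: s / self.frames for name, s in self.sums.items()}

tracker = LossTracker()
tracker.update({"loss": 0.1856, "pruned_loss": 0.04712}, num_frames=16793.0)
tracker.update({"loss": 0.2056, "pruned_loss": 0.06046}, num_frames=16916.0)
print(tracker.average())  # running averages over 33709 frames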
2023-10-10 23:00:42,270 INFO [train.py:1064] (0/4) Maximum memory allocated so far is 17165MB 2023-10-10 23:00:45,818 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=509786.6666666667, ans=0.125 2023-10-10 23:01:01,253 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=509833.3333333333, ans=0.2 2023-10-10 23:01:13,653 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=509880.0, ans=0.0 2023-10-10 23:01:37,694 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff3.min_abs, batch_count=509973.3333333333, ans=0.2 2023-10-10 23:01:49,391 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=510020.0, ans=0.0 2023-10-10 23:01:55,614 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.258e+02 1.657e+02 1.836e+02 2.096e+02 3.061e+02, threshold=3.671e+02, percent-clipped=0.0 2023-10-10 23:02:07,164 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=510113.3333333333, ans=0.125 2023-10-10 23:02:08,987 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=510113.3333333333, ans=0.0 2023-10-10 23:02:08,997 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=510113.3333333333, ans=0.0 2023-10-10 23:02:13,307 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=510113.3333333333, ans=0.125 2023-10-10 23:02:26,337 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=510206.6666666667, ans=0.125 2023-10-10 23:02:30,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=510206.6666666667, ans=0.0 2023-10-10 23:02:30,317 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=510206.6666666667, ans=0.2 2023-10-10 23:02:50,365 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.08 vs. 
limit=15.0 2023-10-10 23:02:55,094 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=510300.0, ans=0.125 2023-10-10 23:03:02,435 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=510346.6666666667, ans=0.0 2023-10-10 23:03:05,917 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=510346.6666666667, ans=0.125 2023-10-10 23:03:09,277 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=510393.3333333333, ans=0.1 2023-10-10 23:03:20,676 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=510440.0, ans=0.0 2023-10-10 23:03:23,415 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=510440.0, ans=0.0 2023-10-10 23:03:25,876 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=510440.0, ans=0.0 2023-10-10 23:03:29,480 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.56 vs. limit=15.0 2023-10-10 23:03:45,556 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.347e+02 1.731e+02 2.020e+02 2.304e+02 2.929e+02, threshold=4.040e+02, percent-clipped=0.0 2023-10-10 23:04:19,974 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=510673.3333333333, ans=0.125 2023-10-10 23:04:31,394 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=510720.0, ans=0.2 2023-10-10 23:04:35,385 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=510720.0, ans=0.0 2023-10-10 23:04:40,842 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=510766.6666666667, ans=0.125 2023-10-10 23:04:56,262 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=510813.3333333333, ans=0.125 2023-10-10 23:05:06,484 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=510860.0, ans=0.0 2023-10-10 23:05:08,216 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=510860.0, ans=0.125 2023-10-10 23:05:14,638 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.55 vs. 
limit=22.5 2023-10-10 23:05:37,686 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=510953.3333333333, ans=0.125 2023-10-10 23:05:40,032 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=511000.0, ans=0.125 2023-10-10 23:05:43,137 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.650e+02 1.775e+02 2.030e+02 2.890e+02, threshold=3.550e+02, percent-clipped=0.0 2023-10-10 23:05:44,615 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.94 vs. limit=6.0 2023-10-10 23:05:47,482 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.35 vs. limit=6.0 2023-10-10 23:05:54,637 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=511046.6666666667, ans=0.2 2023-10-10 23:06:11,390 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=511093.3333333333, ans=0.125 2023-10-10 23:06:11,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=511093.3333333333, ans=0.2 2023-10-10 23:06:13,260 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=511140.0, ans=0.1 2023-10-10 23:06:24,274 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=511140.0, ans=0.0 2023-10-10 23:06:34,770 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=511186.6666666667, ans=0.125 2023-10-10 23:06:51,865 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=511280.0, ans=0.0 2023-10-10 23:06:56,799 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=511280.0, ans=0.125 2023-10-10 23:07:03,546 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=511326.6666666667, ans=0.1 2023-10-10 23:07:11,773 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=511373.3333333333, ans=0.025 2023-10-10 23:07:14,606 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=511373.3333333333, ans=0.125 2023-10-10 23:07:17,204 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.73 vs. limit=6.0 2023-10-10 23:07:33,264 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.384e+02 1.664e+02 1.856e+02 2.171e+02 3.433e+02, threshold=3.712e+02, percent-clipped=0.0 2023-10-10 23:07:38,325 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=511466.6666666667, ans=0.125 2023-10-10 23:07:44,516 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.68 vs. 
limit=6.0 2023-10-10 23:07:47,896 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=11.67 vs. limit=22.5 2023-10-10 23:08:05,889 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.00 vs. limit=15.0 2023-10-10 23:08:27,416 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=511700.0, ans=0.125 2023-10-10 23:08:30,639 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=511700.0, ans=0.2 2023-10-10 23:08:37,856 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=511746.6666666667, ans=0.125 2023-10-10 23:08:52,892 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=511793.3333333333, ans=0.125 2023-10-10 23:08:59,847 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=511840.0, ans=0.1 2023-10-10 23:09:11,053 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 23:09:21,857 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=511933.3333333333, ans=0.0 2023-10-10 23:09:22,597 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 23:09:23,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=511933.3333333333, ans=0.0 2023-10-10 23:09:24,245 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.364e+02 1.654e+02 1.828e+02 1.989e+02 2.734e+02, threshold=3.656e+02, percent-clipped=0.0 2023-10-10 23:09:26,895 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=511933.3333333333, ans=0.0 2023-10-10 23:09:27,005 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=511933.3333333333, ans=0.04949747468305833 2023-10-10 23:09:44,567 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=512026.6666666667, ans=0.0 2023-10-10 23:09:45,876 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=512026.6666666667, ans=0.125 2023-10-10 23:09:51,831 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=512026.6666666667, ans=0.125 2023-10-10 23:10:00,350 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=512073.3333333333, ans=0.0 2023-10-10 23:10:06,665 INFO [train.py:1031] (0/4) Epoch 9, batch 500, loss[loss=0.2056, simple_loss=0.2902, pruned_loss=0.06046, over 16916.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2982, pruned_loss=0.06232, over 7286880.94 frames. 
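Most entries in this section are [scaling.py:199] ScheduledFloat lines: module hyperparameters (dropout p, skip rates, balancer probs, whitening limits) whose printed value "ans" is a function of batch_count. By this point in training (batch_count around 5.1e5) almost all of them have settled at a final value such as 0.0 or 0.125, consistent with a piecewise-linear schedule that reached its last breakpoint long ago. The class below is an illustrative reconstruction of such a schedule, not icefall's actual ScheduledFloat implementation.

class PiecewiseSchedule:
    """A float defined by (batch_count, value) breakpoints, interpolated
    linearly between them and held constant outside them."""

    def __init__(self, *points):
        # points: (count, value) pairs; sorted so counts are ascending.
        self.points = sorted(points)

    def value(self, batch_count: float) -> float:
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        if batch_count >= pts[-1][0]:
            return pts[-1][1]
        for (c0, v0), (c1, v1) in zip(pts, pts[1:]):
            if c0 <= batch_count <= c1:
                t = (batch_count - c0) / (c1 - c0)
                return v0 + t * (v1 - v0)
        raise AssertionError("unreachable: breakpoints cover the interval")

# e.g. a skip rate that decays from 0.2 to 0.0 over the first 4000 batches
# (hypothetical breakpoints) and then stays at 0.0, matching the many
# "ans=0.0" skip-rate entries at batch_count ~5e5 above.
skip_rate = PiecewiseSchedule((0.0, 0.2), (4000.0, 0.0))
print(skip_rate.value(515000.0))  # -> 0.0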
], batch size: 110, lr: 4.14e-03, grad_scale: 32.0 2023-10-10 23:10:10,422 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=512120.0, ans=0.1 2023-10-10 23:10:13,370 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=512120.0, ans=0.125 2023-10-10 23:10:14,998 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.90 vs. limit=15.0 2023-10-10 23:10:19,707 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=512166.6666666667, ans=0.1 2023-10-10 23:10:23,077 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.60 vs. limit=15.0 2023-10-10 23:10:24,945 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.07 vs. limit=15.0 2023-10-10 23:11:00,477 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=512353.3333333333, ans=0.0 2023-10-10 23:11:07,741 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.80 vs. limit=22.5 2023-10-10 23:11:09,045 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=512353.3333333333, ans=0.125 2023-10-10 23:11:12,298 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=512400.0, ans=0.125 2023-10-10 23:11:16,205 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.373e+02 1.668e+02 1.855e+02 2.269e+02 3.835e+02, threshold=3.711e+02, percent-clipped=1.0 2023-10-10 23:11:42,331 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.61 vs. 
limit=15.0 2023-10-10 23:11:50,822 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=512540.0, ans=0.0 2023-10-10 23:11:52,739 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=512540.0, ans=0.0 2023-10-10 23:11:55,940 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=512586.6666666667, ans=0.125 2023-10-10 23:12:01,165 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=512586.6666666667, ans=0.0 2023-10-10 23:12:09,997 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=512633.3333333333, ans=0.025 2023-10-10 23:12:35,725 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=512726.6666666667, ans=0.0 2023-10-10 23:12:43,180 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=512773.3333333333, ans=0.0 2023-10-10 23:12:52,764 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.85 vs. limit=15.0 2023-10-10 23:12:53,449 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=512820.0, ans=0.125 2023-10-10 23:12:58,006 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.74 vs. limit=22.5 2023-10-10 23:13:02,939 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.209e+02 1.685e+02 1.884e+02 2.103e+02 2.806e+02, threshold=3.768e+02, percent-clipped=0.0 2023-10-10 23:13:40,130 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=513006.6666666667, ans=0.125 2023-10-10 23:13:43,086 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=513053.3333333333, ans=0.0 2023-10-10 23:13:47,098 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.02 vs. limit=15.0 2023-10-10 23:14:11,746 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=513193.3333333333, ans=10.0 2023-10-10 23:14:24,374 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=513193.3333333333, ans=0.125 2023-10-10 23:14:53,921 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=513333.3333333333, ans=0.125 2023-10-10 23:14:54,560 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.779e+02 2.127e+02 2.346e+02 3.752e+02, threshold=4.255e+02, percent-clipped=0.0 2023-10-10 23:14:55,286 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.60 vs. limit=15.0 2023-10-10 23:15:11,521 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.88 vs. 
limit=5.0 2023-10-10 23:15:31,684 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.46 vs. limit=15.0 2023-10-10 23:15:33,759 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.12 vs. limit=15.0 2023-10-10 23:15:48,454 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.34 vs. limit=22.5 2023-10-10 23:15:54,435 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=513566.6666666667, ans=0.125 2023-10-10 23:16:09,502 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=513613.3333333333, ans=0.0 2023-10-10 23:16:11,766 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 23:16:18,830 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.89 vs. limit=6.0 2023-10-10 23:16:28,565 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.38 vs. limit=10.0 2023-10-10 23:16:49,544 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.97 vs. limit=15.0 2023-10-10 23:16:52,146 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=513800.0, ans=0.0 2023-10-10 23:16:53,162 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.428e+02 1.736e+02 1.976e+02 2.360e+02 3.214e+02, threshold=3.952e+02, percent-clipped=0.0 2023-10-10 23:17:08,020 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=513846.6666666667, ans=0.125 2023-10-10 23:17:18,025 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 23:17:30,568 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=513940.0, ans=0.1 2023-10-10 23:17:42,091 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=514033.3333333333, ans=0.1 2023-10-10 23:18:26,728 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.55 vs. 
limit=22.5 2023-10-10 23:18:39,178 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=514220.0, ans=0.2 2023-10-10 23:18:44,688 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.334e+02 1.626e+02 1.743e+02 1.973e+02 2.695e+02, threshold=3.486e+02, percent-clipped=0.0 2023-10-10 23:18:56,544 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=514313.3333333333, ans=0.125 2023-10-10 23:18:59,731 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=514313.3333333333, ans=0.125 2023-10-10 23:18:59,761 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=514313.3333333333, ans=0.125 2023-10-10 23:19:08,224 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.74 vs. limit=6.0 2023-10-10 23:19:09,303 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.64 vs. limit=15.0 2023-10-10 23:19:15,538 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=514406.6666666667, ans=0.0 2023-10-10 23:19:23,002 INFO [train.py:1031] (0/4) Epoch 9, batch 1000, loss[loss=0.1924, simple_loss=0.2873, pruned_loss=0.04877, over 16798.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2991, pruned_loss=0.06263, over 12922463.17 frames. ], batch size: 98, lr: 4.13e-03, grad_scale: 32.0 2023-10-10 23:19:28,932 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=514453.3333333333, ans=0.125 2023-10-10 23:19:36,548 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=514500.0, ans=0.125 2023-10-10 23:19:49,977 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=514546.6666666667, ans=0.1 2023-10-10 23:19:51,750 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=514546.6666666667, ans=0.2 2023-10-10 23:19:54,384 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=514593.3333333333, ans=0.0 2023-10-10 23:20:20,092 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.46 vs. limit=6.0 2023-10-10 23:20:29,553 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.389e+02 1.675e+02 1.911e+02 2.143e+02 3.098e+02, threshold=3.821e+02, percent-clipped=0.0 2023-10-10 23:20:34,082 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.87 vs. 
limit=22.5 2023-10-10 23:20:57,583 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=514873.3333333333, ans=0.05 2023-10-10 23:21:00,847 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=514873.3333333333, ans=0.0 2023-10-10 23:21:18,230 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.04 vs. limit=15.0 2023-10-10 23:21:28,985 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=514966.6666666667, ans=0.0 2023-10-10 23:21:31,852 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=514966.6666666667, ans=0.2 2023-10-10 23:21:42,606 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=515013.3333333333, ans=0.0 2023-10-10 23:22:07,503 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=515106.6666666667, ans=0.125 2023-10-10 23:22:14,710 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.19 vs. limit=15.0 2023-10-10 23:22:29,703 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.351e+02 1.649e+02 1.859e+02 2.076e+02 2.784e+02, threshold=3.719e+02, percent-clipped=0.0 2023-10-10 23:22:48,733 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.77 vs. limit=22.5 2023-10-10 23:22:48,781 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.42 vs. limit=15.0 2023-10-10 23:22:55,617 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=515293.3333333333, ans=0.2 2023-10-10 23:22:56,619 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=515293.3333333333, ans=0.125 2023-10-10 23:23:13,400 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=515386.6666666667, ans=0.125 2023-10-10 23:23:20,594 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.49 vs. limit=15.0 2023-10-10 23:23:25,995 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=515433.3333333333, ans=0.125 2023-10-10 23:23:39,205 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.52 vs. 
limit=12.0 2023-10-10 23:23:47,492 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=515526.6666666667, ans=0.0 2023-10-10 23:23:54,297 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=515526.6666666667, ans=0.125 2023-10-10 23:24:01,233 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 23:24:16,742 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=515620.0, ans=0.2 2023-10-10 23:24:18,649 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=515620.0, ans=0.1 2023-10-10 23:24:20,570 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=515666.6666666667, ans=0.0 2023-10-10 23:24:20,916 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.32 vs. limit=15.0 2023-10-10 23:24:24,495 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.417e+02 1.675e+02 1.838e+02 2.119e+02 2.965e+02, threshold=3.676e+02, percent-clipped=0.0 2023-10-10 23:25:04,607 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.22 vs. limit=22.5 2023-10-10 23:25:09,056 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.47 vs. limit=22.5 2023-10-10 23:25:11,196 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.97 vs. limit=6.0 2023-10-10 23:25:14,654 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=515900.0, ans=0.95 2023-10-10 23:25:23,808 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=515900.0, ans=0.125 2023-10-10 23:25:56,686 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=516040.0, ans=0.125 2023-10-10 23:26:03,442 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.23 vs. limit=15.0 2023-10-10 23:26:14,388 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.256e+02 1.665e+02 1.833e+02 2.003e+02 2.892e+02, threshold=3.667e+02, percent-clipped=0.0 2023-10-10 23:26:16,373 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.06 vs. limit=22.5 2023-10-10 23:26:29,244 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=516180.0, ans=0.125 2023-10-10 23:26:29,526 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.57 vs. 
limit=12.0 2023-10-10 23:26:43,351 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_na.min_abs, batch_count=516273.3333333333, ans=0.02 2023-10-10 23:26:44,106 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=516273.3333333333, ans=0.125 2023-10-10 23:26:47,942 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=516273.3333333333, ans=0.0 2023-10-10 23:26:52,358 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=516320.0, ans=0.125 2023-10-10 23:26:55,471 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=516320.0, ans=0.07 2023-10-10 23:27:07,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=516366.6666666667, ans=0.05 2023-10-10 23:27:10,480 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=516366.6666666667, ans=0.1 2023-10-10 23:27:36,328 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=516460.0, ans=0.125 2023-10-10 23:27:44,614 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.00 vs. limit=6.0 2023-10-10 23:27:45,347 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=516506.6666666667, ans=0.0 2023-10-10 23:27:49,214 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.10 vs. limit=15.0 2023-10-10 23:28:05,090 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.48 vs. limit=15.0 2023-10-10 23:28:08,845 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.324e+02 1.593e+02 1.810e+02 1.947e+02 2.398e+02, threshold=3.620e+02, percent-clipped=0.0 2023-10-10 23:28:16,578 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.24 vs. limit=15.0 2023-10-10 23:28:22,008 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=516646.6666666667, ans=0.1 2023-10-10 23:28:35,764 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=516740.0, ans=0.125 2023-10-10 23:28:37,974 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 23:28:41,241 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=516740.0, ans=0.1 2023-10-10 23:28:49,642 INFO [train.py:1031] (0/4) Epoch 9, batch 1500, loss[loss=0.1853, simple_loss=0.2789, pruned_loss=0.04587, over 16820.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2971, pruned_loss=0.06137, over 17345859.66 frames. 
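The [scaling.py:979] Whitening lines are diagnostics comparing a per-module metric against a limit (e.g. "metric=18.34 vs. limit=22.5"), apparently measuring how decorrelated a module's activations are across channels. One plausible reconstruction of such a metric is sketched below: it equals 1.0 when the channel covariance is isotropic and grows as variance concentrates in a few directions, which fits the printed values (all at least 1, occasionally far above the limit). icefall's actual Whiten module may differ in detail.

import torch

def whitening_metric(x: torch.Tensor) -> torch.Tensor:
    """x: (num_frames, num_channels) activations for one module."""
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.t() @ x) / x.shape[0]        # channel covariance
    d = cov.shape[0]
    # trace(C @ C) * d / trace(C)^2 equals E[lambda^2] / E[lambda]^2 over
    # the eigenvalues lambda of C, so it is 1.0 iff C is a multiple of I
    # (fully "white") and larger otherwise.
    return d * torch.trace(cov @ cov) / torch.trace(cov) ** 2

white = torch.randn(1000, 256)                  # ~isotropic: metric near 1
skewed = white * torch.linspace(0.1, 3.0, 256)  # anisotropic: clearly larger
print(float(whitening_metric(white)), float(whitening_metric(skewed)))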
], batch size: 98, lr: 4.12e-03, grad_scale: 32.0 2023-10-10 23:28:52,702 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.80 vs. limit=6.0 2023-10-10 23:28:58,324 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=516786.6666666667, ans=0.125 2023-10-10 23:29:13,676 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=516880.0, ans=0.125 2023-10-10 23:29:16,146 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=516880.0, ans=0.125 2023-10-10 23:29:32,131 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=516926.6666666667, ans=0.1 2023-10-10 23:29:35,223 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=516926.6666666667, ans=0.125 2023-10-10 23:29:39,976 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=516973.3333333333, ans=0.125 2023-10-10 23:30:00,465 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-10 23:30:04,618 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.725e+02 1.886e+02 2.081e+02 2.812e+02, threshold=3.773e+02, percent-clipped=0.0 2023-10-10 23:30:16,150 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=517113.3333333333, ans=0.2 2023-10-10 23:30:24,196 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=517160.0, ans=0.0 2023-10-10 23:30:31,134 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-10 23:30:31,387 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.02 vs. limit=12.0 2023-10-10 23:30:32,130 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=517160.0, ans=0.0 2023-10-10 23:30:53,315 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=517253.3333333333, ans=0.0 2023-10-10 23:31:14,468 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.89 vs. limit=15.0 2023-10-10 23:31:19,798 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.58 vs. limit=15.0 2023-10-10 23:31:34,805 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=517393.3333333333, ans=0.1 2023-10-10 23:31:38,338 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=517440.0, ans=0.0 2023-10-10 23:31:59,458 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.33 vs. 
limit=22.5 2023-10-10 23:32:05,374 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.647e+02 1.801e+02 2.106e+02 2.923e+02, threshold=3.602e+02, percent-clipped=0.0 2023-10-10 23:32:20,858 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.61 vs. limit=10.0 2023-10-10 23:32:28,399 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=517626.6666666667, ans=0.125 2023-10-10 23:32:44,593 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=517720.0, ans=0.1 2023-10-10 23:32:57,171 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=517766.6666666667, ans=0.2 2023-10-10 23:33:02,004 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=517766.6666666667, ans=0.07 2023-10-10 23:33:37,792 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.37 vs. limit=10.0 2023-10-10 23:33:39,528 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=517953.3333333333, ans=0.0 2023-10-10 23:33:43,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=517953.3333333333, ans=0.125 2023-10-10 23:33:45,656 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=517953.3333333333, ans=0.035 2023-10-10 23:33:49,994 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=518000.0, ans=0.0 2023-10-10 23:33:52,474 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.441e+02 1.742e+02 1.920e+02 2.128e+02 3.345e+02, threshold=3.840e+02, percent-clipped=0.0 2023-10-10 23:33:56,098 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=518000.0, ans=0.125 2023-10-10 23:34:11,711 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=518093.3333333333, ans=0.2 2023-10-10 23:34:17,680 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=518093.3333333333, ans=0.125 2023-10-10 23:34:24,937 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=518093.3333333333, ans=0.07 2023-10-10 23:34:33,042 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=518140.0, ans=0.125 2023-10-10 23:34:50,394 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=518233.3333333333, ans=0.125 2023-10-10 23:34:52,997 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.38 vs. 
limit=22.5 2023-10-10 23:34:53,933 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=518233.3333333333, ans=0.125 2023-10-10 23:34:53,938 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=518233.3333333333, ans=0.2 2023-10-10 23:34:55,188 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.21 vs. limit=12.0 2023-10-10 23:34:57,053 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.56 vs. limit=15.0 2023-10-10 23:35:03,961 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=518280.0, ans=0.2 2023-10-10 23:35:11,487 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=518326.6666666667, ans=0.125 2023-10-10 23:35:11,812 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.77 vs. limit=15.0 2023-10-10 23:35:12,546 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=518326.6666666667, ans=0.125 2023-10-10 23:35:29,293 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.31 vs. limit=10.0 2023-10-10 23:35:35,572 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=518420.0, ans=0.125 2023-10-10 23:35:39,925 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=518420.0, ans=0.0 2023-10-10 23:35:46,838 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.349e+02 1.702e+02 1.886e+02 2.106e+02 2.813e+02, threshold=3.771e+02, percent-clipped=0.0 2023-10-10 23:36:06,833 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=518560.0, ans=0.125 2023-10-10 23:36:12,127 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=518560.0, ans=0.125 2023-10-10 23:36:34,535 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=518653.3333333333, ans=0.125 2023-10-10 23:37:10,452 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=518793.3333333333, ans=0.0 2023-10-10 23:37:14,107 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.59 vs. limit=12.0 2023-10-10 23:37:26,973 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.88 vs. limit=6.0 2023-10-10 23:37:30,553 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=518886.6666666667, ans=0.125 2023-10-10 23:37:31,855 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.10 vs. 
limit=10.0 2023-10-10 23:37:38,169 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=518886.6666666667, ans=0.2 2023-10-10 23:37:49,178 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.410e+02 1.716e+02 1.927e+02 2.183e+02 2.849e+02, threshold=3.853e+02, percent-clipped=0.0 2023-10-10 23:37:49,850 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=518933.3333333333, ans=0.1 2023-10-10 23:38:04,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=518980.0, ans=0.2 2023-10-10 23:38:33,698 INFO [train.py:1031] (0/4) Epoch 9, batch 2000, loss[loss=0.2003, simple_loss=0.2946, pruned_loss=0.05301, over 16449.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2977, pruned_loss=0.06149, over 20767615.33 frames. ], batch size: 50, lr: 4.11e-03, grad_scale: 32.0 2023-10-10 23:38:47,083 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.56 vs. limit=12.0 2023-10-10 23:39:11,573 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=519213.3333333333, ans=0.125 2023-10-10 23:39:39,341 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=519353.3333333333, ans=0.125 2023-10-10 23:39:46,461 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.51 vs. limit=15.0 2023-10-10 23:39:55,674 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.338e+02 1.608e+02 1.829e+02 2.171e+02 3.295e+02, threshold=3.657e+02, percent-clipped=0.0 2023-10-10 23:39:57,022 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=519400.0, ans=0.1 2023-10-10 23:39:57,060 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=519400.0, ans=0.125 2023-10-10 23:39:59,221 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=519400.0, ans=0.125 2023-10-10 23:40:01,083 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=519400.0, ans=0.0 2023-10-10 23:40:03,022 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=519446.6666666667, ans=0.2 2023-10-10 23:40:04,165 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=519446.6666666667, ans=0.1 2023-10-10 23:40:11,510 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=519446.6666666667, ans=0.2 2023-10-10 23:40:25,874 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=519540.0, ans=0.125 2023-10-10 23:41:52,040 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=519773.3333333333, ans=0.125 2023-10-10 23:42:16,463 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.307e+02 1.680e+02 1.845e+02 2.077e+02 3.271e+02, threshold=3.689e+02, percent-clipped=0.0 2023-10-10 
23:42:55,534 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=520006.6666666667, ans=0.1 2023-10-10 23:42:58,238 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=520006.6666666667, ans=0.0 2023-10-10 23:43:00,840 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=520053.3333333333, ans=0.1 2023-10-10 23:43:05,986 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.05 vs. limit=15.0 2023-10-10 23:43:11,498 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=520100.0, ans=0.0 2023-10-10 23:43:12,320 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=520100.0, ans=0.0 2023-10-10 23:43:23,151 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=520146.6666666667, ans=10.0 2023-10-10 23:43:30,116 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=520146.6666666667, ans=0.125 2023-10-10 23:43:47,109 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=520240.0, ans=0.125 2023-10-10 23:43:48,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=520240.0, ans=0.125 2023-10-10 23:44:04,102 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=520333.3333333333, ans=0.0 2023-10-10 23:44:09,230 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.330e+02 1.743e+02 2.004e+02 2.204e+02 2.872e+02, threshold=4.009e+02, percent-clipped=0.0 2023-10-10 23:44:33,697 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=520426.6666666667, ans=0.0 2023-10-10 23:44:41,143 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=520473.3333333333, ans=0.0 2023-10-10 23:44:49,953 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=520520.0, ans=0.1 2023-10-10 23:45:10,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=520613.3333333333, ans=0.125 2023-10-10 23:45:24,341 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.55 vs. 
limit=6.0 2023-10-10 23:45:58,894 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.382e+02 1.703e+02 1.875e+02 2.055e+02 3.041e+02, threshold=3.750e+02, percent-clipped=0.0 2023-10-10 23:46:16,571 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=520893.3333333333, ans=0.125 2023-10-10 23:46:21,150 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=520893.3333333333, ans=0.09899494936611666 2023-10-10 23:46:27,121 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=520940.0, ans=0.125 2023-10-10 23:46:40,816 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.49 vs. limit=15.0 2023-10-10 23:47:07,817 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 23:47:12,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=521126.6666666667, ans=0.1 2023-10-10 23:47:24,205 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=521173.3333333333, ans=0.125 2023-10-10 23:47:44,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=521266.6666666667, ans=0.0 2023-10-10 23:47:46,176 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.737e+02 1.986e+02 2.280e+02 3.860e+02, threshold=3.971e+02, percent-clipped=1.0 2023-10-10 23:47:53,005 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=521313.3333333333, ans=0.125 2023-10-10 23:47:56,105 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=521313.3333333333, ans=0.0 2023-10-10 23:47:58,155 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=521313.3333333333, ans=0.125 2023-10-10 23:48:07,298 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-10 23:48:24,368 INFO [train.py:1031] (0/4) Epoch 9, batch 2500, loss[loss=0.2234, simple_loss=0.3104, pruned_loss=0.06825, over 16344.00 frames. ], tot_loss[loss=0.211, simple_loss=0.298, pruned_loss=0.06203, over 23410121.75 frames. 
], batch size: 50, lr: 4.10e-03, grad_scale: 32.0 2023-10-10 23:48:34,570 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=521500.0, ans=0.0 2023-10-10 23:48:41,432 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=521500.0, ans=0.1 2023-10-10 23:48:55,730 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=521546.6666666667, ans=0.125 2023-10-10 23:49:01,424 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=521593.3333333333, ans=0.125 2023-10-10 23:49:02,567 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=521593.3333333333, ans=0.2 2023-10-10 23:49:07,783 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=521640.0, ans=0.125 2023-10-10 23:49:23,939 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=521686.6666666667, ans=0.125 2023-10-10 23:49:25,970 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=521686.6666666667, ans=0.125 2023-10-10 23:49:33,200 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.515e+02 1.804e+02 2.002e+02 2.243e+02 3.113e+02, threshold=4.004e+02, percent-clipped=0.0 2023-10-10 23:49:51,556 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.79 vs. limit=15.0 2023-10-10 23:49:57,518 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=521826.6666666667, ans=0.09899494936611666 2023-10-10 23:49:57,554 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=521826.6666666667, ans=0.125 2023-10-10 23:50:05,074 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=521873.3333333333, ans=0.0 2023-10-10 23:50:30,063 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=521966.6666666667, ans=0.125 2023-10-10 23:50:37,140 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.49 vs. limit=15.0 2023-10-10 23:50:41,147 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=522013.3333333333, ans=0.05 2023-10-10 23:50:48,506 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.09 vs. limit=12.0 2023-10-10 23:50:56,636 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=522106.6666666667, ans=0.0 2023-10-10 23:51:07,481 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.89 vs. 
limit=6.0 2023-10-10 23:51:10,416 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=522153.3333333333, ans=0.125 2023-10-10 23:51:13,360 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=522153.3333333333, ans=0.0 2023-10-10 23:51:26,071 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.689e+02 1.947e+02 2.156e+02 3.168e+02, threshold=3.893e+02, percent-clipped=0.0 2023-10-10 23:51:34,644 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.48 vs. limit=22.5 2023-10-10 23:51:44,899 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=522293.3333333333, ans=0.0 2023-10-10 23:51:50,276 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=522293.3333333333, ans=0.125 2023-10-10 23:51:58,574 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.40 vs. limit=22.5 2023-10-10 23:51:59,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=522340.0, ans=0.125 2023-10-10 23:52:18,235 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.78 vs. limit=15.0 2023-10-10 23:52:39,797 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=522480.0, ans=0.0 2023-10-10 23:52:41,905 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=522526.6666666667, ans=0.125 2023-10-10 23:52:49,014 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=522526.6666666667, ans=0.125 2023-10-10 23:52:56,561 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=522573.3333333333, ans=0.2 2023-10-10 23:53:11,680 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=522620.0, ans=0.125 2023-10-10 23:53:12,927 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=522620.0, ans=0.0 2023-10-10 23:53:16,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=522620.0, ans=0.2 2023-10-10 23:53:21,683 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-112000.pt 2023-10-10 23:53:27,453 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.67 vs. limit=15.0 2023-10-10 23:53:30,442 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.460e+02 1.762e+02 2.055e+02 2.342e+02 3.249e+02, threshold=4.109e+02, percent-clipped=0.0 2023-10-10 23:53:39,061 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.64 vs. 
limit=15.0 2023-10-10 23:54:01,226 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=522806.6666666667, ans=0.125 2023-10-10 23:54:18,939 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.48 vs. limit=22.5 2023-10-10 23:54:44,532 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=522946.6666666667, ans=0.125 2023-10-10 23:54:45,682 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.05 vs. limit=15.0 2023-10-10 23:54:46,440 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=522993.3333333333, ans=0.0 2023-10-10 23:54:48,773 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=522993.3333333333, ans=0.125 2023-10-10 23:55:02,662 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=523040.0, ans=0.125 2023-10-10 23:55:12,871 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.22 vs. limit=15.0 2023-10-10 23:55:25,156 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=523133.3333333333, ans=0.125 2023-10-10 23:55:27,104 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=523133.3333333333, ans=0.125 2023-10-10 23:55:32,374 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.316e+02 1.661e+02 1.839e+02 2.137e+02 2.865e+02, threshold=3.678e+02, percent-clipped=0.0 2023-10-10 23:55:35,437 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=523133.3333333333, ans=0.0 2023-10-10 23:55:42,186 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=523180.0, ans=0.1 2023-10-10 23:56:23,807 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=523320.0, ans=0.125 2023-10-10 23:56:28,313 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=523366.6666666667, ans=0.125 2023-10-10 23:56:37,534 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=523413.3333333333, ans=0.125 2023-10-10 23:56:52,967 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=523460.0, ans=0.125 2023-10-10 23:56:58,215 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=523460.0, ans=0.0 2023-10-10 23:56:58,595 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.52 vs. limit=15.0 2023-10-10 23:57:03,513 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.38 vs. 
limit=15.0 2023-10-10 23:57:08,736 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=523506.6666666667, ans=0.1 2023-10-10 23:57:30,692 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.397e+02 1.743e+02 2.036e+02 2.306e+02 2.961e+02, threshold=4.073e+02, percent-clipped=0.0 2023-10-10 23:57:32,238 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=523600.0, ans=0.125 2023-10-10 23:57:45,882 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=523693.3333333333, ans=0.07 2023-10-10 23:57:57,076 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=523740.0, ans=0.125 2023-10-10 23:58:02,445 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=523740.0, ans=0.0 2023-10-10 23:58:08,585 INFO [train.py:1031] (0/4) Epoch 9, batch 3000, loss[loss=0.1936, simple_loss=0.2853, pruned_loss=0.05098, over 16008.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2971, pruned_loss=0.06184, over 25484540.80 frames. ], batch size: 43, lr: 4.09e-03, grad_scale: 16.0 2023-10-10 23:58:36,979 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.75 vs. limit=6.0 2023-10-10 23:58:40,580 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=523880.0, ans=0.0 2023-10-10 23:58:41,646 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=523926.6666666667, ans=0.125 2023-10-10 23:59:09,178 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=524020.0, ans=0.0 2023-10-10 23:59:11,655 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=524020.0, ans=0.125 2023-10-10 23:59:24,093 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.368e+02 1.692e+02 1.846e+02 2.057e+02 2.838e+02, threshold=3.692e+02, percent-clipped=0.0 2023-10-10 23:59:42,475 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=524160.0, ans=0.1 2023-10-10 23:59:48,237 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=524160.0, ans=0.0 2023-10-10 23:59:52,041 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=524206.6666666667, ans=0.125 2023-10-10 23:59:54,788 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=524206.6666666667, ans=0.0 2023-10-11 00:00:20,244 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=524300.0, ans=0.125 2023-10-11 00:00:26,208 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=524300.0, ans=0.0 2023-10-11 00:00:28,817 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=524346.6666666666, ans=0.125 2023-10-11 
00:00:33,873 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=524346.6666666666, ans=0.125 2023-10-11 00:00:52,361 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=524440.0, ans=0.125 2023-10-11 00:00:54,050 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=524440.0, ans=0.125 2023-10-11 00:00:55,008 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=524440.0, ans=0.125 2023-10-11 00:01:18,504 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.648e+02 1.996e+02 2.530e+02 3.660e+02, threshold=3.992e+02, percent-clipped=0.0 2023-10-11 00:01:30,939 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.30 vs. limit=22.5 2023-10-11 00:01:31,844 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.16 vs. limit=22.5 2023-10-11 00:01:34,273 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=524626.6666666666, ans=0.125 2023-10-11 00:01:47,076 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=524673.3333333334, ans=0.125 2023-10-11 00:01:57,880 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=524720.0, ans=0.125 2023-10-11 00:02:22,274 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.48 vs. limit=15.0 2023-10-11 00:02:34,600 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=524813.3333333334, ans=0.2 2023-10-11 00:02:46,115 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.68 vs. 
limit=15.0 2023-10-11 00:03:23,543 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.341e+02 1.688e+02 1.937e+02 2.215e+02 3.236e+02, threshold=3.875e+02, percent-clipped=0.0 2023-10-11 00:03:28,448 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=525046.6666666666, ans=0.0 2023-10-11 00:03:28,485 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=525046.6666666666, ans=0.125 2023-10-11 00:03:40,541 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=525093.3333333334, ans=0.125 2023-10-11 00:03:50,863 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=525140.0, ans=0.1 2023-10-11 00:03:56,624 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=525140.0, ans=0.125 2023-10-11 00:04:34,081 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=525326.6666666666, ans=0.0 2023-10-11 00:05:15,471 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.686e+02 1.881e+02 2.242e+02 3.098e+02, threshold=3.762e+02, percent-clipped=0.0 2023-10-11 00:05:18,744 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=525513.3333333334, ans=0.125 2023-10-11 00:05:34,185 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=525560.0, ans=0.2 2023-10-11 00:05:34,334 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=525560.0, ans=0.0 2023-10-11 00:05:35,567 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=525560.0, ans=0.125 2023-10-11 00:05:48,321 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=525606.6666666666, ans=0.125 2023-10-11 00:05:50,431 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=525606.6666666666, ans=0.125 2023-10-11 00:06:02,689 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=525653.3333333334, ans=0.1 2023-10-11 00:06:11,552 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=525700.0, ans=0.0 2023-10-11 00:06:21,366 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=525746.6666666666, ans=0.125 2023-10-11 00:06:26,032 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=525746.6666666666, ans=0.0 2023-10-11 00:06:42,587 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=525840.0, ans=0.05 2023-10-11 00:07:07,662 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.717e+02 1.863e+02 2.032e+02 2.776e+02, threshold=3.726e+02, percent-clipped=0.0 2023-10-11 00:07:08,024 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=525933.3333333334, ans=0.125 2023-10-11 00:07:08,027 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=525933.3333333334, ans=0.2 2023-10-11 00:07:13,620 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.463e-02 2023-10-11 00:07:19,130 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=525980.0, ans=0.0 2023-10-11 00:07:39,437 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=526073.3333333334, ans=0.125 2023-10-11 00:07:45,737 INFO [train.py:1031] (0/4) Epoch 9, batch 3500, loss[loss=0.2087, simple_loss=0.3003, pruned_loss=0.05856, over 15924.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.297, pruned_loss=0.06177, over 27124967.05 frames. ], batch size: 43, lr: 4.08e-03, grad_scale: 16.0 2023-10-11 00:07:54,999 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=526166.6666666666, ans=0.125 2023-10-11 00:08:01,372 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.98 vs. limit=15.0 2023-10-11 00:08:02,301 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.16 vs. limit=15.0 2023-10-11 00:08:12,696 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=526213.3333333334, ans=0.125 2023-10-11 00:08:23,739 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.62 vs. limit=15.0 2023-10-11 00:09:02,458 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.776e+02 1.902e+02 2.199e+02 3.450e+02, threshold=3.804e+02, percent-clipped=0.0 2023-10-11 00:09:34,726 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.whiten.whitening_limit, batch_count=526540.0, ans=15.0 2023-10-11 00:09:42,301 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=526540.0, ans=0.125 2023-10-11 00:09:42,366 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=526540.0, ans=0.2 2023-10-11 00:09:45,240 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.56 vs. 
limit=22.5 2023-10-11 00:09:46,000 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=526586.6666666666, ans=0.125 2023-10-11 00:10:06,507 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=526680.0, ans=0.125 2023-10-11 00:10:28,647 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=526726.6666666666, ans=0.125 2023-10-11 00:11:02,383 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.354e+02 1.679e+02 1.934e+02 2.270e+02 3.371e+02, threshold=3.869e+02, percent-clipped=0.0 2023-10-11 00:11:03,647 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=526866.6666666666, ans=0.2 2023-10-11 00:11:08,768 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=526913.3333333334, ans=0.125 2023-10-11 00:11:09,853 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.80 vs. limit=15.0 2023-10-11 00:11:23,785 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=526960.0, ans=0.2 2023-10-11 00:11:37,413 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=527006.6666666666, ans=0.1 2023-10-11 00:11:44,017 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=527053.3333333334, ans=0.0 2023-10-11 00:11:44,392 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.27 vs. limit=6.0 2023-10-11 00:12:14,783 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.29 vs. limit=15.0 2023-10-11 00:12:27,346 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=527193.3333333334, ans=0.1 2023-10-11 00:12:43,090 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=527286.6666666666, ans=0.125 2023-10-11 00:12:49,222 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=527286.6666666666, ans=0.0 2023-10-11 00:12:51,779 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=527286.6666666666, ans=0.2 2023-10-11 00:12:52,690 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=527286.6666666666, ans=0.0 2023-10-11 00:13:04,583 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.327e+02 1.649e+02 1.778e+02 2.034e+02 3.027e+02, threshold=3.556e+02, percent-clipped=0.0 2023-10-11 00:13:09,769 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.86 vs. 
limit=15.0 2023-10-11 00:13:10,649 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=527380.0, ans=0.125 2023-10-11 00:13:13,960 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 00:13:43,837 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=527520.0, ans=0.1 2023-10-11 00:14:11,460 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.48 vs. limit=15.0 2023-10-11 00:14:23,977 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=527660.0, ans=0.035 2023-10-11 00:14:28,920 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=527706.6666666666, ans=0.1 2023-10-11 00:14:32,838 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=527706.6666666666, ans=0.1 2023-10-11 00:14:34,602 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=527706.6666666666, ans=0.125 2023-10-11 00:14:51,395 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=527800.0, ans=0.0 2023-10-11 00:14:59,509 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.260e+02 1.634e+02 1.812e+02 2.083e+02 2.773e+02, threshold=3.624e+02, percent-clipped=0.0 2023-10-11 00:15:15,694 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=527893.3333333334, ans=0.0 2023-10-11 00:15:28,395 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=527940.0, ans=0.1 2023-10-11 00:15:41,177 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=527986.6666666666, ans=0.1 2023-10-11 00:15:44,093 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.66 vs. limit=22.5 2023-10-11 00:15:50,705 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.24 vs. 
limit=15.0 2023-10-11 00:15:54,776 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=528033.3333333334, ans=0.125 2023-10-11 00:16:25,914 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=528173.3333333334, ans=0.125 2023-10-11 00:16:33,778 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=528220.0, ans=10.0 2023-10-11 00:16:41,674 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=528266.6666666666, ans=0.125 2023-10-11 00:16:50,317 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.186e+02 1.645e+02 1.994e+02 2.219e+02 3.079e+02, threshold=3.988e+02, percent-clipped=0.0 2023-10-11 00:16:50,856 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=528266.6666666666, ans=0.125 2023-10-11 00:17:01,117 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=528313.3333333334, ans=0.07 2023-10-11 00:17:06,405 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=528360.0, ans=0.125 2023-10-11 00:17:11,972 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=528360.0, ans=0.125 2023-10-11 00:17:21,290 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=528406.6666666666, ans=0.1 2023-10-11 00:17:22,764 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 00:17:26,107 INFO [train.py:1031] (0/4) Epoch 9, batch 4000, loss[loss=0.2008, simple_loss=0.2999, pruned_loss=0.05084, over 16841.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2968, pruned_loss=0.06196, over 28354085.32 frames. ], batch size: 98, lr: 4.07e-03, grad_scale: 32.0 2023-10-11 00:17:30,528 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=528453.3333333334, ans=0.125 2023-10-11 00:17:43,127 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=528500.0, ans=0.125 2023-10-11 00:17:55,964 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=528546.6666666666, ans=0.1 2023-10-11 00:18:00,377 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.33 vs. 
limit=15.0 2023-10-11 00:18:16,845 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=528640.0, ans=0.125 2023-10-11 00:18:17,841 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=528640.0, ans=0.125 2023-10-11 00:18:44,910 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.243e+02 1.771e+02 1.999e+02 2.294e+02 3.223e+02, threshold=3.998e+02, percent-clipped=0.0 2023-10-11 00:18:46,844 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=528733.3333333334, ans=0.05 2023-10-11 00:19:13,098 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=528873.3333333334, ans=0.1 2023-10-11 00:19:25,737 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=528920.0, ans=0.125 2023-10-11 00:19:39,316 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=528966.6666666666, ans=10.0 2023-10-11 00:19:51,128 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=529013.3333333334, ans=0.125 2023-10-11 00:20:09,420 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=529106.6666666666, ans=0.0 2023-10-11 00:20:11,055 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=529106.6666666666, ans=0.04949747468305833 2023-10-11 00:20:14,001 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=529106.6666666666, ans=0.1 2023-10-11 00:20:38,039 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=529200.0, ans=0.0 2023-10-11 00:20:40,951 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.422e+02 1.725e+02 1.850e+02 2.121e+02 3.418e+02, threshold=3.700e+02, percent-clipped=0.0 2023-10-11 00:20:59,713 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=529246.6666666666, ans=0.0 2023-10-11 00:21:02,772 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.80 vs. 
limit=15.0 2023-10-11 00:21:58,181 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=529480.0, ans=10.0 2023-10-11 00:21:59,532 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=529480.0, ans=0.125 2023-10-11 00:22:18,866 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=529573.3333333334, ans=0.125 2023-10-11 00:22:19,834 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=529573.3333333334, ans=0.0 2023-10-11 00:22:20,922 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=529573.3333333334, ans=0.2 2023-10-11 00:22:24,561 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=529620.0, ans=0.125 2023-10-11 00:22:33,658 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=529666.6666666666, ans=0.2 2023-10-11 00:22:41,634 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.378e+02 1.711e+02 1.939e+02 2.267e+02 3.169e+02, threshold=3.878e+02, percent-clipped=0.0 2023-10-11 00:22:48,039 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=529713.3333333334, ans=0.125 2023-10-11 00:22:51,068 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=529713.3333333334, ans=0.125 2023-10-11 00:23:03,893 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.72 vs. limit=6.0 2023-10-11 00:23:20,537 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.31 vs. limit=15.0 2023-10-11 00:23:40,187 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=529946.6666666666, ans=0.2 2023-10-11 00:23:41,865 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=529946.6666666666, ans=0.2 2023-10-11 00:23:58,972 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=529993.3333333334, ans=0.125 2023-10-11 00:24:03,602 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=530040.0, ans=0.125 2023-10-11 00:24:16,677 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=530086.6666666666, ans=0.09899494936611666 2023-10-11 00:24:20,326 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=530086.6666666666, ans=0.125 2023-10-11 00:24:22,130 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=530086.6666666666, ans=0.1 2023-10-11 00:24:30,102 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.84 vs. 
limit=15.0 2023-10-11 00:24:34,903 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.376e+02 1.713e+02 1.889e+02 2.110e+02 2.902e+02, threshold=3.778e+02, percent-clipped=0.0 2023-10-11 00:24:53,192 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=530226.6666666666, ans=0.125 2023-10-11 00:24:56,801 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=530226.6666666666, ans=0.07 2023-10-11 00:25:12,099 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=530320.0, ans=0.1 2023-10-11 00:25:21,692 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=530366.6666666666, ans=0.1 2023-10-11 00:25:29,256 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.73 vs. limit=22.5 2023-10-11 00:25:29,768 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=530366.6666666666, ans=0.0 2023-10-11 00:25:40,386 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=530413.3333333334, ans=0.125 2023-10-11 00:25:41,618 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.71 vs. limit=15.0 2023-10-11 00:26:36,947 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.458e+02 1.735e+02 1.973e+02 2.241e+02 3.564e+02, threshold=3.946e+02, percent-clipped=0.0 2023-10-11 00:26:40,659 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=530646.6666666666, ans=0.1 2023-10-11 00:26:46,288 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=530646.6666666666, ans=0.125 2023-10-11 00:26:47,252 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=530646.6666666666, ans=0.125 2023-10-11 00:27:08,198 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=530740.0, ans=0.125 2023-10-11 00:27:15,078 INFO [train.py:1031] (0/4) Epoch 9, batch 4500, loss[loss=0.1945, simple_loss=0.2834, pruned_loss=0.05277, over 15643.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.297, pruned_loss=0.06174, over 29333032.73 frames. ], batch size: 35, lr: 4.07e-03, grad_scale: 32.0 2023-10-11 00:27:21,186 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=530786.6666666666, ans=0.1 2023-10-11 00:27:38,126 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.65 vs. 
limit=15.0 2023-10-11 00:28:15,603 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=531020.0, ans=0.125 2023-10-11 00:28:18,369 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=531020.0, ans=0.1 2023-10-11 00:28:26,732 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.336e+02 1.636e+02 1.778e+02 2.064e+02 2.984e+02, threshold=3.556e+02, percent-clipped=0.0 2023-10-11 00:28:35,763 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=531113.3333333334, ans=0.125 2023-10-11 00:28:55,801 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=531206.6666666666, ans=0.125 2023-10-11 00:29:04,830 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=531253.3333333334, ans=0.0 2023-10-11 00:29:06,926 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=531253.3333333334, ans=0.125 2023-10-11 00:29:14,547 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=531300.0, ans=0.0 2023-10-11 00:29:22,007 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=531300.0, ans=0.2 2023-10-11 00:29:23,289 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-10-11 00:29:35,159 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=531393.3333333334, ans=0.0 2023-10-11 00:29:36,300 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.31 vs. limit=15.0 2023-10-11 00:29:40,018 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=531393.3333333334, ans=0.125 2023-10-11 00:29:55,717 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.66 vs. 
limit=22.5 2023-10-11 00:30:07,190 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=531533.3333333334, ans=0.2 2023-10-11 00:30:13,862 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.471e+02 1.753e+02 1.976e+02 2.319e+02 3.156e+02, threshold=3.952e+02, percent-clipped=0.0 2023-10-11 00:30:14,105 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=531533.3333333334, ans=0.2 2023-10-11 00:30:30,775 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=531626.6666666666, ans=0.1 2023-10-11 00:30:34,076 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=531626.6666666666, ans=0.2 2023-10-11 00:30:34,982 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=531626.6666666666, ans=0.125 2023-10-11 00:30:47,361 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=531673.3333333334, ans=0.125 2023-10-11 00:30:49,847 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=531720.0, ans=10.0 2023-10-11 00:30:58,604 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=531720.0, ans=0.2 2023-10-11 00:31:04,576 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=531766.6666666666, ans=0.2 2023-10-11 00:31:28,635 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=531860.0, ans=0.125 2023-10-11 00:31:29,511 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=531860.0, ans=0.125 2023-10-11 00:31:58,883 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=532000.0, ans=0.2 2023-10-11 00:32:01,923 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.354e+02 1.670e+02 1.842e+02 2.034e+02 4.194e+02, threshold=3.684e+02, percent-clipped=1.0 2023-10-11 00:32:16,543 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=532093.3333333334, ans=0.0 2023-10-11 00:32:16,905 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.92 vs. 
limit=15.0 2023-10-11 00:32:22,053 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=532093.3333333334, ans=0.5 2023-10-11 00:32:30,105 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=532140.0, ans=0.0 2023-10-11 00:32:43,196 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=532186.6666666666, ans=0.125 2023-10-11 00:33:13,389 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=532326.6666666666, ans=0.125 2023-10-11 00:33:17,742 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=532326.6666666666, ans=0.0 2023-10-11 00:33:41,671 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=532420.0, ans=0.0 2023-10-11 00:33:45,089 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=532420.0, ans=0.035 2023-10-11 00:33:46,143 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=532466.6666666666, ans=0.125 2023-10-11 00:33:55,990 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.368e+02 1.647e+02 1.780e+02 2.048e+02 3.069e+02, threshold=3.560e+02, percent-clipped=0.0 2023-10-11 00:34:10,781 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=532560.0, ans=0.0 2023-10-11 00:34:37,295 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.04 vs. limit=12.0 2023-10-11 00:35:10,923 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.25 vs. limit=15.0 2023-10-11 00:35:20,139 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=532793.3333333334, ans=0.0 2023-10-11 00:35:47,465 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.52 vs. limit=6.0 2023-10-11 00:35:52,298 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.371e+02 1.734e+02 1.925e+02 2.248e+02 3.988e+02, threshold=3.851e+02, percent-clipped=1.0 2023-10-11 00:35:57,073 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=532980.0, ans=0.2 2023-10-11 00:36:27,511 INFO [train.py:1031] (0/4) Epoch 9, batch 5000, loss[loss=0.2065, simple_loss=0.297, pruned_loss=0.05803, over 16948.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2968, pruned_loss=0.0618, over 30111490.78 frames. ], batch size: 123, lr: 4.06e-03, grad_scale: 32.0 2023-10-11 00:36:40,916 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=533166.6666666666, ans=0.1 2023-10-11 00:36:47,547 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=533166.6666666666, ans=0.125 2023-10-11 00:36:47,788 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.80 vs. 
limit=12.0 2023-10-11 00:36:56,298 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=533213.3333333334, ans=0.125 2023-10-11 00:36:57,288 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=533213.3333333334, ans=0.125 2023-10-11 00:37:01,965 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.77 vs. limit=22.5 2023-10-11 00:37:11,823 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=533306.6666666666, ans=0.125 2023-10-11 00:37:32,522 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=533400.0, ans=0.0 2023-10-11 00:37:36,390 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=533400.0, ans=0.125 2023-10-11 00:37:41,314 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=533400.0, ans=0.2 2023-10-11 00:37:41,322 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=533400.0, ans=0.125 2023-10-11 00:37:43,482 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 1.751e+02 2.057e+02 2.441e+02 4.127e+02, threshold=4.115e+02, percent-clipped=2.0 2023-10-11 00:37:46,924 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0 2023-10-11 00:38:00,326 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.50 vs. limit=10.0 2023-10-11 00:38:25,033 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=533586.6666666666, ans=0.0 2023-10-11 00:38:39,841 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.55 vs. limit=12.0 2023-10-11 00:39:10,472 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=533773.3333333334, ans=0.125 2023-10-11 00:39:11,505 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=533773.3333333334, ans=0.09899494936611666 2023-10-11 00:39:14,236 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=533773.3333333334, ans=0.125 2023-10-11 00:39:18,934 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=533820.0, ans=0.125 2023-10-11 00:39:24,799 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=533820.0, ans=0.125 2023-10-11 00:39:31,185 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.67 vs. 
limit=15.0 2023-10-11 00:39:36,728 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.384e+02 1.690e+02 1.887e+02 2.069e+02 2.882e+02, threshold=3.773e+02, percent-clipped=0.0 2023-10-11 00:39:42,476 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 00:39:42,655 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.94 vs. limit=15.0 2023-10-11 00:39:54,434 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=533960.0, ans=0.04949747468305833 2023-10-11 00:40:15,976 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.98 vs. limit=22.5 2023-10-11 00:40:23,655 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.47 vs. limit=12.0 2023-10-11 00:40:23,696 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=534100.0, ans=10.0 2023-10-11 00:40:29,888 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=534100.0, ans=0.0 2023-10-11 00:40:35,497 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=534146.6666666666, ans=0.125 2023-10-11 00:40:43,877 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.45 vs. limit=15.0 2023-10-11 00:40:51,675 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.19 vs. limit=15.0 2023-10-11 00:40:54,709 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=534240.0, ans=0.125 2023-10-11 00:41:30,844 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.339e+02 1.671e+02 1.836e+02 2.065e+02 2.968e+02, threshold=3.673e+02, percent-clipped=0.0 2023-10-11 00:41:31,114 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 00:41:46,898 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=534426.6666666666, ans=0.0 2023-10-11 00:42:37,705 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=534660.0, ans=0.0 2023-10-11 00:42:51,311 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=534660.0, ans=0.125 2023-10-11 00:43:00,752 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.10 vs. 
limit=22.5 2023-10-11 00:43:08,867 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=534753.3333333334, ans=0.0 2023-10-11 00:43:21,913 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=534800.0, ans=0.125 2023-10-11 00:43:24,311 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.383e+02 1.584e+02 1.697e+02 1.936e+02 2.768e+02, threshold=3.394e+02, percent-clipped=0.0 2023-10-11 00:43:35,812 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.17 vs. limit=15.0 2023-10-11 00:43:35,939 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.19 vs. limit=12.0 2023-10-11 00:43:38,048 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=534893.3333333334, ans=0.2 2023-10-11 00:43:58,928 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=534986.6666666666, ans=0.0 2023-10-11 00:43:59,017 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=534986.6666666666, ans=0.0 2023-10-11 00:44:10,291 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=535033.3333333334, ans=0.125 2023-10-11 00:44:18,756 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=535080.0, ans=0.125 2023-10-11 00:44:28,616 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=535080.0, ans=0.125 2023-10-11 00:44:34,963 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=535126.6666666666, ans=0.125 2023-10-11 00:44:35,971 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=535126.6666666666, ans=0.0 2023-10-11 00:44:47,047 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=535173.3333333334, ans=0.125 2023-10-11 00:44:53,019 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=535220.0, ans=0.125 2023-10-11 00:44:55,533 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=535220.0, ans=0.1 2023-10-11 00:45:06,097 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=535266.6666666666, ans=0.125 2023-10-11 00:45:11,282 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=535266.6666666666, ans=0.125 2023-10-11 00:45:14,087 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=535266.6666666666, ans=0.2 2023-10-11 00:45:14,608 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.357e+02 1.679e+02 1.959e+02 2.239e+02 3.797e+02, threshold=3.917e+02, percent-clipped=2.0 2023-10-11 00:45:15,512 INFO [scaling.py:979] (0/4) Whitening: 
name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.96 vs. limit=22.5 2023-10-11 00:45:16,353 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=535313.3333333334, ans=0.1 2023-10-11 00:45:24,511 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=535313.3333333334, ans=0.0 2023-10-11 00:45:24,920 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.01 vs. limit=22.5 2023-10-11 00:45:33,042 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=535360.0, ans=0.125 2023-10-11 00:45:35,862 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=535406.6666666666, ans=0.125 2023-10-11 00:45:47,517 INFO [train.py:1031] (0/4) Epoch 9, batch 5500, loss[loss=0.1904, simple_loss=0.2827, pruned_loss=0.04904, over 16842.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2965, pruned_loss=0.06159, over 30695867.12 frames. ], batch size: 98, lr: 4.05e-03, grad_scale: 16.0 2023-10-11 00:45:59,311 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=535500.0, ans=0.0 2023-10-11 00:45:59,699 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.24 vs. limit=10.0 2023-10-11 00:46:01,766 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=535500.0, ans=0.04949747468305833 2023-10-11 00:46:05,604 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=535500.0, ans=0.125 2023-10-11 00:46:16,216 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=535546.6666666666, ans=0.2 2023-10-11 00:46:20,597 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=535593.3333333334, ans=0.125 2023-10-11 00:46:22,416 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.81 vs. 
limit=15.0 2023-10-11 00:46:36,610 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=535640.0, ans=0.125 2023-10-11 00:46:42,196 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=535686.6666666666, ans=0.125 2023-10-11 00:47:03,982 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.341e+02 1.583e+02 1.694e+02 1.878e+02 3.011e+02, threshold=3.388e+02, percent-clipped=0.0 2023-10-11 00:47:10,069 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=535780.0, ans=0.125 2023-10-11 00:47:33,184 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=535873.3333333334, ans=0.0 2023-10-11 00:48:08,676 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=536013.3333333334, ans=0.125 2023-10-11 00:48:26,737 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=536106.6666666666, ans=0.125 2023-10-11 00:48:47,858 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=536200.0, ans=0.125 2023-10-11 00:48:51,038 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=536200.0, ans=0.015 2023-10-11 00:48:54,708 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=536200.0, ans=0.5 2023-10-11 00:48:56,351 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.780e+02 1.998e+02 2.427e+02 3.170e+02, threshold=3.995e+02, percent-clipped=0.0 2023-10-11 00:49:44,246 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.00 vs. limit=6.0 2023-10-11 00:50:02,952 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.46 vs. limit=15.0 2023-10-11 00:50:11,125 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.48 vs. limit=22.5 2023-10-11 00:50:15,981 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_ff2.min_abs, batch_count=536573.3333333334, ans=0.1 2023-10-11 00:50:50,411 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.441e+02 1.652e+02 1.816e+02 2.002e+02 2.861e+02, threshold=3.633e+02, percent-clipped=0.0 2023-10-11 00:51:19,539 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=536806.6666666666, ans=0.125 2023-10-11 00:51:41,471 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=536900.0, ans=0.125 2023-10-11 00:51:50,104 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.98 vs. 
limit=15.0 2023-10-11 00:51:51,770 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=536946.6666666666, ans=0.0 2023-10-11 00:51:51,802 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=536946.6666666666, ans=0.1 2023-10-11 00:52:00,790 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=536993.3333333334, ans=0.015 2023-10-11 00:52:08,956 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=536993.3333333334, ans=0.125 2023-10-11 00:52:14,958 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=537040.0, ans=0.0 2023-10-11 00:52:20,415 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=537040.0, ans=0.125 2023-10-11 00:52:27,108 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.03 vs. limit=15.0 2023-10-11 00:52:47,262 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.352e+02 1.661e+02 1.841e+02 2.063e+02 2.970e+02, threshold=3.683e+02, percent-clipped=0.0 2023-10-11 00:52:48,980 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=537180.0, ans=0.1 2023-10-11 00:53:03,637 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1.whitening_limit, batch_count=537226.6666666666, ans=10.0 2023-10-11 00:53:23,265 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=537320.0, ans=0.125 2023-10-11 00:53:35,007 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=537366.6666666666, ans=0.125 2023-10-11 00:53:49,476 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=537413.3333333334, ans=0.07 2023-10-11 00:54:11,923 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.15 vs. limit=15.0 2023-10-11 00:54:26,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=537553.3333333334, ans=0.035 2023-10-11 00:54:39,810 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.398e+02 1.684e+02 1.834e+02 2.014e+02 3.437e+02, threshold=3.669e+02, percent-clipped=0.0 2023-10-11 00:55:04,370 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=537740.0, ans=0.04949747468305833 2023-10-11 00:55:15,385 INFO [train.py:1031] (0/4) Epoch 9, batch 6000, loss[loss=0.2232, simple_loss=0.308, pruned_loss=0.0692, over 16901.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2967, pruned_loss=0.06177, over 31133616.73 frames. ], batch size: 116, lr: 4.04e-03, grad_scale: 32.0 2023-10-11 00:55:32,144 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.70 vs. 
limit=6.0 2023-10-11 00:55:38,844 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=537880.0, ans=0.125 2023-10-11 00:55:38,865 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=537880.0, ans=0.0 2023-10-11 00:55:39,874 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=537880.0, ans=0.125 2023-10-11 00:56:02,967 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=537973.3333333334, ans=0.0 2023-10-11 00:56:33,032 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=538066.6666666666, ans=0.125 2023-10-11 00:56:35,135 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.327e+02 1.745e+02 1.985e+02 2.243e+02 3.608e+02, threshold=3.969e+02, percent-clipped=0.0 2023-10-11 00:56:47,341 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.28 vs. limit=22.5 2023-10-11 00:56:56,517 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 00:57:06,445 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=538253.3333333334, ans=0.125 2023-10-11 00:58:20,877 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.637e+02 1.787e+02 2.010e+02 2.623e+02, threshold=3.574e+02, percent-clipped=0.0 2023-10-11 00:58:31,922 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=538580.0, ans=0.125 2023-10-11 00:58:43,449 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.61 vs. 
limit=10.0 2023-10-11 00:58:48,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=538673.3333333334, ans=0.2 2023-10-11 00:58:52,804 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=538673.3333333334, ans=0.125 2023-10-11 00:59:03,830 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=538720.0, ans=0.125 2023-10-11 00:59:24,117 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=538813.3333333334, ans=0.0 2023-10-11 00:59:31,418 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=538860.0, ans=0.125 2023-10-11 00:59:41,937 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=538906.6666666666, ans=0.0 2023-10-11 01:00:12,322 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.472e+02 1.783e+02 1.870e+02 2.083e+02 2.738e+02, threshold=3.739e+02, percent-clipped=0.0 2023-10-11 01:00:12,926 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=539046.6666666666, ans=0.05 2023-10-11 01:00:44,214 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 01:00:51,371 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=539186.6666666666, ans=0.125 2023-10-11 01:00:57,526 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.78 vs. limit=22.5 2023-10-11 01:01:00,007 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.06 vs. limit=15.0 2023-10-11 01:01:04,134 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.21 vs. limit=22.5 2023-10-11 01:01:07,579 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.34 vs. 
limit=15.0 2023-10-11 01:01:34,579 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=539373.3333333334, ans=0.07 2023-10-11 01:01:56,798 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=539466.6666666666, ans=0.125 2023-10-11 01:02:12,016 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.711e+02 1.934e+02 2.190e+02 3.211e+02, threshold=3.868e+02, percent-clipped=0.0 2023-10-11 01:02:35,306 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=539606.6666666666, ans=0.125 2023-10-11 01:02:44,187 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=539606.6666666666, ans=0.2 2023-10-11 01:02:45,297 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=539606.6666666666, ans=0.125 2023-10-11 01:02:46,340 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=539606.6666666666, ans=0.125 2023-10-11 01:02:48,416 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 01:02:55,475 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=539653.3333333334, ans=0.125 2023-10-11 01:03:22,144 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=539793.3333333334, ans=0.125 2023-10-11 01:03:24,432 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=539793.3333333334, ans=0.125 2023-10-11 01:03:31,962 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=539840.0, ans=0.125 2023-10-11 01:03:35,097 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.04 vs. limit=22.5 2023-10-11 01:03:36,866 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.02 vs. limit=15.0 2023-10-11 01:03:38,645 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0 2023-10-11 01:03:42,164 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=539886.6666666666, ans=0.0 2023-10-11 01:03:49,851 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=539886.6666666666, ans=0.1 2023-10-11 01:03:57,784 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=539933.3333333334, ans=0.1 2023-10-11 01:04:03,186 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.58 vs. 
limit=22.5 2023-10-11 01:04:03,411 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.293e+02 1.699e+02 1.867e+02 2.102e+02 2.991e+02, threshold=3.734e+02, percent-clipped=0.0 2023-10-11 01:04:17,616 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=540026.6666666666, ans=0.125 2023-10-11 01:04:38,223 INFO [train.py:1031] (0/4) Epoch 9, batch 6500, loss[loss=0.2249, simple_loss=0.3092, pruned_loss=0.0703, over 15984.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2971, pruned_loss=0.06188, over 31499240.57 frames. ], batch size: 43, lr: 4.03e-03, grad_scale: 32.0 2023-10-11 01:05:25,523 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=540260.0, ans=0.2 2023-10-11 01:05:35,005 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.69 vs. limit=12.0 2023-10-11 01:05:44,252 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=540353.3333333334, ans=0.0 2023-10-11 01:06:08,472 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.428e+02 1.718e+02 1.914e+02 2.137e+02 2.962e+02, threshold=3.828e+02, percent-clipped=0.0 2023-10-11 01:06:13,074 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.92 vs. limit=15.0 2023-10-11 01:06:23,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=540493.3333333334, ans=0.125 2023-10-11 01:06:44,998 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=540586.6666666666, ans=0.0 2023-10-11 01:07:05,763 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=540680.0, ans=0.0 2023-10-11 01:07:26,023 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=540773.3333333334, ans=0.125 2023-10-11 01:07:59,548 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.384e+02 1.691e+02 1.919e+02 2.307e+02 3.643e+02, threshold=3.837e+02, percent-clipped=0.0 2023-10-11 01:08:02,671 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=540913.3333333334, ans=0.125 2023-10-11 01:08:08,877 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.51 vs. limit=12.0 2023-10-11 01:08:23,099 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.40 vs. 
limit=5.0 2023-10-11 01:08:45,618 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=541100.0, ans=0.125 2023-10-11 01:08:46,625 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=541100.0, ans=0.125 2023-10-11 01:09:22,568 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=541240.0, ans=0.125 2023-10-11 01:09:29,503 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.05 vs. limit=15.0 2023-10-11 01:09:46,337 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=541333.3333333334, ans=0.125 2023-10-11 01:09:51,865 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.371e+02 1.618e+02 1.768e+02 1.941e+02 2.761e+02, threshold=3.535e+02, percent-clipped=0.0 2023-10-11 01:09:55,000 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=541380.0, ans=0.125 2023-10-11 01:09:59,893 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=541380.0, ans=0.5 2023-10-11 01:10:10,795 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=541426.6666666666, ans=0.125 2023-10-11 01:10:43,918 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=541520.0, ans=0.125 2023-10-11 01:10:43,966 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=541520.0, ans=0.0 2023-10-11 01:11:05,082 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=541613.3333333334, ans=0.0 2023-10-11 01:11:43,377 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=541753.3333333334, ans=0.125 2023-10-11 01:11:52,687 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=541800.0, ans=0.2 2023-10-11 01:12:02,781 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.360e+02 1.682e+02 1.850e+02 2.528e+02 4.430e+02, threshold=3.701e+02, percent-clipped=5.0 2023-10-11 01:12:07,823 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=541846.6666666666, ans=0.1 2023-10-11 01:12:18,675 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=541893.3333333334, ans=0.2 2023-10-11 01:12:25,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=541940.0, ans=0.125 2023-10-11 01:12:30,621 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.14 vs. 
limit=22.5 2023-10-11 01:12:40,658 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=541986.6666666666, ans=0.09899494936611666 2023-10-11 01:12:47,023 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=542033.3333333334, ans=0.0 2023-10-11 01:13:05,141 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=542080.0, ans=0.125 2023-10-11 01:13:08,833 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.40 vs. limit=22.5 2023-10-11 01:13:19,226 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=5.19 vs. limit=12.0 2023-10-11 01:13:22,507 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=542173.3333333334, ans=0.2 2023-10-11 01:13:40,653 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.80 vs. limit=22.5 2023-10-11 01:13:42,590 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=542266.6666666666, ans=0.0 2023-10-11 01:13:52,439 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=542266.6666666666, ans=15.0 2023-10-11 01:13:54,610 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.369e+02 1.631e+02 1.824e+02 1.992e+02 2.818e+02, threshold=3.648e+02, percent-clipped=0.0 2023-10-11 01:14:02,671 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.50 vs. limit=6.0 2023-10-11 01:14:06,218 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.952e-02 2023-10-11 01:14:12,285 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=542360.0, ans=0.1 2023-10-11 01:14:15,501 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=542406.6666666666, ans=0.0 2023-10-11 01:14:21,644 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.87 vs. limit=15.0 2023-10-11 01:14:23,269 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=542406.6666666666, ans=0.125 2023-10-11 01:14:25,712 INFO [train.py:1031] (0/4) Epoch 9, batch 7000, loss[loss=0.227, simple_loss=0.307, pruned_loss=0.07351, over 16933.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2976, pruned_loss=0.06174, over 31792243.66 frames. 
], batch size: 130, lr: 4.02e-03, grad_scale: 32.0 2023-10-11 01:14:36,999 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=542500.0, ans=0.2 2023-10-11 01:14:38,968 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=542500.0, ans=0.125 2023-10-11 01:14:53,624 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=542546.6666666666, ans=0.0 2023-10-11 01:14:53,710 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=542546.6666666666, ans=0.125 2023-10-11 01:15:10,060 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=542593.3333333334, ans=0.125 2023-10-11 01:15:21,802 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=542640.0, ans=0.125 2023-10-11 01:15:46,084 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=542733.3333333334, ans=0.125 2023-10-11 01:15:50,751 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.742e+02 1.936e+02 2.147e+02 2.999e+02, threshold=3.873e+02, percent-clipped=0.0 2023-10-11 01:16:01,568 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=542826.6666666666, ans=0.0 2023-10-11 01:16:03,993 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=542826.6666666666, ans=0.125 2023-10-11 01:16:13,892 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=542873.3333333334, ans=0.125 2023-10-11 01:16:30,438 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=542920.0, ans=0.125 2023-10-11 01:16:38,959 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.94 vs. limit=6.0 2023-10-11 01:16:57,491 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=543013.3333333334, ans=0.0 2023-10-11 01:17:11,073 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=543106.6666666666, ans=0.125 2023-10-11 01:17:23,556 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.25 vs. 
limit=22.5 2023-10-11 01:17:34,197 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=543200.0, ans=0.1 2023-10-11 01:17:39,368 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=543200.0, ans=0.0 2023-10-11 01:17:39,372 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=543200.0, ans=0.0 2023-10-11 01:17:42,941 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=543246.6666666666, ans=0.0 2023-10-11 01:17:44,486 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=543246.6666666666, ans=0.5 2023-10-11 01:17:45,062 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.375e+02 1.725e+02 1.902e+02 2.077e+02 3.151e+02, threshold=3.803e+02, percent-clipped=0.0 2023-10-11 01:18:06,539 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=543340.0, ans=0.125 2023-10-11 01:18:07,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=543340.0, ans=0.1 2023-10-11 01:18:19,560 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.59 vs. limit=15.0 2023-10-11 01:18:23,087 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=543386.6666666666, ans=0.0 2023-10-11 01:18:25,690 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=543386.6666666666, ans=0.05 2023-10-11 01:18:27,992 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.63 vs. limit=15.0 2023-10-11 01:18:44,909 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.86 vs. limit=15.0 2023-10-11 01:18:55,790 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=543480.0, ans=0.125 2023-10-11 01:19:20,800 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.19 vs. limit=10.0 2023-10-11 01:19:33,098 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=543620.0, ans=0.125 2023-10-11 01:19:34,447 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.76 vs. 
limit=6.0 2023-10-11 01:19:55,818 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.375e+02 1.672e+02 1.862e+02 2.136e+02 3.132e+02, threshold=3.723e+02, percent-clipped=0.0 2023-10-11 01:19:58,918 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=543713.3333333334, ans=0.0 2023-10-11 01:20:00,121 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=543713.3333333334, ans=0.125 2023-10-11 01:20:22,683 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.70 vs. limit=10.0 2023-10-11 01:20:33,373 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=543853.3333333334, ans=0.2 2023-10-11 01:20:47,105 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=543900.0, ans=0.0 2023-10-11 01:21:27,277 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=544086.6666666666, ans=0.1 2023-10-11 01:21:27,565 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=544086.6666666666, ans=15.0 2023-10-11 01:21:35,276 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=544086.6666666666, ans=0.1 2023-10-11 01:21:49,930 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.214e+02 1.640e+02 1.778e+02 2.042e+02 2.909e+02, threshold=3.557e+02, percent-clipped=0.0 2023-10-11 01:21:51,415 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=544180.0, ans=0.125 2023-10-11 01:21:52,322 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=544180.0, ans=0.0 2023-10-11 01:22:01,669 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=544226.6666666666, ans=0.0 2023-10-11 01:22:08,927 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=544226.6666666666, ans=0.125 2023-10-11 01:22:21,623 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten.whitening_limit, batch_count=544273.3333333334, ans=15.0 2023-10-11 01:22:30,114 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=544320.0, ans=0.0 2023-10-11 01:22:35,876 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=544366.6666666666, ans=0.125 2023-10-11 01:22:43,685 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.42 vs. 
limit=15.0 2023-10-11 01:22:45,493 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=544413.3333333334, ans=0.1 2023-10-11 01:22:52,618 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=544413.3333333334, ans=0.1 2023-10-11 01:23:00,255 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 01:23:06,595 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.57 vs. limit=15.0 2023-10-11 01:23:15,989 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.35 vs. limit=12.0 2023-10-11 01:23:34,510 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=544600.0, ans=0.0 2023-10-11 01:23:37,055 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.17 vs. limit=15.0 2023-10-11 01:23:40,293 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.422e+02 1.790e+02 1.905e+02 2.110e+02 4.515e+02, threshold=3.809e+02, percent-clipped=1.0 2023-10-11 01:23:49,992 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.46 vs. limit=15.0 2023-10-11 01:24:13,463 INFO [train.py:1031] (0/4) Epoch 9, batch 7500, loss[loss=0.2012, simple_loss=0.2857, pruned_loss=0.05839, over 16653.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2973, pruned_loss=0.06167, over 32008955.67 frames. ], batch size: 61, lr: 4.01e-03, grad_scale: 32.0 2023-10-11 01:24:47,794 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=544926.6666666666, ans=0.125 2023-10-11 01:24:54,389 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=544926.6666666666, ans=0.0 2023-10-11 01:25:03,968 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=544973.3333333334, ans=0.125 2023-10-11 01:25:27,666 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=545066.6666666666, ans=0.125 2023-10-11 01:25:30,745 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=545066.6666666666, ans=0.125 2023-10-11 01:25:33,455 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.735e+02 1.937e+02 2.293e+02 3.427e+02, threshold=3.873e+02, percent-clipped=0.0 2023-10-11 01:25:40,064 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=545113.3333333334, ans=0.125 2023-10-11 01:25:40,196 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.48 vs. 
limit=15.0 2023-10-11 01:25:48,118 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=545160.0, ans=0.1 2023-10-11 01:25:48,180 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=545160.0, ans=0.0 2023-10-11 01:25:49,264 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=545160.0, ans=0.2 2023-10-11 01:25:51,583 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=545160.0, ans=0.1 2023-10-11 01:25:55,872 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 01:26:22,246 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=545300.0, ans=0.0 2023-10-11 01:26:37,248 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=545346.6666666666, ans=0.1 2023-10-11 01:26:58,162 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=545393.3333333334, ans=0.125 2023-10-11 01:27:06,770 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=545440.0, ans=0.125 2023-10-11 01:27:08,553 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=545440.0, ans=0.1 2023-10-11 01:27:36,543 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.646e+02 1.821e+02 2.108e+02 2.790e+02, threshold=3.641e+02, percent-clipped=0.0 2023-10-11 01:27:36,844 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=545580.0, ans=0.0 2023-10-11 01:28:01,497 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.11 vs. limit=15.0 2023-10-11 01:28:14,238 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=545720.0, ans=0.125 2023-10-11 01:28:17,758 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=15.0 2023-10-11 01:28:29,556 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=545813.3333333334, ans=0.1 2023-10-11 01:28:30,875 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.11 vs. 
limit=15.0 2023-10-11 01:28:51,719 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=545906.6666666666, ans=0.125 2023-10-11 01:29:09,203 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=545953.3333333334, ans=0.125 2023-10-11 01:29:12,917 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=546000.0, ans=0.125 2023-10-11 01:29:25,976 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.351e+02 1.680e+02 1.830e+02 2.088e+02 3.460e+02, threshold=3.661e+02, percent-clipped=0.0 2023-10-11 01:29:58,039 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=546186.6666666666, ans=0.035 2023-10-11 01:30:05,254 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.58 vs. limit=10.0 2023-10-11 01:30:12,671 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=5.226e-03 2023-10-11 01:30:58,442 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=546420.0, ans=0.125 2023-10-11 01:31:10,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=546466.6666666666, ans=0.125 2023-10-11 01:31:12,970 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=546466.6666666666, ans=0.125 2023-10-11 01:31:13,824 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=546466.6666666666, ans=0.125 2023-10-11 01:31:14,240 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.60 vs. limit=15.0 2023-10-11 01:31:20,962 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.733e+02 1.916e+02 2.079e+02 2.951e+02, threshold=3.832e+02, percent-clipped=0.0 2023-10-11 01:31:41,579 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.61 vs. 
limit=15.0 2023-10-11 01:31:52,163 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=546653.3333333334, ans=0.0 2023-10-11 01:31:53,775 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=546653.3333333334, ans=0.1 2023-10-11 01:32:00,387 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=546653.3333333334, ans=0.07 2023-10-11 01:32:19,679 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=546746.6666666666, ans=0.125 2023-10-11 01:33:03,518 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=546933.3333333334, ans=0.09899494936611666 2023-10-11 01:33:11,126 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=546933.3333333334, ans=0.125 2023-10-11 01:33:17,719 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.326e+02 1.671e+02 1.915e+02 2.095e+02 2.809e+02, threshold=3.830e+02, percent-clipped=0.0 2023-10-11 01:33:23,234 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=546980.0, ans=0.0 2023-10-11 01:33:46,914 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.44 vs. limit=15.0 2023-10-11 01:33:49,998 INFO [train.py:1031] (0/4) Epoch 9, batch 8000, loss[loss=0.1902, simple_loss=0.2789, pruned_loss=0.05072, over 16442.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2966, pruned_loss=0.06097, over 32193250.96 frames. ], batch size: 50, lr: 4.00e-03, grad_scale: 32.0 2023-10-11 01:34:12,851 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.36 vs. limit=15.0 2023-10-11 01:34:20,285 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=547213.3333333334, ans=0.125 2023-10-11 01:34:28,059 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0 2023-10-11 01:34:30,110 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.13 vs. limit=12.0 2023-10-11 01:34:46,786 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=547353.3333333334, ans=0.0 2023-10-11 01:34:54,940 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=547400.0, ans=0.125 2023-10-11 01:34:57,017 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.76 vs. limit=10.0 2023-10-11 01:35:01,577 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.60 vs. 
limit=10.0 2023-10-11 01:35:05,077 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=547446.6666666666, ans=0.125 2023-10-11 01:35:06,644 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.270e+02 1.632e+02 1.817e+02 2.096e+02 2.954e+02, threshold=3.634e+02, percent-clipped=0.0 2023-10-11 01:35:11,372 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.52 vs. limit=10.0 2023-10-11 01:35:11,929 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=547446.6666666666, ans=0.1 2023-10-11 01:35:16,054 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.38 vs. limit=12.0 2023-10-11 01:35:16,075 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.07 vs. limit=10.0 2023-10-11 01:35:34,189 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.92 vs. limit=22.5 2023-10-11 01:35:50,919 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=547633.3333333334, ans=0.125 2023-10-11 01:35:51,685 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=547633.3333333334, ans=0.0 2023-10-11 01:36:49,306 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.68 vs. limit=15.0 2023-10-11 01:36:55,631 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=547866.6666666666, ans=0.0 2023-10-11 01:37:13,492 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.697e+02 1.834e+02 2.076e+02 3.082e+02, threshold=3.668e+02, percent-clipped=0.0 2023-10-11 01:37:14,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=547913.3333333334, ans=0.1 2023-10-11 01:37:30,024 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=547960.0, ans=0.07 2023-10-11 01:37:42,479 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=548006.6666666666, ans=0.125 2023-10-11 01:38:07,482 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=548100.0, ans=0.125 2023-10-11 01:38:07,525 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=548100.0, ans=0.125 2023-10-11 01:38:31,897 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=548193.3333333334, ans=0.2 2023-10-11 01:38:32,676 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=548193.3333333334, ans=0.1 2023-10-11 01:38:35,045 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.62 vs. 
limit=15.0 2023-10-11 01:38:35,134 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=15.0 2023-10-11 01:39:12,385 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.389e+02 1.774e+02 1.966e+02 2.237e+02 3.257e+02, threshold=3.932e+02, percent-clipped=0.0 2023-10-11 01:39:22,410 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=548426.6666666666, ans=0.125 2023-10-11 01:39:28,788 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=548426.6666666666, ans=0.1 2023-10-11 01:39:44,097 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=548473.3333333334, ans=0.0 2023-10-11 01:40:00,580 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=548566.6666666666, ans=0.0 2023-10-11 01:40:03,187 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.48 vs. limit=22.5 2023-10-11 01:40:05,838 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=548566.6666666666, ans=0.5 2023-10-11 01:40:07,636 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=548566.6666666666, ans=0.2 2023-10-11 01:40:48,528 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=548753.3333333334, ans=0.125 2023-10-11 01:40:55,002 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=548800.0, ans=0.125 2023-10-11 01:41:04,349 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.459e+02 1.690e+02 1.883e+02 2.031e+02 2.895e+02, threshold=3.766e+02, percent-clipped=0.0 2023-10-11 01:41:49,606 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=549033.3333333334, ans=0.0 2023-10-11 01:42:13,774 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=549126.6666666666, ans=0.125 2023-10-11 01:42:16,884 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=549126.6666666666, ans=0.125 2023-10-11 01:42:33,395 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=549173.3333333334, ans=0.09899494936611666 2023-10-11 01:42:40,854 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.max_positive, batch_count=549220.0, ans=0.95 2023-10-11 01:42:47,262 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.34 vs. 
limit=15.0 2023-10-11 01:42:58,947 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.368e+02 1.780e+02 1.969e+02 2.317e+02 3.856e+02, threshold=3.938e+02, percent-clipped=1.0 2023-10-11 01:43:01,066 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=549313.3333333334, ans=0.1 2023-10-11 01:43:15,039 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=549360.0, ans=0.05 2023-10-11 01:43:22,837 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=549406.6666666666, ans=0.2 2023-10-11 01:43:36,482 INFO [train.py:1031] (0/4) Epoch 9, batch 8500, loss[loss=0.2506, simple_loss=0.3232, pruned_loss=0.08906, over 16044.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2966, pruned_loss=0.06072, over 32324025.21 frames. ], batch size: 296, lr: 4.00e-03, grad_scale: 64.0 2023-10-11 01:43:38,143 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.75 vs. limit=15.0 2023-10-11 01:43:40,557 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=549453.3333333334, ans=0.125 2023-10-11 01:43:45,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=549453.3333333334, ans=0.0 2023-10-11 01:43:49,186 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=549500.0, ans=0.2 2023-10-11 01:43:49,579 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.04 vs. limit=15.0 2023-10-11 01:43:50,982 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=549500.0, ans=0.1 2023-10-11 01:44:04,532 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 01:44:04,961 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.03 vs. limit=12.0 2023-10-11 01:44:59,212 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.746e+02 1.893e+02 2.167e+02 3.169e+02, threshold=3.787e+02, percent-clipped=0.0 2023-10-11 01:45:02,470 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=549780.0, ans=0.1 2023-10-11 01:45:04,219 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.67 vs. limit=22.5 2023-10-11 01:45:25,278 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=549873.3333333334, ans=0.125 2023-10-11 01:45:28,286 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.09 vs. limit=15.0 2023-10-11 01:45:33,208 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.08 vs. 
limit=15.0 2023-10-11 01:45:42,882 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=549920.0, ans=0.05 2023-10-11 01:45:50,238 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=549966.6666666666, ans=0.125 2023-10-11 01:46:01,219 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=550013.3333333334, ans=0.0 2023-10-11 01:46:04,997 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=550013.3333333334, ans=0.0 2023-10-11 01:46:06,742 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=550013.3333333334, ans=0.125 2023-10-11 01:46:13,719 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=550060.0, ans=0.1 2023-10-11 01:46:38,238 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=550153.3333333334, ans=0.2 2023-10-11 01:46:38,936 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=550153.3333333334, ans=0.0 2023-10-11 01:46:56,256 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=550200.0, ans=0.0 2023-10-11 01:46:58,923 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=550200.0, ans=0.2 2023-10-11 01:47:02,663 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.345e+02 1.646e+02 1.777e+02 1.978e+02 2.733e+02, threshold=3.554e+02, percent-clipped=0.0 2023-10-11 01:47:19,473 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=550293.3333333334, ans=0.1 2023-10-11 01:47:20,608 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=550293.3333333334, ans=0.2 2023-10-11 01:47:21,642 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=550293.3333333334, ans=0.125 2023-10-11 01:47:21,947 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.15 vs. limit=22.5 2023-10-11 01:47:30,305 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=7.37 vs. 
limit=15.0 2023-10-11 01:47:45,880 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=7.629e-03 2023-10-11 01:47:50,632 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=550386.6666666666, ans=0.125 2023-10-11 01:48:09,716 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=550480.0, ans=0.1 2023-10-11 01:48:32,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=550573.3333333334, ans=0.0 2023-10-11 01:48:34,134 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 01:48:48,026 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=550620.0, ans=0.125 2023-10-11 01:49:10,124 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.370e+02 1.592e+02 1.763e+02 1.986e+02 2.897e+02, threshold=3.527e+02, percent-clipped=0.0 2023-10-11 01:49:26,932 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=550760.0, ans=0.125 2023-10-11 01:49:28,935 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=550760.0, ans=0.125 2023-10-11 01:49:59,674 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.47 vs. limit=12.0 2023-10-11 01:50:17,446 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=550993.3333333334, ans=0.125 2023-10-11 01:50:32,097 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-11 01:50:51,630 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=551133.3333333334, ans=0.1 2023-10-11 01:50:57,505 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.390e+02 1.805e+02 2.003e+02 2.407e+02 3.310e+02, threshold=4.006e+02, percent-clipped=0.0 2023-10-11 01:51:04,694 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=551180.0, ans=0.0 2023-10-11 01:51:29,710 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=551320.0, ans=0.125 2023-10-11 01:51:58,107 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.13 vs. limit=10.0 2023-10-11 01:52:38,542 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=551600.0, ans=0.125 2023-10-11 01:52:47,669 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.344e+02 1.702e+02 1.922e+02 2.157e+02 2.775e+02, threshold=3.844e+02, percent-clipped=0.0 2023-10-11 01:52:58,297 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.91 vs. 
limit=22.5 2023-10-11 01:53:19,788 INFO [train.py:1031] (0/4) Epoch 9, batch 9000, loss[loss=0.2125, simple_loss=0.2996, pruned_loss=0.06267, over 15643.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2959, pruned_loss=0.06055, over 32401723.05 frames. ], batch size: 36, lr: 3.99e-03, grad_scale: 64.0 2023-10-11 01:53:21,262 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.01 vs. limit=15.0 2023-10-11 01:53:28,563 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=551786.6666666666, ans=0.125 2023-10-11 01:53:37,972 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=551833.3333333334, ans=0.0 2023-10-11 01:53:58,522 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=551926.6666666666, ans=0.2 2023-10-11 01:54:05,835 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.min_positive, batch_count=551973.3333333334, ans=0.025 2023-10-11 01:54:19,893 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.76 vs. limit=22.5 2023-10-11 01:54:32,258 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=552066.6666666666, ans=0.125 2023-10-11 01:54:35,810 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.28 vs. limit=15.0 2023-10-11 01:54:37,372 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.374e+02 1.649e+02 1.808e+02 1.972e+02 2.800e+02, threshold=3.615e+02, percent-clipped=0.0 2023-10-11 01:54:38,068 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.57 vs. limit=12.0 2023-10-11 01:54:38,784 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=552113.3333333334, ans=0.125 2023-10-11 01:55:07,894 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.64 vs. limit=15.0 2023-10-11 01:55:09,629 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 01:55:10,751 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.26 vs. 
limit=12.0 2023-10-11 01:55:22,217 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=552300.0, ans=0.2 2023-10-11 01:55:23,335 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=552300.0, ans=0.04949747468305833 2023-10-11 01:55:28,576 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=552346.6666666666, ans=0.0 2023-10-11 01:55:33,541 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=552346.6666666666, ans=0.0 2023-10-11 01:55:43,183 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=552393.3333333334, ans=0.125 2023-10-11 01:56:24,938 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.812e+02 1.991e+02 2.323e+02 3.468e+02, threshold=3.981e+02, percent-clipped=0.0 2023-10-11 01:56:29,915 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.75 vs. limit=15.0 2023-10-11 01:56:34,157 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=552626.6666666666, ans=0.0 2023-10-11 01:56:35,791 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.70 vs. limit=22.5 2023-10-11 01:56:39,529 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.00 vs. limit=15.0 2023-10-11 01:56:56,197 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=552720.0, ans=0.1 2023-10-11 01:57:06,463 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=552720.0, ans=0.025 2023-10-11 01:57:16,457 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=552766.6666666666, ans=0.125 2023-10-11 01:57:19,815 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=552813.3333333334, ans=0.1 2023-10-11 01:57:19,974 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.08 vs. 
limit=15.0 2023-10-11 01:57:20,673 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=552813.3333333334, ans=0.2 2023-10-11 01:57:25,554 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=552813.3333333334, ans=0.125 2023-10-11 01:57:36,138 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=552860.0, ans=0.0 2023-10-11 01:58:13,096 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.309e+02 1.716e+02 1.880e+02 2.089e+02 2.681e+02, threshold=3.760e+02, percent-clipped=0.0 2023-10-11 01:58:16,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=553046.6666666666, ans=0.125 2023-10-11 01:59:35,277 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=553373.3333333334, ans=0.125 2023-10-11 01:59:36,276 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=553373.3333333334, ans=0.125 2023-10-11 01:59:49,113 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=553420.0, ans=0.0 2023-10-11 02:00:09,742 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=553513.3333333334, ans=0.0 2023-10-11 02:00:12,071 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 1.782e+02 1.935e+02 2.195e+02 3.246e+02, threshold=3.870e+02, percent-clipped=0.0 2023-10-11 02:00:50,171 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.82 vs. limit=22.5 2023-10-11 02:01:17,995 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=553793.3333333334, ans=0.125 2023-10-11 02:01:18,948 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=553793.3333333334, ans=0.5 2023-10-11 02:01:22,278 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=553793.3333333334, ans=0.125 2023-10-11 02:01:38,506 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=553840.0, ans=0.0 2023-10-11 02:01:40,470 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=553840.0, ans=0.0 2023-10-11 02:01:43,935 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.07 vs. 
limit=10.0 2023-10-11 02:02:05,959 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=553933.3333333334, ans=0.0 2023-10-11 02:02:11,552 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.389e+02 1.713e+02 1.908e+02 2.250e+02 3.753e+02, threshold=3.816e+02, percent-clipped=0.0 2023-10-11 02:02:13,130 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=553980.0, ans=0.0 2023-10-11 02:02:24,675 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=554026.6666666666, ans=0.2 2023-10-11 02:02:32,835 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.08 vs. limit=15.0 2023-10-11 02:02:43,129 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=554120.0, ans=0.1 2023-10-11 02:02:43,921 INFO [train.py:1031] (0/4) Epoch 9, batch 9500, loss[loss=0.2406, simple_loss=0.3172, pruned_loss=0.082, over 15635.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2968, pruned_loss=0.06104, over 32448875.94 frames. ], batch size: 350, lr: 3.98e-03, grad_scale: 16.0 2023-10-11 02:02:55,868 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=554166.6666666666, ans=0.0 2023-10-11 02:03:01,035 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=554166.6666666666, ans=0.1 2023-10-11 02:03:07,694 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.33 vs. limit=15.0 2023-10-11 02:03:08,133 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=554213.3333333334, ans=10.0 2023-10-11 02:03:26,580 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=554306.6666666666, ans=0.0 2023-10-11 02:03:41,155 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=554353.3333333334, ans=0.125 2023-10-11 02:03:51,579 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=554400.0, ans=0.0 2023-10-11 02:03:55,946 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.65 vs. limit=5.0 2023-10-11 02:04:01,392 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.35 vs. 
limit=15.0 2023-10-11 02:04:03,663 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.642e+02 1.783e+02 1.987e+02 2.928e+02, threshold=3.566e+02, percent-clipped=0.0 2023-10-11 02:04:05,055 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=554446.6666666666, ans=0.125 2023-10-11 02:04:13,391 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=554493.3333333334, ans=0.1 2023-10-11 02:04:32,085 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=554586.6666666666, ans=0.1 2023-10-11 02:04:53,290 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=554633.3333333334, ans=0.0 2023-10-11 02:04:56,227 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=554680.0, ans=0.125 2023-10-11 02:04:59,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=554680.0, ans=0.2 2023-10-11 02:05:17,220 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=554773.3333333334, ans=0.05 2023-10-11 02:05:19,104 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=554773.3333333334, ans=0.2 2023-10-11 02:05:22,328 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=554773.3333333334, ans=0.0 2023-10-11 02:05:36,000 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=554820.0, ans=0.1 2023-10-11 02:05:53,689 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=554913.3333333334, ans=0.125 2023-10-11 02:05:57,509 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.43 vs. 
limit=15.0 2023-10-11 02:05:57,847 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.438e+02 1.693e+02 1.868e+02 2.087e+02 2.830e+02, threshold=3.736e+02, percent-clipped=0.0 2023-10-11 02:06:03,988 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=554960.0, ans=0.125 2023-10-11 02:06:04,916 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=554960.0, ans=0.05 2023-10-11 02:06:11,753 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=554960.0, ans=0.0 2023-10-11 02:06:17,147 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=555006.6666666666, ans=0.0 2023-10-11 02:06:21,314 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=555006.6666666666, ans=0.2 2023-10-11 02:06:23,086 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=555006.6666666666, ans=0.2 2023-10-11 02:06:37,627 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=555100.0, ans=0.1 2023-10-11 02:07:05,618 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=555193.3333333334, ans=0.125 2023-10-11 02:07:24,059 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 02:07:47,432 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.652e+02 1.883e+02 2.236e+02 3.802e+02, threshold=3.766e+02, percent-clipped=1.0 2023-10-11 02:07:48,348 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=555380.0, ans=0.125 2023-10-11 02:08:28,477 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=555520.0, ans=10.0 2023-10-11 02:08:41,241 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=555566.6666666666, ans=0.0 2023-10-11 02:08:44,913 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.44 vs. limit=15.0 2023-10-11 02:08:46,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=555613.3333333334, ans=0.0 2023-10-11 02:08:56,188 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.68 vs. limit=12.0 2023-10-11 02:09:14,265 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.53 vs. 
limit=15.0 2023-10-11 02:09:24,269 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=555753.3333333334, ans=0.125 2023-10-11 02:09:32,156 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=555800.0, ans=0.0 2023-10-11 02:09:41,900 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.700e+02 1.941e+02 2.223e+02 3.027e+02, threshold=3.882e+02, percent-clipped=0.0 2023-10-11 02:09:51,197 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=555893.3333333334, ans=0.125 2023-10-11 02:10:04,987 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=555940.0, ans=0.125 2023-10-11 02:10:12,011 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=555986.6666666666, ans=0.02 2023-10-11 02:10:12,958 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=555986.6666666666, ans=0.1 2023-10-11 02:10:39,369 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=556080.0, ans=0.0 2023-10-11 02:10:46,187 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=556126.6666666666, ans=0.125 2023-10-11 02:10:50,872 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.81 vs. limit=15.0 2023-10-11 02:10:52,502 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=556126.6666666666, ans=0.1 2023-10-11 02:10:54,308 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=556173.3333333334, ans=0.125 2023-10-11 02:10:55,329 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=556173.3333333334, ans=0.125 2023-10-11 02:10:58,621 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=556173.3333333334, ans=0.125 2023-10-11 02:11:17,393 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=556266.6666666666, ans=0.0 2023-10-11 02:11:30,533 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.363e+02 1.718e+02 1.934e+02 2.146e+02 3.258e+02, threshold=3.867e+02, percent-clipped=0.0 2023-10-11 02:11:34,031 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.11 vs. limit=15.0 2023-10-11 02:11:58,989 INFO [train.py:1031] (0/4) Epoch 9, batch 10000, loss[loss=0.2892, simple_loss=0.3343, pruned_loss=0.1221, over 15563.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2959, pruned_loss=0.06075, over 32517492.64 frames. 
], batch size: 350, lr: 3.97e-03, grad_scale: 32.0 2023-10-11 02:11:59,384 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=556453.3333333334, ans=0.0 2023-10-11 02:12:03,752 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.37 vs. limit=15.0 2023-10-11 02:12:06,313 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.13 vs. limit=15.0 2023-10-11 02:12:11,079 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=556500.0, ans=0.2 2023-10-11 02:12:17,058 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=556500.0, ans=0.0 2023-10-11 02:12:22,061 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=556546.6666666666, ans=0.125 2023-10-11 02:12:37,903 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=556593.3333333334, ans=0.04949747468305833 2023-10-11 02:12:43,436 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=556640.0, ans=0.0 2023-10-11 02:12:44,941 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.94 vs. limit=10.0 2023-10-11 02:12:58,283 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=556686.6666666666, ans=0.1 2023-10-11 02:13:17,314 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.710e+02 1.863e+02 2.164e+02 3.346e+02, threshold=3.727e+02, percent-clipped=0.0 2023-10-11 02:13:25,006 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=556780.0, ans=0.125 2023-10-11 02:13:28,476 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=556826.6666666666, ans=0.1 2023-10-11 02:13:28,937 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.32 vs. 
limit=15.0 2023-10-11 02:13:36,004 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=556826.6666666666, ans=0.1 2023-10-11 02:13:50,278 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=556920.0, ans=0.125 2023-10-11 02:13:59,341 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=556920.0, ans=0.0 2023-10-11 02:14:13,124 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=557013.3333333334, ans=0.07 2023-10-11 02:14:13,218 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=557013.3333333334, ans=0.125 2023-10-11 02:14:31,024 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=557060.0, ans=0.125 2023-10-11 02:14:59,688 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=557200.0, ans=0.125 2023-10-11 02:15:09,819 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.342e+02 1.824e+02 2.054e+02 2.470e+02 3.937e+02, threshold=4.107e+02, percent-clipped=1.0 2023-10-11 02:15:11,206 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.01 vs. limit=22.5 2023-10-11 02:15:20,455 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=557293.3333333334, ans=0.125 2023-10-11 02:15:22,776 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.32 vs. limit=15.0 2023-10-11 02:15:27,674 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 02:15:27,676 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=557340.0, ans=0.125 2023-10-11 02:15:28,613 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=557340.0, ans=0.125 2023-10-11 02:15:41,443 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=557386.6666666666, ans=0.0 2023-10-11 02:15:45,972 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.84 vs. limit=15.0 2023-10-11 02:16:01,726 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=557433.3333333334, ans=0.125 2023-10-11 02:16:05,357 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=557480.0, ans=0.1 2023-10-11 02:16:08,394 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.31 vs. limit=15.0 2023-10-11 02:16:10,486 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.32 vs. 
limit=15.0 2023-10-11 02:16:20,681 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=557526.6666666666, ans=0.125 2023-10-11 02:16:25,601 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.47 vs. limit=22.5 2023-10-11 02:16:44,874 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.95 vs. limit=22.5 2023-10-11 02:16:47,285 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=557620.0, ans=0.0 2023-10-11 02:17:04,609 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.386e+02 1.705e+02 1.914e+02 2.207e+02 3.319e+02, threshold=3.827e+02, percent-clipped=0.0 2023-10-11 02:17:29,388 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 02:17:51,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=557900.0, ans=0.125 2023-10-11 02:18:00,892 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=557900.0, ans=0.0 2023-10-11 02:18:07,607 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=557946.6666666666, ans=0.05 2023-10-11 02:18:19,754 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=557993.3333333334, ans=0.04949747468305833 2023-10-11 02:18:28,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=558040.0, ans=0.2 2023-10-11 02:18:33,996 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=558040.0, ans=0.1 2023-10-11 02:18:47,057 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=558086.6666666666, ans=0.0 2023-10-11 02:18:47,363 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.87 vs. limit=15.0 2023-10-11 02:19:05,746 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.331e+02 1.677e+02 1.837e+02 2.135e+02 2.957e+02, threshold=3.673e+02, percent-clipped=0.0 2023-10-11 02:19:10,691 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=558180.0, ans=0.0 2023-10-11 02:19:10,762 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=558180.0, ans=0.2 2023-10-11 02:19:21,994 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=558226.6666666666, ans=0.125 2023-10-11 02:19:29,373 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.52 vs. 
limit=22.5 2023-10-11 02:19:36,785 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=558320.0, ans=0.2 2023-10-11 02:19:40,698 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=558320.0, ans=0.125 2023-10-11 02:19:46,052 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=558320.0, ans=0.1 2023-10-11 02:20:01,015 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=558413.3333333334, ans=0.0 2023-10-11 02:20:29,096 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=558506.6666666666, ans=0.1 2023-10-11 02:20:31,990 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=558506.6666666666, ans=0.125 2023-10-11 02:20:37,896 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.78 vs. limit=6.0 2023-10-11 02:20:45,895 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=558600.0, ans=0.125 2023-10-11 02:20:46,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=558600.0, ans=0.125 2023-10-11 02:20:55,454 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=558600.0, ans=0.1 2023-10-11 02:20:55,753 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.05 vs. limit=6.0 2023-10-11 02:20:56,475 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=558646.6666666666, ans=0.0 2023-10-11 02:21:01,903 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.375e+02 1.692e+02 1.848e+02 2.088e+02 3.397e+02, threshold=3.697e+02, percent-clipped=0.0 2023-10-11 02:21:10,395 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.55 vs. limit=22.5 2023-10-11 02:21:15,250 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. limit=6.0 2023-10-11 02:21:17,884 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=558740.0, ans=0.0 2023-10-11 02:21:21,136 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=558740.0, ans=0.95 2023-10-11 02:21:28,462 INFO [train.py:1031] (0/4) Epoch 9, batch 10500, loss[loss=0.1875, simple_loss=0.28, pruned_loss=0.04746, over 16886.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2965, pruned_loss=0.06065, over 32616240.68 frames. 
], batch size: 82, lr: 3.96e-03, grad_scale: 32.0 2023-10-11 02:21:40,057 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=558833.3333333334, ans=0.02 2023-10-11 02:22:00,747 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.72 vs. limit=12.0 2023-10-11 02:22:01,584 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.47 vs. limit=22.5 2023-10-11 02:22:10,498 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=558973.3333333334, ans=0.2 2023-10-11 02:22:13,150 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=558973.3333333334, ans=0.0 2023-10-11 02:22:20,538 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.60 vs. limit=6.0 2023-10-11 02:22:29,732 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=559020.0, ans=0.0 2023-10-11 02:22:53,368 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.78 vs. limit=15.0 2023-10-11 02:22:53,596 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.328e+02 1.706e+02 1.870e+02 2.179e+02 2.943e+02, threshold=3.741e+02, percent-clipped=0.0 2023-10-11 02:23:01,721 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=559113.3333333334, ans=0.0 2023-10-11 02:23:04,141 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.26 vs. limit=15.0 2023-10-11 02:23:11,751 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=559160.0, ans=0.0 2023-10-11 02:23:12,946 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.84 vs. limit=10.0 2023-10-11 02:23:13,765 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=559160.0, ans=0.0 2023-10-11 02:23:31,925 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.67 vs. limit=15.0 2023-10-11 02:23:36,860 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=559253.3333333334, ans=0.0 2023-10-11 02:23:43,250 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.17 vs. 
limit=6.0 2023-10-11 02:24:23,806 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=559486.6666666666, ans=0.0 2023-10-11 02:24:45,505 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 02:24:47,505 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=559580.0, ans=0.125 2023-10-11 02:24:48,047 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.352e+02 1.701e+02 1.882e+02 2.092e+02 3.275e+02, threshold=3.765e+02, percent-clipped=0.0 2023-10-11 02:25:00,073 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=559626.6666666666, ans=0.125 2023-10-11 02:25:14,471 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=559673.3333333334, ans=0.125 2023-10-11 02:25:15,529 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=559673.3333333334, ans=0.0 2023-10-11 02:25:16,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=559673.3333333334, ans=0.1 2023-10-11 02:25:50,240 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=559813.3333333334, ans=0.125 2023-10-11 02:26:00,330 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=559860.0, ans=0.1 2023-10-11 02:26:18,791 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.18 vs. limit=15.0 2023-10-11 02:26:24,317 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.96 vs. limit=22.5 2023-10-11 02:26:25,988 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-120000.pt 2023-10-11 02:26:30,990 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=560000.0, ans=0.125 2023-10-11 02:26:35,442 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_ff2.min_abs, batch_count=560000.0, ans=0.1 2023-10-11 02:26:46,091 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.235e+02 1.759e+02 2.001e+02 2.384e+02 4.309e+02, threshold=4.001e+02, percent-clipped=4.0 2023-10-11 02:26:52,868 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=560093.3333333334, ans=0.0 2023-10-11 02:27:10,598 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=560140.0, ans=0.125 2023-10-11 02:27:15,520 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.82 vs. 
limit=6.0 2023-10-11 02:27:18,177 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=560186.6666666666, ans=0.2 2023-10-11 02:27:20,500 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=560186.6666666666, ans=0.09899494936611666 2023-10-11 02:27:26,863 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=560233.3333333334, ans=0.1 2023-10-11 02:28:04,643 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=560373.3333333334, ans=0.125 2023-10-11 02:28:07,698 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.87 vs. limit=22.5 2023-10-11 02:28:21,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=560466.6666666666, ans=0.1 2023-10-11 02:28:40,276 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.346e+02 1.644e+02 1.831e+02 2.037e+02 3.139e+02, threshold=3.662e+02, percent-clipped=0.0 2023-10-11 02:28:45,647 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.66 vs. limit=10.0 2023-10-11 02:28:57,902 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.58 vs. limit=6.0 2023-10-11 02:29:17,296 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=560653.3333333334, ans=0.0 2023-10-11 02:29:19,170 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=560653.3333333334, ans=0.125 2023-10-11 02:29:28,232 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=560700.0, ans=0.0 2023-10-11 02:29:30,505 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=560746.6666666666, ans=0.035 2023-10-11 02:29:36,687 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=560746.6666666666, ans=0.2 2023-10-11 02:29:47,332 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=560793.3333333334, ans=0.0 2023-10-11 02:29:53,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=560840.0, ans=0.0 2023-10-11 02:30:14,638 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.14 vs. limit=15.0 2023-10-11 02:30:29,515 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.351e+02 1.628e+02 1.861e+02 2.137e+02 3.350e+02, threshold=3.723e+02, percent-clipped=0.0 2023-10-11 02:30:33,799 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.97 vs. limit=15.0 2023-10-11 02:30:58,628 INFO [train.py:1031] (0/4) Epoch 9, batch 11000, loss[loss=0.2056, simple_loss=0.2912, pruned_loss=0.05996, over 16645.00 frames. 
], tot_loss[loss=0.2091, simple_loss=0.2966, pruned_loss=0.06081, over 32674544.98 frames. ], batch size: 61, lr: 3.95e-03, grad_scale: 32.0 2023-10-11 02:31:06,319 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.27 vs. limit=15.0 2023-10-11 02:31:23,056 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=561213.3333333334, ans=0.125 2023-10-11 02:31:26,839 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.93 vs. limit=15.0 2023-10-11 02:31:32,915 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.49 vs. limit=15.0 2023-10-11 02:31:33,712 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=561260.0, ans=0.0 2023-10-11 02:31:39,197 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.99 vs. limit=12.0 2023-10-11 02:31:39,330 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.21 vs. limit=15.0 2023-10-11 02:32:03,850 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=561400.0, ans=0.0 2023-10-11 02:32:19,320 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=561446.6666666666, ans=0.0 2023-10-11 02:32:24,480 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.368e+02 1.842e+02 2.021e+02 2.298e+02 3.997e+02, threshold=4.043e+02, percent-clipped=1.0 2023-10-11 02:32:39,016 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=561493.3333333334, ans=0.125 2023-10-11 02:32:43,331 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=561540.0, ans=0.2 2023-10-11 02:33:01,251 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=561586.6666666666, ans=0.125 2023-10-11 02:33:22,505 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=561680.0, ans=0.125 2023-10-11 02:33:23,515 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 02:33:23,632 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=561680.0, ans=0.0 2023-10-11 02:33:28,941 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=561680.0, ans=0.0 2023-10-11 02:33:32,604 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.11 vs. limit=15.0 2023-10-11 02:33:46,787 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.27 vs. 
limit=15.0 2023-10-11 02:33:57,065 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=561773.3333333334, ans=0.125 2023-10-11 02:34:08,217 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=561820.0, ans=0.125 2023-10-11 02:34:25,923 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.365e+02 1.688e+02 1.877e+02 2.286e+02 3.190e+02, threshold=3.755e+02, percent-clipped=0.0 2023-10-11 02:34:46,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=562006.6666666666, ans=0.125 2023-10-11 02:35:00,804 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=562053.3333333334, ans=0.0 2023-10-11 02:35:15,215 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=562146.6666666666, ans=0.0 2023-10-11 02:35:27,171 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=562193.3333333334, ans=0.0 2023-10-11 02:35:49,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=562286.6666666666, ans=0.1 2023-10-11 02:35:55,708 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=562286.6666666666, ans=0.125 2023-10-11 02:36:13,239 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=562380.0, ans=0.125 2023-10-11 02:36:17,861 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.448e+02 1.633e+02 1.767e+02 1.977e+02 2.630e+02, threshold=3.535e+02, percent-clipped=0.0 2023-10-11 02:36:20,940 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=562380.0, ans=0.125 2023-10-11 02:36:20,997 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=562380.0, ans=0.2 2023-10-11 02:37:09,083 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=562566.6666666666, ans=0.125 2023-10-11 02:37:09,126 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=562566.6666666666, ans=0.1 2023-10-11 02:37:17,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=562613.3333333334, ans=0.025 2023-10-11 02:37:29,413 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=562660.0, ans=0.125 2023-10-11 02:37:41,008 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=562706.6666666666, ans=0.0 2023-10-11 02:37:57,183 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=562800.0, ans=0.125 2023-10-11 02:38:01,598 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.74 vs. 
limit=12.0 2023-10-11 02:38:12,793 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.319e+02 1.660e+02 1.855e+02 2.054e+02 3.168e+02, threshold=3.710e+02, percent-clipped=0.0 2023-10-11 02:38:15,411 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.36 vs. limit=15.0 2023-10-11 02:38:18,164 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=562893.3333333334, ans=0.2 2023-10-11 02:38:39,052 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=562986.6666666666, ans=0.125 2023-10-11 02:38:46,606 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=562986.6666666666, ans=0.0 2023-10-11 02:38:48,510 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.15 vs. limit=15.0 2023-10-11 02:38:49,965 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.04 vs. limit=15.0 2023-10-11 02:38:53,263 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.53 vs. limit=22.5 2023-10-11 02:39:04,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=563080.0, ans=0.0 2023-10-11 02:39:43,121 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=563220.0, ans=0.125 2023-10-11 02:39:56,078 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.50 vs. limit=15.0 2023-10-11 02:40:04,453 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.783e+02 1.986e+02 2.246e+02 3.484e+02, threshold=3.971e+02, percent-clipped=0.0 2023-10-11 02:40:16,570 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=563360.0, ans=0.125 2023-10-11 02:40:32,419 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=563453.3333333334, ans=0.125 2023-10-11 02:40:33,081 INFO [train.py:1031] (0/4) Epoch 9, batch 11500, loss[loss=0.2242, simple_loss=0.3136, pruned_loss=0.06739, over 16986.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2964, pruned_loss=0.06065, over 32697525.62 frames. ], batch size: 117, lr: 3.95e-03, grad_scale: 32.0 2023-10-11 02:40:33,294 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=563453.3333333334, ans=0.2 2023-10-11 02:40:37,205 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=563453.3333333334, ans=0.1 2023-10-11 02:40:49,869 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=563500.0, ans=0.5 2023-10-11 02:40:59,840 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.95 vs. 
limit=22.5 2023-10-11 02:41:04,278 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.67 vs. limit=15.0 2023-10-11 02:41:05,105 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=563593.3333333334, ans=0.125 2023-10-11 02:41:21,077 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=563640.0, ans=0.5 2023-10-11 02:41:22,169 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=563640.0, ans=0.125 2023-10-11 02:41:23,476 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=563640.0, ans=0.125 2023-10-11 02:41:32,751 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=563686.6666666666, ans=0.1 2023-10-11 02:41:44,348 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=563733.3333333334, ans=0.0 2023-10-11 02:41:56,774 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.711e+02 1.934e+02 2.141e+02 2.814e+02, threshold=3.869e+02, percent-clipped=0.0 2023-10-11 02:42:06,171 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=563826.6666666666, ans=0.1 2023-10-11 02:42:28,432 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=563920.0, ans=0.125 2023-10-11 02:42:39,719 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.89 vs. limit=15.0 2023-10-11 02:42:57,169 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=564013.3333333334, ans=0.125 2023-10-11 02:42:58,396 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=1.96 vs. limit=15.0 2023-10-11 02:43:23,237 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=564106.6666666666, ans=10.0 2023-10-11 02:43:37,903 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=564200.0, ans=0.2 2023-10-11 02:43:51,180 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.414e+02 1.630e+02 1.759e+02 2.031e+02 2.945e+02, threshold=3.519e+02, percent-clipped=0.0 2023-10-11 02:43:51,716 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.35 vs. limit=15.0 2023-10-11 02:44:00,821 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=564293.3333333334, ans=0.2 2023-10-11 02:44:02,854 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=564293.3333333334, ans=0.1 2023-10-11 02:44:09,946 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.78 vs. 
limit=22.5 2023-10-11 02:44:24,626 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=564386.6666666666, ans=0.1 2023-10-11 02:44:37,642 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=564433.3333333334, ans=0.2 2023-10-11 02:44:53,880 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=564526.6666666666, ans=0.1 2023-10-11 02:44:55,857 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=564526.6666666666, ans=0.125 2023-10-11 02:45:31,790 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-11 02:45:39,245 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=564666.6666666666, ans=0.125 2023-10-11 02:45:45,845 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=564666.6666666666, ans=0.125 2023-10-11 02:45:49,617 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.69 vs. limit=15.0 2023-10-11 02:45:54,335 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.255e+02 1.641e+02 1.849e+02 2.148e+02 2.920e+02, threshold=3.698e+02, percent-clipped=0.0 2023-10-11 02:46:13,697 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=564806.6666666666, ans=0.0 2023-10-11 02:46:21,990 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=564806.6666666666, ans=0.1 2023-10-11 02:46:25,708 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=564853.3333333334, ans=0.2 2023-10-11 02:46:34,014 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=564853.3333333334, ans=0.125 2023-10-11 02:46:38,331 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=564900.0, ans=0.125 2023-10-11 02:46:41,742 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=564900.0, ans=0.0 2023-10-11 02:46:42,656 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.05 vs. limit=15.0 2023-10-11 02:47:11,633 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.82 vs. limit=15.0 2023-10-11 02:47:52,981 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.667e+02 1.867e+02 2.069e+02 2.411e+02, threshold=3.733e+02, percent-clipped=0.0 2023-10-11 02:47:58,222 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.83 vs. limit=15.0 2023-10-11 02:48:11,140 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.52 vs. 
limit=6.0 2023-10-11 02:48:14,469 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.42 vs. limit=22.5 2023-10-11 02:48:22,796 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=565273.3333333334, ans=0.125 2023-10-11 02:48:42,709 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=565366.6666666666, ans=0.0 2023-10-11 02:49:05,447 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=565460.0, ans=0.2 2023-10-11 02:49:11,997 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=565506.6666666666, ans=0.2 2023-10-11 02:49:20,935 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=565506.6666666666, ans=0.125 2023-10-11 02:49:53,139 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.410e+02 1.671e+02 1.825e+02 2.037e+02 2.675e+02, threshold=3.650e+02, percent-clipped=0.0 2023-10-11 02:50:04,807 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=565693.3333333334, ans=0.1 2023-10-11 02:50:14,919 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=10.63 vs. limit=12.0 2023-10-11 02:50:22,696 INFO [train.py:1031] (0/4) Epoch 9, batch 12000, loss[loss=0.1976, simple_loss=0.2855, pruned_loss=0.05484, over 16563.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2963, pruned_loss=0.0604, over 32707428.11 frames. ], batch size: 61, lr: 3.94e-03, grad_scale: 32.0 2023-10-11 02:50:35,493 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=565833.3333333334, ans=0.125 2023-10-11 02:50:48,953 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.92 vs. limit=15.0 2023-10-11 02:50:51,244 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=565880.0, ans=0.0 2023-10-11 02:51:01,344 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=565926.6666666666, ans=0.0 2023-10-11 02:51:04,602 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=565926.6666666666, ans=0.2 2023-10-11 02:51:11,704 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=565973.3333333334, ans=0.09899494936611666 2023-10-11 02:51:17,226 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0 2023-10-11 02:51:24,243 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.18 vs. 
limit=6.0 2023-10-11 02:51:49,655 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.390e+02 1.662e+02 1.863e+02 2.149e+02 3.184e+02, threshold=3.726e+02, percent-clipped=0.0 2023-10-11 02:52:07,563 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=566206.6666666666, ans=0.125 2023-10-11 02:52:15,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=566206.6666666666, ans=0.125 2023-10-11 02:52:15,408 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=566206.6666666666, ans=0.1 2023-10-11 02:52:30,195 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.80 vs. limit=15.0 2023-10-11 02:52:35,366 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=566300.0, ans=0.0 2023-10-11 02:52:45,350 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.31 vs. limit=6.0 2023-10-11 02:52:48,767 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.04 vs. limit=22.5 2023-10-11 02:52:49,464 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=566393.3333333334, ans=0.125 2023-10-11 02:53:04,399 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.84 vs. 
limit=15.0 2023-10-11 02:53:16,222 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=566486.6666666666, ans=0.07 2023-10-11 02:53:16,337 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=566486.6666666666, ans=0.0 2023-10-11 02:53:24,428 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=566533.3333333334, ans=0.125 2023-10-11 02:53:35,575 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=566580.0, ans=0.0 2023-10-11 02:53:36,178 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.329e+02 1.622e+02 1.807e+02 2.087e+02 2.790e+02, threshold=3.613e+02, percent-clipped=0.0 2023-10-11 02:53:36,374 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=566580.0, ans=0.0 2023-10-11 02:53:48,432 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=566626.6666666666, ans=0.0 2023-10-11 02:54:00,942 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=566673.3333333334, ans=0.125 2023-10-11 02:54:07,190 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=566720.0, ans=0.0 2023-10-11 02:54:08,233 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=566720.0, ans=0.05 2023-10-11 02:54:11,931 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=566720.0, ans=0.1 2023-10-11 02:54:19,160 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=566766.6666666666, ans=0.125 2023-10-11 02:54:31,472 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=566813.3333333334, ans=0.0 2023-10-11 02:54:38,343 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=566813.3333333334, ans=0.0 2023-10-11 02:54:39,419 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=566860.0, ans=0.125 2023-10-11 02:55:03,876 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=566953.3333333334, ans=0.0 2023-10-11 02:55:10,491 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=566953.3333333334, ans=0.125 2023-10-11 02:55:15,362 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=567000.0, ans=0.125 2023-10-11 02:55:29,117 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.705e+02 1.875e+02 2.192e+02 3.255e+02, threshold=3.750e+02, percent-clipped=0.0 2023-10-11 02:55:40,274 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.74 vs. 
limit=6.0 2023-10-11 02:55:42,278 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.38 vs. limit=22.5 2023-10-11 02:55:49,078 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=567140.0, ans=0.125 2023-10-11 02:56:29,621 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=567280.0, ans=0.125 2023-10-11 02:56:42,392 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.50 vs. limit=15.0 2023-10-11 02:56:46,199 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=567373.3333333334, ans=0.0 2023-10-11 02:56:48,106 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.49 vs. limit=15.0 2023-10-11 02:56:50,256 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=567373.3333333334, ans=0.0 2023-10-11 02:56:53,187 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=567373.3333333334, ans=0.1 2023-10-11 02:56:58,126 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=567420.0, ans=0.125 2023-10-11 02:57:08,137 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=567466.6666666666, ans=0.125 2023-10-11 02:57:09,778 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=567466.6666666666, ans=0.125 2023-10-11 02:57:11,615 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.68 vs. limit=15.0 2023-10-11 02:57:26,065 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.09 vs. limit=6.0 2023-10-11 02:57:29,808 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.328e+02 1.718e+02 1.933e+02 2.276e+02 3.451e+02, threshold=3.865e+02, percent-clipped=0.0 2023-10-11 02:57:45,406 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=567560.0, ans=0.0 2023-10-11 02:57:50,300 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.68 vs. limit=10.0 2023-10-11 02:57:53,356 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.64 vs. 
limit=6.0 2023-10-11 02:58:39,821 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=567793.3333333334, ans=0.2 2023-10-11 02:59:16,202 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=567933.3333333334, ans=10.0 2023-10-11 02:59:17,231 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=567933.3333333334, ans=0.2 2023-10-11 02:59:17,567 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.88 vs. limit=15.0 2023-10-11 02:59:22,651 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=567980.0, ans=0.125 2023-10-11 02:59:25,289 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.736e+02 1.910e+02 2.290e+02 4.074e+02, threshold=3.820e+02, percent-clipped=1.0 2023-10-11 02:59:45,210 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.66 vs. limit=10.0 2023-10-11 02:59:45,439 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.77 vs. limit=22.5 2023-10-11 02:59:52,058 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=568073.3333333334, ans=0.0 2023-10-11 02:59:54,604 INFO [train.py:1031] (0/4) Epoch 9, batch 12500, loss[loss=0.2196, simple_loss=0.2817, pruned_loss=0.07878, over 12934.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.296, pruned_loss=0.06058, over 32696276.56 frames. ], batch size: 440, lr: 3.93e-03, grad_scale: 32.0 2023-10-11 03:00:07,942 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0 2023-10-11 03:00:21,717 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.77 vs. limit=15.0 2023-10-11 03:00:28,365 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=568260.0, ans=10.0 2023-10-11 03:00:38,255 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=568260.0, ans=0.05 2023-10-11 03:00:44,388 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=568306.6666666666, ans=0.125 2023-10-11 03:00:52,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=568353.3333333334, ans=10.0 2023-10-11 03:01:03,846 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=568400.0, ans=0.5 2023-10-11 03:01:07,400 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.40 vs. 
limit=15.0 2023-10-11 03:01:15,068 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.718e+02 1.944e+02 2.290e+02 3.137e+02, threshold=3.889e+02, percent-clipped=0.0 2023-10-11 03:01:29,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=568493.3333333334, ans=0.0 2023-10-11 03:01:35,638 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.70 vs. limit=15.0 2023-10-11 03:01:57,790 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=568633.3333333334, ans=0.0 2023-10-11 03:01:58,776 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=568633.3333333334, ans=0.125 2023-10-11 03:02:00,731 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=568633.3333333334, ans=0.2 2023-10-11 03:02:29,035 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.16 vs. limit=10.0 2023-10-11 03:02:31,008 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.34 vs. limit=15.0 2023-10-11 03:02:36,971 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=568773.3333333334, ans=0.125 2023-10-11 03:02:37,001 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=568773.3333333334, ans=0.125 2023-10-11 03:02:56,430 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=568866.6666666666, ans=0.125 2023-10-11 03:03:06,274 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.74 vs. limit=15.0 2023-10-11 03:03:08,655 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.279e+02 1.590e+02 1.783e+02 2.035e+02 3.237e+02, threshold=3.565e+02, percent-clipped=0.0 2023-10-11 03:03:11,618 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=568913.3333333334, ans=0.2 2023-10-11 03:03:14,931 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=568960.0, ans=0.1 2023-10-11 03:03:38,066 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=569053.3333333334, ans=0.125 2023-10-11 03:03:41,708 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=569053.3333333334, ans=0.025 2023-10-11 03:03:49,619 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.78 vs. 
limit=6.0 2023-10-11 03:03:52,364 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=569100.0, ans=0.1 2023-10-11 03:04:08,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=569146.6666666666, ans=0.0 2023-10-11 03:04:21,974 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=569240.0, ans=0.125 2023-10-11 03:04:27,549 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=569240.0, ans=0.0 2023-10-11 03:04:29,263 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=569240.0, ans=0.035 2023-10-11 03:04:30,610 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=569240.0, ans=0.2 2023-10-11 03:05:01,278 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=569380.0, ans=0.1 2023-10-11 03:05:01,835 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.690e+02 1.800e+02 2.047e+02 3.021e+02, threshold=3.599e+02, percent-clipped=0.0 2023-10-11 03:05:02,089 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=569380.0, ans=0.0 2023-10-11 03:05:20,020 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=569473.3333333334, ans=0.0 2023-10-11 03:05:24,578 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.70 vs. limit=12.0 2023-10-11 03:05:27,582 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.05 vs. 
limit=6.0 2023-10-11 03:05:49,291 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=569613.3333333334, ans=0.0 2023-10-11 03:05:58,832 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=569613.3333333334, ans=0.2 2023-10-11 03:06:03,931 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=569660.0, ans=0.125 2023-10-11 03:06:05,491 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=569660.0, ans=0.0 2023-10-11 03:06:16,059 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=569706.6666666666, ans=0.0 2023-10-11 03:06:39,531 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=569800.0, ans=0.125 2023-10-11 03:06:55,765 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.736e+02 1.913e+02 2.154e+02 3.080e+02, threshold=3.826e+02, percent-clipped=0.0 2023-10-11 03:07:20,128 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 03:07:21,179 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=569986.6666666666, ans=0.1 2023-10-11 03:07:25,927 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=569986.6666666666, ans=0.0 2023-10-11 03:07:31,349 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=569986.6666666666, ans=0.125 2023-10-11 03:07:31,405 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=569986.6666666666, ans=0.2 2023-10-11 03:07:32,317 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=570033.3333333334, ans=0.125 2023-10-11 03:07:48,888 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=570080.0, ans=0.0 2023-10-11 03:08:00,768 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=570126.6666666666, ans=0.0 2023-10-11 03:08:07,310 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.15 vs. 
limit=22.5 2023-10-11 03:08:37,122 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=570313.3333333334, ans=0.125 2023-10-11 03:08:37,177 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=570313.3333333334, ans=0.125 2023-10-11 03:08:43,201 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.390e+02 1.622e+02 1.783e+02 1.931e+02 2.500e+02, threshold=3.567e+02, percent-clipped=0.0 2023-10-11 03:08:55,020 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=570360.0, ans=0.125 2023-10-11 03:09:09,322 INFO [train.py:1031] (0/4) Epoch 9, batch 13000, loss[loss=0.2075, simple_loss=0.2982, pruned_loss=0.05836, over 16857.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2967, pruned_loss=0.06062, over 32724273.92 frames. ], batch size: 87, lr: 3.92e-03, grad_scale: 32.0 2023-10-11 03:09:25,957 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=570500.0, ans=0.05 2023-10-11 03:09:59,532 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=570640.0, ans=0.125 2023-10-11 03:10:24,658 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=570733.3333333334, ans=0.125 2023-10-11 03:10:31,947 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=570733.3333333334, ans=0.125 2023-10-11 03:10:34,596 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 03:10:36,814 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.81 vs. limit=12.0 2023-10-11 03:10:36,846 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.83 vs. limit=15.0 2023-10-11 03:10:38,793 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.58 vs. limit=22.5 2023-10-11 03:10:44,470 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=570780.0, ans=0.0 2023-10-11 03:10:44,581 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=570780.0, ans=0.125 2023-10-11 03:10:44,807 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.69 vs. limit=6.0 2023-10-11 03:10:45,075 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.310e+02 1.699e+02 1.852e+02 2.104e+02 3.732e+02, threshold=3.704e+02, percent-clipped=1.0 2023-10-11 03:10:49,487 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=570826.6666666666, ans=0.125 2023-10-11 03:10:54,343 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.07 vs. 
limit=15.0 2023-10-11 03:10:56,076 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=570826.6666666666, ans=0.2 2023-10-11 03:11:01,680 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.88 vs. limit=15.0 2023-10-11 03:11:08,447 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=570873.3333333334, ans=0.125 2023-10-11 03:11:22,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=570966.6666666666, ans=0.0 2023-10-11 03:11:30,025 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=570966.6666666666, ans=0.07 2023-10-11 03:11:32,671 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=571013.3333333334, ans=0.1 2023-10-11 03:11:53,661 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=571060.0, ans=0.125 2023-10-11 03:12:09,718 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=571153.3333333334, ans=0.0 2023-10-11 03:12:21,208 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.95 vs. limit=15.0 2023-10-11 03:12:22,657 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=571200.0, ans=0.0 2023-10-11 03:12:29,614 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 03:12:32,826 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=571246.6666666666, ans=0.1 2023-10-11 03:12:38,689 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.341e+02 1.698e+02 1.954e+02 2.242e+02 3.172e+02, threshold=3.908e+02, percent-clipped=0.0 2023-10-11 03:13:33,445 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.71 vs. limit=8.0 2023-10-11 03:13:38,760 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=571480.0, ans=0.125 2023-10-11 03:13:41,929 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=571526.6666666666, ans=0.0 2023-10-11 03:14:03,234 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=571620.0, ans=0.1 2023-10-11 03:14:15,501 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=571666.6666666666, ans=0.125 2023-10-11 03:14:32,547 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.39 vs. 
limit=15.0 2023-10-11 03:14:35,318 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.336e+02 1.623e+02 1.776e+02 1.937e+02 2.839e+02, threshold=3.553e+02, percent-clipped=0.0 2023-10-11 03:14:36,755 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=571713.3333333334, ans=0.0 2023-10-11 03:14:37,897 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=571760.0, ans=0.0 2023-10-11 03:14:47,060 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=571760.0, ans=0.2 2023-10-11 03:14:48,074 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=571806.6666666666, ans=0.1 2023-10-11 03:15:03,884 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=571853.3333333334, ans=0.2 2023-10-11 03:15:11,487 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=571900.0, ans=0.2 2023-10-11 03:15:12,591 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=571900.0, ans=0.1 2023-10-11 03:15:21,464 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=571946.6666666666, ans=0.125 2023-10-11 03:15:32,657 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.14 vs. limit=15.0 2023-10-11 03:15:53,496 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=572086.6666666666, ans=0.125 2023-10-11 03:16:02,811 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=572086.6666666666, ans=0.125 2023-10-11 03:16:04,969 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=572133.3333333334, ans=0.1 2023-10-11 03:16:12,289 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.90 vs. limit=22.5 2023-10-11 03:16:24,777 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.478e+02 1.698e+02 1.866e+02 2.146e+02 3.402e+02, threshold=3.731e+02, percent-clipped=0.0 2023-10-11 03:16:28,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=572226.6666666666, ans=0.125 2023-10-11 03:16:37,418 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_na.min_abs, batch_count=572226.6666666666, ans=0.02 2023-10-11 03:16:43,980 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.15 vs. limit=15.0 2023-10-11 03:16:44,971 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.87 vs. limit=15.0 2023-10-11 03:16:50,883 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.38 vs. 
limit=15.0 2023-10-11 03:16:52,787 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.53 vs. limit=15.0 2023-10-11 03:16:58,521 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=572320.0, ans=0.0 2023-10-11 03:17:22,226 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 03:17:48,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=572553.3333333334, ans=0.125 2023-10-11 03:17:53,728 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=572553.3333333334, ans=0.125 2023-10-11 03:17:59,211 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=572600.0, ans=0.1 2023-10-11 03:18:03,089 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 03:18:04,051 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=572600.0, ans=0.125 2023-10-11 03:18:11,439 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=572646.6666666666, ans=0.0 2023-10-11 03:18:18,035 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.402e+02 1.622e+02 1.866e+02 2.124e+02 3.273e+02, threshold=3.731e+02, percent-clipped=0.0 2023-10-11 03:18:23,764 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=572693.3333333334, ans=0.125 2023-10-11 03:18:37,281 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.35 vs. limit=22.5 2023-10-11 03:18:40,031 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=572740.0, ans=0.0 2023-10-11 03:18:41,056 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=572786.6666666666, ans=0.125 2023-10-11 03:18:41,618 INFO [train.py:1031] (0/4) Epoch 9, batch 13500, loss[loss=0.2234, simple_loss=0.3072, pruned_loss=0.0698, over 16596.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2958, pruned_loss=0.06016, over 32748832.30 frames. ], batch size: 241, lr: 3.91e-03, grad_scale: 16.0 2023-10-11 03:18:59,352 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=572833.3333333334, ans=0.125 2023-10-11 03:19:02,604 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.13 vs. limit=15.0 2023-10-11 03:19:06,188 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=572880.0, ans=0.1 2023-10-11 03:19:20,655 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.93 vs. 
limit=22.5 2023-10-11 03:19:38,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=573020.0, ans=0.125 2023-10-11 03:20:10,545 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.385e+02 1.631e+02 1.771e+02 2.012e+02 3.098e+02, threshold=3.542e+02, percent-clipped=0.0 2023-10-11 03:20:11,137 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=573113.3333333334, ans=0.125 2023-10-11 03:20:16,810 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=573160.0, ans=0.125 2023-10-11 03:21:26,745 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/epoch-9.pt 2023-10-11 03:21:56,146 INFO [train.py:1031] (0/4) Epoch 10, batch 0, loss[loss=0.1937, simple_loss=0.2743, pruned_loss=0.05655, over 16873.00 frames. ], tot_loss[loss=0.1937, simple_loss=0.2743, pruned_loss=0.05655, over 16873.00 frames. ], batch size: 123, lr: 3.69e-03, grad_scale: 32.0 2023-10-11 03:21:56,148 INFO [train.py:1054] (0/4) Computing validation loss 2023-10-11 03:22:04,440 INFO [train.py:1063] (0/4) Epoch 10, validation: loss=0.221, simple_loss=0.3086, pruned_loss=0.06676, over 1020973.00 frames. 2023-10-11 03:22:04,440 INFO [train.py:1064] (0/4) Maximum memory allocated so far is 17165MB 2023-10-11 03:22:18,818 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=573556.6666666666, ans=0.0 2023-10-11 03:22:24,127 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=573556.6666666666, ans=0.1 2023-10-11 03:22:24,986 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=573556.6666666666, ans=0.1 2023-10-11 03:22:33,734 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.454e+02 1.797e+02 1.975e+02 2.272e+02 3.905e+02, threshold=3.949e+02, percent-clipped=2.0 2023-10-11 03:22:55,996 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=573696.6666666666, ans=0.2 2023-10-11 03:22:57,063 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=573696.6666666666, ans=0.1 2023-10-11 03:23:26,779 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=573790.0, ans=0.0 2023-10-11 03:23:26,808 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=573790.0, ans=0.2 2023-10-11 03:23:31,008 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=573836.6666666666, ans=0.125 2023-10-11 03:23:46,721 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=573883.3333333334, ans=0.09899494936611666 2023-10-11 03:24:02,499 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=573930.0, ans=0.125 2023-10-11 03:24:23,581 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=574023.3333333334, ans=0.0 2023-10-11 03:24:30,212 INFO [optim.py:471] 
(0/4) Clipping_scale=2.0, grad-norm quartiles 1.319e+02 1.587e+02 1.689e+02 1.848e+02 2.696e+02, threshold=3.378e+02, percent-clipped=0.0 2023-10-11 03:24:31,535 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=574070.0, ans=0.2 2023-10-11 03:25:14,911 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=574256.6666666666, ans=0.125 2023-10-11 03:25:33,429 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=574350.0, ans=0.1 2023-10-11 03:25:50,455 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.85 vs. limit=10.0 2023-10-11 03:26:10,693 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=574490.0, ans=0.125 2023-10-11 03:26:15,941 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.46 vs. limit=22.5 2023-10-11 03:26:20,404 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.336e+02 1.692e+02 1.849e+02 2.100e+02 3.049e+02, threshold=3.698e+02, percent-clipped=0.0 2023-10-11 03:26:26,245 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=574536.6666666666, ans=0.125 2023-10-11 03:26:27,041 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=574583.3333333334, ans=0.1 2023-10-11 03:26:30,225 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=574583.3333333334, ans=0.0 2023-10-11 03:26:57,359 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=574676.6666666666, ans=0.125 2023-10-11 03:27:03,939 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=574676.6666666666, ans=0.07 2023-10-11 03:27:05,749 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=574723.3333333334, ans=0.0 2023-10-11 03:27:10,646 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=574723.3333333334, ans=0.0 2023-10-11 03:27:13,498 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=574723.3333333334, ans=0.0 2023-10-11 03:27:21,392 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=574770.0, ans=0.0 2023-10-11 03:27:24,993 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=574770.0, ans=0.125 2023-10-11 03:27:35,590 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=574816.6666666666, ans=0.0 2023-10-11 03:27:40,369 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.25 vs. 
limit=15.0 2023-10-11 03:27:41,850 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=574863.3333333334, ans=0.125 2023-10-11 03:27:59,400 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=574910.0, ans=0.025 2023-10-11 03:28:17,403 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.356e+02 1.682e+02 1.839e+02 2.067e+02 3.304e+02, threshold=3.677e+02, percent-clipped=0.0 2023-10-11 03:28:30,389 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=575050.0, ans=0.125 2023-10-11 03:28:57,793 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.44 vs. limit=5.0 2023-10-11 03:29:01,699 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=575190.0, ans=0.125 2023-10-11 03:29:50,513 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=575376.6666666666, ans=0.1 2023-10-11 03:29:59,531 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.99 vs. limit=10.0 2023-10-11 03:30:05,174 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=575470.0, ans=0.09899494936611666 2023-10-11 03:30:08,541 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.474e+02 1.803e+02 1.967e+02 2.167e+02 3.200e+02, threshold=3.935e+02, percent-clipped=0.0 2023-10-11 03:30:09,969 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=575470.0, ans=0.2 2023-10-11 03:30:16,533 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=575516.6666666666, ans=0.125 2023-10-11 03:30:18,648 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-10-11 03:30:20,566 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.38 vs. limit=15.0 2023-10-11 03:30:21,227 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=575516.6666666666, ans=0.125 2023-10-11 03:30:38,782 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=575610.0, ans=0.125 2023-10-11 03:30:44,831 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=575610.0, ans=0.07 2023-10-11 03:30:48,146 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.68 vs. limit=15.0 2023-10-11 03:30:49,672 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-11 03:30:50,901 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.70 vs. 
limit=22.5 2023-10-11 03:31:31,014 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=575796.6666666666, ans=0.2 2023-10-11 03:31:35,842 INFO [train.py:1031] (0/4) Epoch 10, batch 500, loss[loss=0.1998, simple_loss=0.2604, pruned_loss=0.0696, over 12548.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.295, pruned_loss=0.05983, over 7288984.91 frames. ], batch size: 440, lr: 3.68e-03, grad_scale: 32.0 2023-10-11 03:31:41,157 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=575843.3333333334, ans=0.2 2023-10-11 03:32:02,624 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.372e+02 1.738e+02 1.898e+02 2.105e+02 2.822e+02, threshold=3.796e+02, percent-clipped=0.0 2023-10-11 03:32:07,039 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=575936.6666666666, ans=0.125 2023-10-11 03:32:14,323 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=575983.3333333334, ans=0.95 2023-10-11 03:32:24,092 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.74 vs. limit=15.0 2023-10-11 03:32:25,093 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=576030.0, ans=6.0 2023-10-11 03:32:31,056 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=576076.6666666666, ans=0.0 2023-10-11 03:32:39,681 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=576076.6666666666, ans=0.0 2023-10-11 03:32:49,857 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=576123.3333333334, ans=0.1 2023-10-11 03:32:49,950 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=576123.3333333334, ans=0.0 2023-10-11 03:32:50,167 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.91 vs. limit=6.0 2023-10-11 03:32:56,691 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=576170.0, ans=0.0 2023-10-11 03:33:15,328 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.26 vs. 
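The loss fields in these train.py:1031 entries are consistent with loss = 0.5 * simple_loss + pruned_loss: for batch 500 above, 0.5 x 0.2604 + 0.0696 = 0.1998 (batch) and 0.5 x 0.295 + 0.05983 = 0.2073 (tot_loss). A sketch of that combination; the 0.5 weight is inferred from the logged numbers, and any warm-up ramp applied to it during training is omitted:

    import torch

    def total_loss(simple_loss: torch.Tensor,
                   pruned_loss: torch.Tensor,
                   simple_loss_scale: float = 0.5) -> torch.Tensor:
        # Weighted sum that reproduces the logged values, e.g. batch 500:
        # 0.5 * 0.295 + 0.05983 = 0.20733 ~= logged tot_loss 0.2073.
        return simple_loss_scale * simple_loss + pruned_loss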
limit=6.0 2023-10-11 03:33:20,457 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=576263.3333333334, ans=0.0 2023-10-11 03:33:36,733 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=576310.0, ans=0.125 2023-10-11 03:33:52,413 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=576403.3333333334, ans=0.07 2023-10-11 03:33:52,965 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.743e+02 1.919e+02 2.261e+02 3.270e+02, threshold=3.838e+02, percent-clipped=0.0 2023-10-11 03:34:15,749 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.76 vs. limit=12.0 2023-10-11 03:34:17,068 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=576496.6666666666, ans=0.1 2023-10-11 03:34:36,162 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=576590.0, ans=0.125 2023-10-11 03:34:43,410 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=576636.6666666666, ans=0.0 2023-10-11 03:35:37,180 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=576870.0, ans=0.125 2023-10-11 03:35:39,192 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=576870.0, ans=0.09899494936611666 2023-10-11 03:35:41,225 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.481e+02 1.830e+02 2.020e+02 2.223e+02 3.472e+02, threshold=4.041e+02, percent-clipped=0.0 2023-10-11 03:35:41,522 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 03:35:55,080 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=576916.6666666666, ans=0.125 2023-10-11 03:36:16,153 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=577010.0, ans=0.125 2023-10-11 03:36:16,329 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.58 vs. 
limit=15.0 2023-10-11 03:36:24,658 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=577056.6666666666, ans=0.125 2023-10-11 03:36:27,164 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=577056.6666666666, ans=0.0 2023-10-11 03:36:33,156 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=577056.6666666666, ans=0.125 2023-10-11 03:36:55,239 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=577150.0, ans=0.0 2023-10-11 03:37:31,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=577290.0, ans=0.125 2023-10-11 03:37:36,775 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.726e+02 1.900e+02 2.168e+02 3.407e+02, threshold=3.799e+02, percent-clipped=0.0 2023-10-11 03:38:04,357 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=577430.0, ans=0.1 2023-10-11 03:38:26,622 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=577523.3333333334, ans=0.2 2023-10-11 03:38:30,508 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=577523.3333333334, ans=0.125 2023-10-11 03:38:30,532 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=577523.3333333334, ans=0.2 2023-10-11 03:38:48,851 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=577616.6666666666, ans=0.1 2023-10-11 03:38:51,898 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=577616.6666666666, ans=0.125 2023-10-11 03:39:05,300 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.61 vs. limit=8.0 2023-10-11 03:39:13,933 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=577710.0, ans=0.125 2023-10-11 03:39:29,669 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=577803.3333333334, ans=0.2 2023-10-11 03:39:32,201 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.731e+02 1.961e+02 2.357e+02 3.371e+02, threshold=3.922e+02, percent-clipped=0.0 2023-10-11 03:39:55,204 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=577896.6666666666, ans=0.1 2023-10-11 03:40:00,563 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=577896.6666666666, ans=0.0 2023-10-11 03:40:23,757 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.96 vs. 
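The scaling.py:979 Whitening entries compare a per-module metric against a scheduled limit. One standard way to score how isotropic ("white") channel activations are is the trace ratio below, which equals 1.0 for a perfectly white covariance and grows as a few directions dominate; it matches the spirit, though not necessarily the exact formula, of these lines:

    import torch

    def whitening_metric(x: torch.Tensor) -> torch.Tensor:
        """x: (num_frames, num_channels). Returns 1.0 for an isotropic
        channel covariance, larger as the spectrum becomes uneven.
        Illustrative only; scaling.py's exact statistic is not in the log."""
        x = x - x.mean(dim=0, keepdim=True)
        cov = (x.T @ x) / x.shape[0]          # (C, C) channel covariance
        d = cov.shape[0]
        return d * torch.trace(cov @ cov) / torch.trace(cov) ** 2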
limit=15.0 2023-10-11 03:40:50,240 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 03:40:59,156 INFO [train.py:1031] (0/4) Epoch 10, batch 1000, loss[loss=0.2194, simple_loss=0.3119, pruned_loss=0.06348, over 16865.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.296, pruned_loss=0.0602, over 12940801.33 frames. ], batch size: 130, lr: 3.68e-03, grad_scale: 32.0 2023-10-11 03:41:23,519 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.186e+02 1.625e+02 1.755e+02 1.937e+02 2.663e+02, threshold=3.510e+02, percent-clipped=0.0 2023-10-11 03:41:23,811 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=578270.0, ans=0.2 2023-10-11 03:41:25,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=578270.0, ans=0.125 2023-10-11 03:41:31,089 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=578316.6666666666, ans=0.125 2023-10-11 03:41:43,228 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.72 vs. limit=5.0 2023-10-11 03:41:46,684 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=578363.3333333334, ans=0.2 2023-10-11 03:41:50,231 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=578410.0, ans=0.0 2023-10-11 03:41:56,233 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=578410.0, ans=0.0 2023-10-11 03:41:57,004 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=578410.0, ans=0.0 2023-10-11 03:41:57,334 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.86 vs. limit=10.0 2023-10-11 03:41:59,470 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=578410.0, ans=0.025 2023-10-11 03:42:09,635 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=578456.6666666666, ans=0.125 2023-10-11 03:42:14,861 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.41 vs. limit=10.0 2023-10-11 03:42:25,011 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=578550.0, ans=0.0 2023-10-11 03:42:31,696 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=578550.0, ans=0.125 2023-10-11 03:42:45,650 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=578643.3333333334, ans=0.2 2023-10-11 03:42:51,458 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.74 vs. 
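The frame counts in the tot_loss[...] fields grow by less at each report (7.29M at batch 500, 12.94M at batch 1000, then 17.30M, 20.77M, ...), the signature of a decayed running sum rather than a plain cumulative count. A sketch of frame-weighted loss tracking; the decay constant is an assumption chosen only to show the mechanism:

    def update_tot_loss(tot_loss_sum, tot_frames, batch_loss, batch_frames,
                        decay=0.999):
        """Frame-weighted running average of the training loss.

        `decay` is an assumed constant; the true value behind this log's
        shrinking frame increments is not shown.
        """
        tot_loss_sum = decay * tot_loss_sum + batch_loss * batch_frames
        tot_frames = decay * tot_frames + batch_frames
        return tot_loss_sum, tot_frames  # report tot_loss_sum / tot_frames

    # e.g. after each batch:
    #   tot, fr = update_tot_loss(tot, fr, loss.item(), num_frames)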
limit=22.5 2023-10-11 03:43:15,015 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.799e+02 2.083e+02 2.378e+02 3.396e+02, threshold=4.166e+02, percent-clipped=0.0 2023-10-11 03:43:23,619 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=578783.3333333334, ans=0.2 2023-10-11 03:43:31,139 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=578783.3333333334, ans=0.125 2023-10-11 03:43:43,830 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.84 vs. limit=10.0 2023-10-11 03:45:16,112 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.252e+02 1.588e+02 1.788e+02 2.035e+02 3.161e+02, threshold=3.576e+02, percent-clipped=0.0 2023-10-11 03:45:38,925 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=579296.6666666666, ans=0.0 2023-10-11 03:45:44,091 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.95 vs. limit=15.0 2023-10-11 03:46:02,433 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=579390.0, ans=0.1 2023-10-11 03:46:18,157 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=579483.3333333334, ans=0.0 2023-10-11 03:46:20,137 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=579483.3333333334, ans=0.0 2023-10-11 03:46:23,064 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=579483.3333333334, ans=0.125 2023-10-11 03:47:06,698 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.308e+02 1.715e+02 1.931e+02 2.139e+02 3.229e+02, threshold=3.862e+02, percent-clipped=0.0 2023-10-11 03:47:20,235 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=579716.6666666666, ans=0.1 2023-10-11 03:47:45,557 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=579856.6666666666, ans=0.125 2023-10-11 03:47:53,241 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=579856.6666666666, ans=0.1 2023-10-11 03:48:05,430 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=579903.3333333334, ans=0.125 2023-10-11 03:48:21,630 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.21 vs. limit=15.0 2023-10-11 03:48:25,494 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.48 vs. limit=15.0 2023-10-11 03:48:25,692 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.74 vs. 
limit=12.0 2023-10-11 03:48:42,644 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=580090.0, ans=0.0 2023-10-11 03:48:47,737 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.65 vs. limit=22.5 2023-10-11 03:48:50,385 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=580090.0, ans=0.125 2023-10-11 03:48:55,114 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.445e+02 1.736e+02 1.988e+02 2.305e+02 3.684e+02, threshold=3.976e+02, percent-clipped=0.0 2023-10-11 03:49:05,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=580183.3333333334, ans=0.125 2023-10-11 03:49:13,129 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=580183.3333333334, ans=0.0 2023-10-11 03:49:16,105 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=580230.0, ans=10.0 2023-10-11 03:49:23,740 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=580230.0, ans=0.2 2023-10-11 03:49:28,553 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=580276.6666666666, ans=0.2 2023-10-11 03:49:40,383 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=580323.3333333334, ans=0.125 2023-10-11 03:49:41,328 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=580323.3333333334, ans=0.125 2023-10-11 03:49:49,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=580370.0, ans=0.125 2023-10-11 03:49:50,898 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=580370.0, ans=0.035 2023-10-11 03:50:19,890 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.52 vs. limit=15.0 2023-10-11 03:50:27,219 INFO [train.py:1031] (0/4) Epoch 10, batch 1500, loss[loss=0.1758, simple_loss=0.2627, pruned_loss=0.04444, over 16526.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2941, pruned_loss=0.05958, over 17304320.92 frames. ], batch size: 266, lr: 3.67e-03, grad_scale: 32.0 2023-10-11 03:50:32,074 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=580510.0, ans=0.04949747468305833 2023-10-11 03:50:55,172 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.317e+02 1.652e+02 1.830e+02 2.080e+02 3.313e+02, threshold=3.659e+02, percent-clipped=0.0 2023-10-11 03:50:56,897 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.21 vs. 
limit=10.0 2023-10-11 03:50:59,487 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=580603.3333333334, ans=0.0 2023-10-11 03:51:14,579 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=580696.6666666666, ans=0.1 2023-10-11 03:51:27,799 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.98 vs. limit=15.0 2023-10-11 03:51:39,061 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=580790.0, ans=0.125 2023-10-11 03:51:43,969 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=580790.0, ans=0.125 2023-10-11 03:51:55,127 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=580836.6666666666, ans=0.125 2023-10-11 03:52:10,766 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=580883.3333333334, ans=0.125 2023-10-11 03:52:12,130 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.19 vs. limit=22.5 2023-10-11 03:52:12,149 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.97 vs. limit=22.5 2023-10-11 03:52:13,122 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn2.whiten.whitening_limit, batch_count=580930.0, ans=22.5 2023-10-11 03:52:27,353 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=580976.6666666666, ans=0.125 2023-10-11 03:52:48,810 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.410e+02 1.656e+02 1.897e+02 2.087e+02 2.611e+02, threshold=3.793e+02, percent-clipped=0.0 2023-10-11 03:52:49,197 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=581070.0, ans=0.125 2023-10-11 03:53:13,242 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=581163.3333333334, ans=0.125 2023-10-11 03:53:25,124 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=581163.3333333334, ans=0.2 2023-10-11 03:53:35,282 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=581210.0, ans=0.0 2023-10-11 03:53:49,546 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=581303.3333333334, ans=0.125 2023-10-11 03:53:56,718 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=581303.3333333334, ans=0.125 2023-10-11 03:53:58,773 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=581350.0, ans=0.2 2023-10-11 03:54:09,288 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=581350.0, ans=10.0 2023-10-11 03:54:12,857 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, 
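Each scaling.py:199 line reports the resolved value (ans) of a named hyperparameter at the current batch_count, consistent with values defined as piecewise-linear functions of batch count. A minimal sketch of such a schedule; the breakpoints below are invented for illustration, since the log only shows resolved values:

    import bisect

    class ScheduledFloatSketch:
        """Float hyperparameter as a piecewise-linear function of
        batch_count, in the spirit of the ScheduledFloat entries."""

        def __init__(self, *points):
            self.points = sorted(points)      # e.g. (0, 0.2), (4000, 0.0)

        def value_at(self, batch_count: float) -> float:
            xs = [x for x, _ in self.points]
            i = bisect.bisect_right(xs, batch_count)
            if i == 0:
                return self.points[0][1]      # before first breakpoint
            if i == len(self.points):
                return self.points[-1][1]     # past last breakpoint
            (x0, y0), (x1, y1) = self.points[i - 1], self.points[i]
            return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)

    rate = ScheduledFloatSketch((0, 0.2), (4000, 0.0))
    print(rate.value_at(2000.0))              # -> 0.1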
batch_count=581396.6666666666, ans=0.0 2023-10-11 03:54:16,379 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=581396.6666666666, ans=0.1 2023-10-11 03:54:28,367 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=581443.3333333334, ans=0.2 2023-10-11 03:54:44,894 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.451e+02 1.719e+02 1.933e+02 2.196e+02 3.215e+02, threshold=3.867e+02, percent-clipped=0.0 2023-10-11 03:54:47,178 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=581536.6666666666, ans=0.0 2023-10-11 03:55:33,456 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.37 vs. limit=15.0 2023-10-11 03:55:35,400 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=581723.3333333334, ans=0.2 2023-10-11 03:55:47,958 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=581770.0, ans=0.125 2023-10-11 03:55:49,417 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.89 vs. limit=15.0 2023-10-11 03:55:59,299 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.20 vs. limit=22.5 2023-10-11 03:56:02,293 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=581816.6666666666, ans=0.1 2023-10-11 03:56:13,172 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-11 03:56:36,686 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=582003.3333333334, ans=0.0 2023-10-11 03:56:40,099 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.399e+02 1.658e+02 1.809e+02 1.985e+02 2.758e+02, threshold=3.618e+02, percent-clipped=0.0 2023-10-11 03:56:48,248 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.34 vs. limit=15.0 2023-10-11 03:57:18,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=582143.3333333334, ans=0.125 2023-10-11 03:57:31,093 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.99 vs. 
limit=15.0 2023-10-11 03:57:57,674 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=582330.0, ans=0.125 2023-10-11 03:58:10,592 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-11 03:58:15,953 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=582423.3333333334, ans=0.0 2023-10-11 03:58:26,924 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=582470.0, ans=0.125 2023-10-11 03:58:27,585 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.425e+02 1.702e+02 1.875e+02 2.183e+02 2.801e+02, threshold=3.749e+02, percent-clipped=0.0 2023-10-11 03:58:36,374 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=582516.6666666666, ans=0.125 2023-10-11 03:59:11,306 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=582610.0, ans=0.125 2023-10-11 03:59:37,066 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=582703.3333333334, ans=0.035 2023-10-11 03:59:39,095 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=582703.3333333334, ans=0.1 2023-10-11 03:59:48,612 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=582750.0, ans=0.125 2023-10-11 03:59:54,687 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=582750.0, ans=0.125 2023-10-11 03:59:55,995 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=582750.0, ans=0.015 2023-10-11 03:59:58,091 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=582796.6666666666, ans=0.0 2023-10-11 04:00:10,870 INFO [train.py:1031] (0/4) Epoch 10, batch 2000, loss[loss=0.2098, simple_loss=0.3021, pruned_loss=0.05873, over 16990.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2953, pruned_loss=0.05972, over 20765111.95 frames. ], batch size: 93, lr: 3.66e-03, grad_scale: 64.0 2023-10-11 04:00:31,596 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.26 vs. 
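The grad_scale field doubles from 32.0 at batch 1500 to 64.0 in the batch 2000 entry above, the signature of dynamic loss scaling for mixed-precision training: the scale grows after a run of overflow-free steps and is halved when gradients overflow. A sketch using PyTorch's stock GradScaler; the init_scale and the rest of the configuration are assumptions, as the log does not show how this run's scaler is set up:

    import torch

    scaler = torch.cuda.amp.GradScaler(init_scale=32.0)  # init value assumed

    def train_step(model, batch, optimizer, compute_loss):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = compute_loss(model, batch)
        scaler.scale(loss).backward()
        scaler.step(optimizer)   # skipped internally if grads overflowed
        scaler.update()          # grows or backs off the scale
        return loss.detach(), scaler.get_scale()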
limit=10.0 2023-10-11 04:00:39,820 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.337e+02 1.755e+02 1.912e+02 2.307e+02 3.265e+02, threshold=3.824e+02, percent-clipped=0.0 2023-10-11 04:01:05,633 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=583030.0, ans=0.2 2023-10-11 04:01:08,234 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=583030.0, ans=0.0 2023-10-11 04:01:17,944 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=583076.6666666666, ans=0.125 2023-10-11 04:01:29,746 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=583123.3333333334, ans=0.2 2023-10-11 04:01:29,777 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=583123.3333333334, ans=0.0 2023-10-11 04:01:42,605 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.34 vs. limit=6.0 2023-10-11 04:01:56,719 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=583216.6666666666, ans=0.125 2023-10-11 04:02:01,148 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=583216.6666666666, ans=0.0 2023-10-11 04:02:08,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=583263.3333333334, ans=0.0 2023-10-11 04:02:10,720 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.99 vs. 
limit=15.0 2023-10-11 04:03:02,449 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=583403.3333333334, ans=0.125 2023-10-11 04:03:04,973 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.284e+02 1.588e+02 1.730e+02 1.939e+02 2.885e+02, threshold=3.461e+02, percent-clipped=0.0 2023-10-11 04:03:19,815 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=583450.0, ans=0.125 2023-10-11 04:03:24,466 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=583450.0, ans=0.0 2023-10-11 04:03:25,534 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=583496.6666666666, ans=0.125 2023-10-11 04:03:34,133 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=583496.6666666666, ans=0.0 2023-10-11 04:03:44,375 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=583543.3333333334, ans=0.1 2023-10-11 04:03:56,326 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=583590.0, ans=0.0 2023-10-11 04:04:28,237 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=583730.0, ans=15.0 2023-10-11 04:04:32,445 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=583730.0, ans=0.125 2023-10-11 04:04:55,128 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=583823.3333333334, ans=0.2 2023-10-11 04:04:59,831 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=583870.0, ans=0.125 2023-10-11 04:05:01,460 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.339e+02 1.720e+02 1.865e+02 2.049e+02 3.044e+02, threshold=3.729e+02, percent-clipped=0.0 2023-10-11 04:05:06,013 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.01 vs. 
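Names such as conv_skip_rate, ff2_skip_rate and attention_skip_rate above (mostly resolved to 0.0 this deep into training) suggest scheduled probabilities of stochastically bypassing a submodule. A generic sketch of applying such a rate; the residual wiring is illustrative, not a claim about zipformer's internals:

    import torch

    def maybe_skip(module, x, skip_rate: float, training: bool):
        """Stochastically bypass a submodule with probability skip_rate."""
        if training and torch.rand(()) < skip_rate:
            return x             # skip: identity for this step
        return x + module(x)     # normal residual application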
limit=15.0 2023-10-11 04:05:15,323 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=583916.6666666666, ans=0.1 2023-10-11 04:05:19,542 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=583963.3333333334, ans=0.1 2023-10-11 04:05:27,438 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=583963.3333333334, ans=0.0 2023-10-11 04:05:29,285 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=583963.3333333334, ans=0.0 2023-10-11 04:05:29,313 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 04:05:32,230 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=584010.0, ans=0.0 2023-10-11 04:05:43,856 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=584056.6666666666, ans=0.125 2023-10-11 04:05:51,068 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=584056.6666666666, ans=0.0 2023-10-11 04:05:56,608 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=584103.3333333334, ans=0.125 2023-10-11 04:05:58,487 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=584103.3333333334, ans=0.125 2023-10-11 04:06:02,600 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=584150.0, ans=0.0 2023-10-11 04:06:38,243 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=584290.0, ans=0.0 2023-10-11 04:06:48,659 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.462e+02 1.723e+02 2.011e+02 2.239e+02 3.501e+02, threshold=4.023e+02, percent-clipped=0.0 2023-10-11 04:06:49,850 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=584336.6666666666, ans=0.125 2023-10-11 04:06:49,878 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=584336.6666666666, ans=0.125 2023-10-11 04:06:49,906 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=584336.6666666666, ans=0.2 2023-10-11 04:07:01,200 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=584383.3333333334, ans=0.04949747468305833 2023-10-11 04:07:02,383 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=584383.3333333334, ans=0.0 2023-10-11 04:07:03,657 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=584383.3333333334, ans=0.1 2023-10-11 04:07:17,238 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=584476.6666666666, ans=0.0 2023-10-11 04:07:21,916 INFO [scaling.py:1069] (0/4) WithLoss: 
name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 04:07:53,266 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=584616.6666666666, ans=0.025 2023-10-11 04:08:00,069 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=584616.6666666666, ans=0.125 2023-10-11 04:08:03,899 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=584663.3333333334, ans=0.125 2023-10-11 04:08:17,343 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=584710.0, ans=0.125 2023-10-11 04:08:40,555 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=584803.3333333334, ans=0.1 2023-10-11 04:08:41,254 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.454e+02 1.704e+02 2.016e+02 2.293e+02 3.044e+02, threshold=4.032e+02, percent-clipped=0.0 2023-10-11 04:08:43,600 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=584803.3333333334, ans=0.0 2023-10-11 04:08:47,420 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=584850.0, ans=0.125 2023-10-11 04:08:55,450 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 04:09:02,173 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=584896.6666666666, ans=0.125 2023-10-11 04:09:13,309 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.63 vs. limit=22.5 2023-10-11 04:09:14,927 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=584943.3333333334, ans=0.125 2023-10-11 04:09:21,276 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=584990.0, ans=0.2 2023-10-11 04:09:55,218 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=585130.0, ans=0.125 2023-10-11 04:10:02,553 INFO [train.py:1031] (0/4) Epoch 10, batch 2500, loss[loss=0.2225, simple_loss=0.3037, pruned_loss=0.07063, over 16670.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2955, pruned_loss=0.06006, over 23401958.10 frames. 
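The balancer entries carry scheduled constraints such as min_positive=0.025, min_abs=0.5 and max_abs=10.0 on per-channel activation statistics. A sketch of the statistics such a balancer would monitor, using thresholds taken from the log; the correction applied when a channel is out of range is not visible here:

    import torch

    def balancer_violations(x: torch.Tensor,
                            min_positive: float = 0.025,
                            max_abs: float = 10.0):
        """x: (num_frames, num_channels). Flags channels whose fraction of
        positive values falls below min_positive, or whose mean |activation|
        exceeds max_abs (both thresholds appear in the log)."""
        frac_positive = (x > 0).float().mean(dim=0)
        mean_abs = x.abs().mean(dim=0)
        return frac_positive < min_positive, mean_abs > max_abs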
], batch size: 61, lr: 3.65e-03, grad_scale: 32.0 2023-10-11 04:10:06,714 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=585176.6666666666, ans=0.1 2023-10-11 04:10:07,796 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=585176.6666666666, ans=0.125 2023-10-11 04:10:27,858 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.703e+02 1.941e+02 2.217e+02 3.338e+02, threshold=3.882e+02, percent-clipped=0.0 2023-10-11 04:10:33,953 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=585316.6666666666, ans=0.125 2023-10-11 04:10:35,496 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=585316.6666666666, ans=0.0 2023-10-11 04:10:44,671 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0 2023-10-11 04:10:57,903 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=585410.0, ans=0.125 2023-10-11 04:10:58,690 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=585410.0, ans=0.125 2023-10-11 04:11:06,648 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=585456.6666666666, ans=0.05 2023-10-11 04:11:11,439 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=585456.6666666666, ans=0.035 2023-10-11 04:11:15,022 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=585456.6666666666, ans=0.125 2023-10-11 04:11:39,271 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=585596.6666666666, ans=0.1 2023-10-11 04:12:07,557 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.93 vs. limit=12.0 2023-10-11 04:12:15,186 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=585736.6666666666, ans=0.1 2023-10-11 04:12:17,798 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.775e+02 1.953e+02 2.193e+02 3.268e+02, threshold=3.905e+02, percent-clipped=0.0 2023-10-11 04:12:19,708 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=585736.6666666666, ans=0.0 2023-10-11 04:12:19,744 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=585736.6666666666, ans=0.0 2023-10-11 04:12:46,303 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=585876.6666666666, ans=0.2 2023-10-11 04:12:58,940 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.12 vs. 
limit=15.0 2023-10-11 04:13:05,771 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=585923.3333333334, ans=0.0 2023-10-11 04:13:06,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=585970.0, ans=0.0 2023-10-11 04:13:10,030 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=585970.0, ans=0.2 2023-10-11 04:13:31,351 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=586063.3333333334, ans=0.125 2023-10-11 04:13:42,675 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=586110.0, ans=0.0 2023-10-11 04:13:48,053 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=586110.0, ans=0.125 2023-10-11 04:13:50,597 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=586110.0, ans=0.125 2023-10-11 04:13:56,777 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.77 vs. limit=15.0 2023-10-11 04:13:58,285 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=586156.6666666666, ans=0.125 2023-10-11 04:14:08,797 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=586203.3333333334, ans=0.125 2023-10-11 04:14:11,282 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.794e+02 1.980e+02 2.305e+02 3.152e+02, threshold=3.959e+02, percent-clipped=0.0 2023-10-11 04:14:22,683 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=586250.0, ans=0.125 2023-10-11 04:14:45,692 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.39 vs. limit=6.0 2023-10-11 04:14:51,161 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=586343.3333333334, ans=0.125 2023-10-11 04:14:56,633 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.45 vs. 
limit=22.5 2023-10-11 04:15:10,421 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=586436.6666666666, ans=0.125 2023-10-11 04:15:16,928 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=586436.6666666666, ans=0.05 2023-10-11 04:15:29,438 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=586483.3333333334, ans=0.125 2023-10-11 04:15:53,354 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=586623.3333333334, ans=0.0 2023-10-11 04:16:08,357 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=586670.0, ans=0.1 2023-10-11 04:16:08,961 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.242e+02 1.697e+02 1.881e+02 2.195e+02 4.343e+02, threshold=3.761e+02, percent-clipped=2.0 2023-10-11 04:16:13,776 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=586670.0, ans=0.125 2023-10-11 04:16:23,073 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=586716.6666666666, ans=0.125 2023-10-11 04:16:34,166 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.08 vs. limit=15.0 2023-10-11 04:16:43,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=586810.0, ans=0.2 2023-10-11 04:16:47,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=586810.0, ans=0.05 2023-10-11 04:16:50,820 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=586810.0, ans=0.125 2023-10-11 04:16:51,961 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=586856.6666666666, ans=0.125 2023-10-11 04:17:17,293 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=586903.3333333334, ans=0.125 2023-10-11 04:17:17,566 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.49 vs. limit=22.5 2023-10-11 04:17:34,529 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.82 vs. 
limit=22.5 2023-10-11 04:17:45,089 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=587043.3333333334, ans=0.2 2023-10-11 04:18:13,700 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.399e+02 1.671e+02 1.887e+02 2.150e+02 2.980e+02, threshold=3.774e+02, percent-clipped=0.0 2023-10-11 04:18:17,591 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=587136.6666666666, ans=0.0 2023-10-11 04:18:23,938 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=587183.3333333334, ans=0.0 2023-10-11 04:18:24,045 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=587183.3333333334, ans=0.125 2023-10-11 04:18:25,990 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.33 vs. limit=15.0 2023-10-11 04:18:40,683 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=587276.6666666666, ans=0.0 2023-10-11 04:18:41,734 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=587276.6666666666, ans=0.2 2023-10-11 04:18:44,369 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=587276.6666666666, ans=0.0 2023-10-11 04:18:51,283 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 04:19:03,559 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=587370.0, ans=0.125 2023-10-11 04:19:05,615 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=587370.0, ans=0.0 2023-10-11 04:19:07,359 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=587370.0, ans=0.2 2023-10-11 04:19:16,025 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=587416.6666666666, ans=0.1 2023-10-11 04:19:22,072 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.22 vs. limit=6.0 2023-10-11 04:19:22,813 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.88 vs. limit=6.0 2023-10-11 04:19:27,286 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=587463.3333333334, ans=0.0 2023-10-11 04:19:31,244 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=587463.3333333334, ans=0.125 2023-10-11 04:19:34,762 INFO [train.py:1031] (0/4) Epoch 10, batch 3000, loss[loss=0.2267, simple_loss=0.2792, pruned_loss=0.0871, over 12513.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2946, pruned_loss=0.06006, over 25450142.67 frames. 
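The bypass.scale_min, bypass_mid.scale_min and out_combiner.scale_min entries (ans=0.2) schedule a floor on a learned residual-mixing weight. One plausible form of such a bypass, with the per-channel scale clamped to that floor; the parameter shape and the exact interpolation are assumptions:

    import torch

    class Bypass(torch.nn.Module):
        """Learned interpolation between a block's input and output, with
        the mixing weight clamped to a scheduled floor (scale_min=0.2 in
        the log). The actual zipformer formulation may differ."""
        def __init__(self, num_channels: int, scale_min: float = 0.2):
            super().__init__()
            self.scale = torch.nn.Parameter(torch.ones(num_channels))
            self.scale_min = scale_min

        def forward(self, x, block_out):
            c = self.scale.clamp(min=self.scale_min, max=1.0)
            return x + c * (block_out - x)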
], batch size: 440, lr: 3.65e-03, grad_scale: 32.0 2023-10-11 04:19:45,161 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.33 vs. limit=22.5 2023-10-11 04:19:48,455 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=587556.6666666666, ans=0.125 2023-10-11 04:20:01,337 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.399e+02 1.750e+02 2.010e+02 2.236e+02 3.709e+02, threshold=4.019e+02, percent-clipped=0.0 2023-10-11 04:20:36,254 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=587743.3333333334, ans=0.2 2023-10-11 04:20:56,261 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.31 vs. limit=15.0 2023-10-11 04:21:03,740 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=587883.3333333334, ans=0.125 2023-10-11 04:21:08,889 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=587883.3333333334, ans=0.0 2023-10-11 04:21:32,357 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=587976.6666666666, ans=0.125 2023-10-11 04:21:52,147 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=588070.0, ans=0.125 2023-10-11 04:21:57,545 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=588070.0, ans=0.125 2023-10-11 04:21:58,024 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.718e+02 1.875e+02 2.075e+02 3.232e+02, threshold=3.750e+02, percent-clipped=0.0 2023-10-11 04:22:00,216 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=588070.0, ans=0.1 2023-10-11 04:22:15,768 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=588163.3333333334, ans=0.2 2023-10-11 04:22:20,341 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=588163.3333333334, ans=0.125 2023-10-11 04:22:42,764 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=588256.6666666666, ans=0.125 2023-10-11 04:22:52,293 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=588303.3333333334, ans=0.0 2023-10-11 04:22:56,461 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=588303.3333333334, ans=0.125 2023-10-11 04:23:12,952 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=588396.6666666666, ans=0.125 2023-10-11 04:23:16,390 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=588396.6666666666, ans=0.125 2023-10-11 04:23:46,008 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, 
metric=12.58 vs. limit=22.5 2023-10-11 04:23:49,697 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.290e+02 1.647e+02 1.817e+02 1.993e+02 2.964e+02, threshold=3.634e+02, percent-clipped=0.0 2023-10-11 04:24:19,883 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=588630.0, ans=0.125 2023-10-11 04:24:21,828 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=588676.6666666666, ans=0.5 2023-10-11 04:24:25,299 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=588676.6666666666, ans=0.0 2023-10-11 04:24:30,305 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.46 vs. limit=15.0 2023-10-11 04:24:33,037 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=588676.6666666666, ans=0.125 2023-10-11 04:24:37,651 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.51 vs. limit=15.0 2023-10-11 04:24:49,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=588770.0, ans=0.125 2023-10-11 04:24:58,402 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.41 vs. limit=15.0 2023-10-11 04:25:02,991 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=588816.6666666666, ans=0.0 2023-10-11 04:25:10,648 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=588816.6666666666, ans=0.2 2023-10-11 04:25:15,542 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=588863.3333333334, ans=0.125 2023-10-11 04:25:16,757 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=588863.3333333334, ans=0.1 2023-10-11 04:25:46,530 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.51 vs. limit=12.0 2023-10-11 04:25:51,542 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.440e+02 1.756e+02 1.920e+02 2.202e+02 2.872e+02, threshold=3.839e+02, percent-clipped=0.0 2023-10-11 04:25:51,954 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=589003.3333333334, ans=0.0 2023-10-11 04:25:57,116 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.63 vs. limit=5.0 2023-10-11 04:26:49,487 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=589236.6666666666, ans=0.125 2023-10-11 04:26:49,495 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=589236.6666666666, ans=0.0 2023-10-11 04:27:00,814 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.24 vs. 
limit=15.0 2023-10-11 04:27:01,651 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=589283.3333333334, ans=0.125 2023-10-11 04:27:26,337 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=5.66 vs. limit=15.0 2023-10-11 04:27:28,847 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=589423.3333333334, ans=0.0 2023-10-11 04:27:43,809 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=589470.0, ans=0.0 2023-10-11 04:27:47,375 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.460e+02 1.679e+02 1.845e+02 2.040e+02 3.311e+02, threshold=3.690e+02, percent-clipped=0.0 2023-10-11 04:28:05,999 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.86 vs. limit=10.0 2023-10-11 04:28:08,530 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=589563.3333333334, ans=0.125 2023-10-11 04:28:16,800 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=589610.0, ans=0.125 2023-10-11 04:28:37,797 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=589703.3333333334, ans=0.125 2023-10-11 04:28:38,779 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.79 vs. limit=15.0 2023-10-11 04:28:42,965 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=589703.3333333334, ans=0.125 2023-10-11 04:28:47,086 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 04:29:12,677 INFO [train.py:1031] (0/4) Epoch 10, batch 3500, loss[loss=0.2113, simple_loss=0.299, pruned_loss=0.06178, over 16573.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2942, pruned_loss=0.05989, over 27055199.81 frames. ], batch size: 267, lr: 3.64e-03, grad_scale: 32.0 2023-10-11 04:29:14,006 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 04:29:40,770 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.691e+02 1.840e+02 2.001e+02 2.941e+02, threshold=3.681e+02, percent-clipped=0.0 2023-10-11 04:29:43,980 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.79 vs. 
limit=15.0 2023-10-11 04:29:48,563 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=589983.3333333334, ans=0.125 2023-10-11 04:30:04,118 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=590030.0, ans=0.1 2023-10-11 04:30:14,389 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=590076.6666666666, ans=0.0 2023-10-11 04:30:16,053 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=590076.6666666666, ans=0.125 2023-10-11 04:30:23,038 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=590123.3333333334, ans=0.025 2023-10-11 04:30:42,299 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=590170.0, ans=0.1 2023-10-11 04:30:42,524 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.20 vs. limit=15.0 2023-10-11 04:31:02,388 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=590263.3333333334, ans=0.125 2023-10-11 04:31:03,294 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.38 vs. limit=15.0 2023-10-11 04:31:08,794 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=590263.3333333334, ans=0.0 2023-10-11 04:31:11,447 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=590263.3333333334, ans=0.125 2023-10-11 04:31:12,618 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=590310.0, ans=0.2 2023-10-11 04:31:13,882 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.75 vs. limit=6.0 2023-10-11 04:31:15,099 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=590310.0, ans=0.125 2023-10-11 04:31:16,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=590310.0, ans=0.0 2023-10-11 04:31:41,170 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.428e+02 1.736e+02 1.937e+02 2.184e+02 2.613e+02, threshold=3.874e+02, percent-clipped=0.0 2023-10-11 04:32:12,078 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=590543.3333333334, ans=0.125 2023-10-11 04:32:21,921 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=590543.3333333334, ans=0.125 2023-10-11 04:32:35,196 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.85 vs. 
limit=15.0 2023-10-11 04:32:38,389 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=590636.6666666666, ans=0.0 2023-10-11 04:32:50,483 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=590683.3333333334, ans=0.05 2023-10-11 04:33:11,506 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=590776.6666666666, ans=0.05 2023-10-11 04:33:18,165 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=590776.6666666666, ans=0.125 2023-10-11 04:33:37,161 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=590870.0, ans=0.125 2023-10-11 04:33:38,411 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=590870.0, ans=0.04949747468305833 2023-10-11 04:33:44,788 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.372e+02 1.669e+02 1.868e+02 2.127e+02 3.030e+02, threshold=3.736e+02, percent-clipped=0.0 2023-10-11 04:33:47,397 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.22 vs. limit=6.0 2023-10-11 04:33:48,326 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=590870.0, ans=0.125 2023-10-11 04:33:50,256 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=590916.6666666666, ans=0.0 2023-10-11 04:34:04,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=590963.3333333334, ans=0.125 2023-10-11 04:34:05,831 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=590963.3333333334, ans=0.125 2023-10-11 04:34:05,923 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=590963.3333333334, ans=0.1 2023-10-11 04:34:13,691 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.97 vs. limit=15.0 2023-10-11 04:34:22,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=591010.0, ans=0.125 2023-10-11 04:34:26,778 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=591056.6666666666, ans=0.125 2023-10-11 04:34:34,961 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=591056.6666666666, ans=0.1 2023-10-11 04:34:57,701 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.00 vs. limit=15.0 2023-10-11 04:35:16,867 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.63 vs. 
limit=15.0 2023-10-11 04:35:21,824 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=591243.3333333334, ans=0.0 2023-10-11 04:35:29,890 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=591290.0, ans=0.125 2023-10-11 04:35:36,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=591290.0, ans=0.125 2023-10-11 04:35:46,007 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.290e+02 1.618e+02 1.775e+02 1.986e+02 3.331e+02, threshold=3.551e+02, percent-clipped=0.0 2023-10-11 04:35:47,456 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=591336.6666666666, ans=0.125 2023-10-11 04:35:49,378 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=591336.6666666666, ans=0.1 2023-10-11 04:35:56,245 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=591383.3333333334, ans=0.125 2023-10-11 04:36:09,991 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=591430.0, ans=0.125 2023-10-11 04:36:14,385 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=591476.6666666666, ans=0.2 2023-10-11 04:36:20,870 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=12.20 vs. limit=15.0 2023-10-11 04:36:25,032 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.98 vs. limit=22.5 2023-10-11 04:36:37,225 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.73 vs. limit=6.0 2023-10-11 04:36:43,611 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=591570.0, ans=0.0 2023-10-11 04:36:50,996 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.52 vs. limit=6.0 2023-10-11 04:36:53,763 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=591616.6666666666, ans=0.125 2023-10-11 04:36:56,724 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.22 vs. limit=10.0 2023-10-11 04:37:11,525 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=591710.0, ans=0.125 2023-10-11 04:37:17,301 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 04:37:36,816 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.353e+02 1.703e+02 1.860e+02 2.243e+02 3.497e+02, threshold=3.720e+02, percent-clipped=0.0 2023-10-11 04:37:45,060 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.42 vs. 
limit=12.0 2023-10-11 04:37:57,722 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=591896.6666666666, ans=0.1 2023-10-11 04:38:13,946 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=591990.0, ans=0.2 2023-10-11 04:38:19,724 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=591990.0, ans=0.05 2023-10-11 04:38:30,805 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.44 vs. limit=15.0 2023-10-11 04:38:35,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=592036.6666666666, ans=0.0 2023-10-11 04:38:52,234 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.22 vs. limit=15.0 2023-10-11 04:38:59,577 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 04:39:00,184 INFO [train.py:1031] (0/4) Epoch 10, batch 4000, loss[loss=0.2232, simple_loss=0.276, pruned_loss=0.08524, over 12147.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.2935, pruned_loss=0.05982, over 28283713.26 frames. ], batch size: 440, lr: 3.63e-03, grad_scale: 32.0 2023-10-11 04:39:16,947 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=592223.3333333334, ans=0.125 2023-10-11 04:39:26,812 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=592270.0, ans=0.1 2023-10-11 04:39:29,993 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=592270.0, ans=0.125 2023-10-11 04:39:34,002 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.376e+02 1.729e+02 1.898e+02 2.095e+02 2.866e+02, threshold=3.796e+02, percent-clipped=0.0 2023-10-11 04:39:51,941 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=592363.3333333334, ans=0.125 2023-10-11 04:39:54,223 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.79 vs. limit=22.5 2023-10-11 04:40:00,606 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=592410.0, ans=0.125 2023-10-11 04:40:12,362 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=592456.6666666666, ans=0.0 2023-10-11 04:40:14,281 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=592456.6666666666, ans=0.125 2023-10-11 04:40:17,895 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.79 vs. 
limit=15.0 2023-10-11 04:40:21,024 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=592503.3333333334, ans=0.035 2023-10-11 04:40:27,368 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.13 vs. limit=15.0 2023-10-11 04:40:34,639 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=592550.0, ans=0.0 2023-10-11 04:40:37,675 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=592550.0, ans=0.125 2023-10-11 04:40:37,905 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.97 vs. limit=15.0 2023-10-11 04:40:56,378 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=592643.3333333334, ans=0.125 2023-10-11 04:41:00,364 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 04:41:16,799 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=592736.6666666666, ans=0.125 2023-10-11 04:41:25,032 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.699e+02 1.875e+02 2.140e+02 2.983e+02, threshold=3.751e+02, percent-clipped=0.0 2023-10-11 04:41:52,227 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=592876.6666666666, ans=0.125 2023-10-11 04:42:03,074 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=592876.6666666666, ans=0.125 2023-10-11 04:42:09,243 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=592923.3333333334, ans=0.125 2023-10-11 04:42:27,576 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=592970.0, ans=0.125 2023-10-11 04:42:34,556 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.18 vs. limit=15.0 2023-10-11 04:42:50,563 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.82 vs. limit=15.0 2023-10-11 04:42:50,598 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.82 vs. limit=15.0 2023-10-11 04:43:00,015 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=593063.3333333334, ans=0.0 2023-10-11 04:43:20,856 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=593156.6666666666, ans=0.125 2023-10-11 04:43:24,581 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.14 vs. limit=22.5 2023-10-11 04:43:28,512 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.32 vs. 
limit=8.0 2023-10-11 04:43:31,772 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=593203.3333333334, ans=0.2 2023-10-11 04:43:32,472 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.267e+02 1.699e+02 1.905e+02 2.275e+02 2.917e+02, threshold=3.809e+02, percent-clipped=0.0 2023-10-11 04:43:46,802 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=593250.0, ans=0.125 2023-10-11 04:43:47,950 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.32 vs. limit=6.0 2023-10-11 04:43:52,358 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.04 vs. limit=15.0 2023-10-11 04:43:56,880 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.08 vs. limit=22.5 2023-10-11 04:43:57,421 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=593296.6666666666, ans=0.125 2023-10-11 04:43:57,525 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=593296.6666666666, ans=0.125 2023-10-11 04:43:58,632 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=593296.6666666666, ans=22.5 2023-10-11 04:44:14,449 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=593390.0, ans=0.025 2023-10-11 04:44:44,379 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=593530.0, ans=0.2 2023-10-11 04:45:23,168 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.451e+02 1.687e+02 1.869e+02 2.093e+02 2.887e+02, threshold=3.738e+02, percent-clipped=0.0 2023-10-11 04:45:39,115 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=593763.3333333334, ans=0.2 2023-10-11 04:45:47,948 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=593763.3333333334, ans=0.125 2023-10-11 04:46:35,776 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=593950.0, ans=0.2 2023-10-11 04:46:42,198 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=593996.6666666666, ans=0.2 2023-10-11 04:47:00,334 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=594090.0, ans=0.1 2023-10-11 04:47:01,989 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.73 vs. 
limit=22.5 2023-10-11 04:47:15,262 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.756e+02 2.018e+02 2.217e+02 3.636e+02, threshold=4.035e+02, percent-clipped=0.0 2023-10-11 04:47:28,199 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=594183.3333333334, ans=0.025 2023-10-11 04:47:30,573 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=594183.3333333334, ans=0.125 2023-10-11 04:47:45,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=594230.0, ans=0.0 2023-10-11 04:47:51,726 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.39 vs. limit=10.0 2023-10-11 04:48:17,648 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=594370.0, ans=0.05 2023-10-11 04:48:19,339 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=594370.0, ans=0.2 2023-10-11 04:48:21,123 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=594370.0, ans=0.125 2023-10-11 04:48:27,470 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=594416.6666666666, ans=0.125 2023-10-11 04:48:27,744 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.66 vs. limit=6.0 2023-10-11 04:48:32,296 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=594416.6666666666, ans=0.125 2023-10-11 04:48:50,384 INFO [train.py:1031] (0/4) Epoch 10, batch 4500, loss[loss=0.1993, simple_loss=0.2929, pruned_loss=0.05289, over 16903.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2941, pruned_loss=0.05963, over 29292913.71 frames. ], batch size: 188, lr: 3.63e-03, grad_scale: 16.0 2023-10-11 04:48:57,709 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=594510.0, ans=0.125 2023-10-11 04:48:58,995 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.61 vs. limit=15.0 2023-10-11 04:49:15,588 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.06 vs. limit=10.0 2023-10-11 04:49:17,721 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 1.699e+02 1.963e+02 2.223e+02 2.946e+02, threshold=3.926e+02, percent-clipped=0.0 2023-10-11 04:49:26,870 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=594650.0, ans=0.0 2023-10-11 04:49:29,740 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.95 vs. 
limit=22.5 2023-10-11 04:49:40,336 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=594696.6666666666, ans=0.125 2023-10-11 04:49:52,681 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=594790.0, ans=0.125 2023-10-11 04:50:00,213 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=594790.0, ans=0.0 2023-10-11 04:50:17,617 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=594883.3333333334, ans=0.125 2023-10-11 04:50:20,931 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.58 vs. limit=10.0 2023-10-11 04:50:32,584 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.13 vs. limit=15.0 2023-10-11 04:50:43,041 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.71 vs. limit=15.0 2023-10-11 04:50:55,807 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 04:51:02,424 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.673e+02 1.883e+02 2.177e+02 3.004e+02, threshold=3.767e+02, percent-clipped=0.0 2023-10-11 04:51:08,666 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=595116.6666666666, ans=0.1 2023-10-11 04:51:11,969 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=595116.6666666666, ans=0.0 2023-10-11 04:51:20,340 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=595163.3333333334, ans=0.125 2023-10-11 04:51:27,977 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=595210.0, ans=0.2 2023-10-11 04:51:37,860 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=595256.6666666666, ans=0.125 2023-10-11 04:51:58,299 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.28 vs. 
limit=15.0 2023-10-11 04:52:14,574 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 04:52:19,890 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=595396.6666666666, ans=0.125 2023-10-11 04:52:39,027 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=595490.0, ans=0.0 2023-10-11 04:52:40,993 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 04:52:48,756 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=595536.6666666666, ans=0.125 2023-10-11 04:52:51,214 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.361e+02 1.701e+02 1.859e+02 2.118e+02 3.217e+02, threshold=3.718e+02, percent-clipped=0.0 2023-10-11 04:53:02,527 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=595583.3333333334, ans=0.0 2023-10-11 04:53:06,435 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=595630.0, ans=0.07 2023-10-11 04:53:09,330 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=595630.0, ans=0.125 2023-10-11 04:53:24,339 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=595676.6666666666, ans=0.125 2023-10-11 04:53:48,146 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.44 vs. limit=15.0 2023-10-11 04:53:51,671 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=595816.6666666666, ans=0.2 2023-10-11 04:54:22,286 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.21 vs. 
limit=12.0 2023-10-11 04:54:41,127 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 1.734e+02 1.903e+02 2.114e+02 2.782e+02, threshold=3.806e+02, percent-clipped=0.0 2023-10-11 04:55:04,759 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=596096.6666666666, ans=0.025 2023-10-11 04:55:05,701 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=596096.6666666666, ans=0.2 2023-10-11 04:55:07,566 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=596143.3333333334, ans=0.2 2023-10-11 04:55:19,206 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=596190.0, ans=0.125 2023-10-11 04:55:36,632 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=596236.6666666666, ans=0.0 2023-10-11 04:55:54,747 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=596330.0, ans=0.125 2023-10-11 04:55:58,741 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=596330.0, ans=0.125 2023-10-11 04:56:12,908 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.52 vs. limit=15.0 2023-10-11 04:56:30,143 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=596470.0, ans=0.125 2023-10-11 04:56:35,256 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.747e+02 2.020e+02 2.351e+02 3.391e+02, threshold=4.040e+02, percent-clipped=0.0 2023-10-11 04:57:09,078 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.36 vs. limit=12.0 2023-10-11 04:57:23,710 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=596656.6666666666, ans=0.0 2023-10-11 04:57:37,503 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 04:57:52,024 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=596796.6666666666, ans=0.0 2023-10-11 04:57:58,122 INFO [train.py:1031] (0/4) Epoch 10, batch 5000, loss[loss=0.1922, simple_loss=0.2515, pruned_loss=0.06645, over 12742.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2941, pruned_loss=0.0597, over 30086418.45 frames. 
], batch size: 440, lr: 3.62e-03, grad_scale: 32.0 2023-10-11 04:58:00,528 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=596843.3333333334, ans=0.2 2023-10-11 04:58:29,367 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=596936.6666666666, ans=0.0 2023-10-11 04:58:29,472 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=596936.6666666666, ans=0.0 2023-10-11 04:58:29,994 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.385e+02 1.751e+02 1.913e+02 2.182e+02 3.284e+02, threshold=3.825e+02, percent-clipped=0.0 2023-10-11 04:58:31,138 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=596936.6666666666, ans=0.125 2023-10-11 04:58:49,870 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=597030.0, ans=0.125 2023-10-11 04:58:55,067 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=597076.6666666666, ans=0.2 2023-10-11 04:59:00,883 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=11.15 vs. limit=12.0 2023-10-11 04:59:31,184 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=597216.6666666666, ans=0.2 2023-10-11 04:59:59,128 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-128000.pt 2023-10-11 05:00:08,571 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=597356.6666666666, ans=0.0 2023-10-11 05:00:09,750 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.72 vs. limit=6.0 2023-10-11 05:00:29,003 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.456e+02 1.798e+02 2.002e+02 2.254e+02 3.464e+02, threshold=4.003e+02, percent-clipped=0.0 2023-10-11 05:00:34,386 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=597450.0, ans=0.0 2023-10-11 05:00:47,487 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.86 vs. limit=12.0 2023-10-11 05:00:53,348 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=597496.6666666666, ans=0.125 2023-10-11 05:00:56,849 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=597543.3333333334, ans=0.0 2023-10-11 05:01:11,618 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=597590.0, ans=0.125 2023-10-11 05:01:12,853 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.95 vs. 
limit=15.0 2023-10-11 05:01:37,141 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=597683.3333333334, ans=0.0 2023-10-11 05:01:46,112 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=597730.0, ans=0.5 2023-10-11 05:01:56,959 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=597776.6666666666, ans=0.125 2023-10-11 05:02:00,704 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=597823.3333333334, ans=0.125 2023-10-11 05:02:12,008 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=597870.0, ans=0.125 2023-10-11 05:02:17,490 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.58 vs. limit=15.0 2023-10-11 05:02:20,546 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.361e+02 1.700e+02 1.862e+02 2.114e+02 3.185e+02, threshold=3.724e+02, percent-clipped=0.0 2023-10-11 05:02:27,437 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=597916.6666666666, ans=0.125 2023-10-11 05:02:30,397 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=597916.6666666666, ans=0.1 2023-10-11 05:02:33,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=597963.3333333334, ans=0.1 2023-10-11 05:02:42,701 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=597963.3333333334, ans=0.2 2023-10-11 05:02:51,315 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=598010.0, ans=0.125 2023-10-11 05:02:53,611 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=598010.0, ans=0.125 2023-10-11 05:02:59,641 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=598056.6666666666, ans=0.1 2023-10-11 05:03:34,552 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=598196.6666666666, ans=0.125 2023-10-11 05:03:44,762 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=598196.6666666666, ans=0.125 2023-10-11 05:03:46,761 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.94 vs. 
limit=15.0 2023-10-11 05:03:51,489 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=598243.3333333334, ans=0.125 2023-10-11 05:03:55,396 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=598243.3333333334, ans=0.125 2023-10-11 05:04:07,102 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=598336.6666666666, ans=0.09899494936611666 2023-10-11 05:04:13,628 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.85 vs. limit=10.0 2023-10-11 05:04:15,413 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.462e+02 1.668e+02 1.925e+02 2.208e+02 3.246e+02, threshold=3.850e+02, percent-clipped=0.0 2023-10-11 05:04:23,960 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=598383.3333333334, ans=0.125 2023-10-11 05:04:33,081 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 05:04:37,210 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.97 vs. limit=15.0 2023-10-11 05:05:03,780 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=598523.3333333334, ans=0.125 2023-10-11 05:05:13,602 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.91 vs. limit=15.0 2023-10-11 05:05:23,430 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.34 vs. limit=22.5 2023-10-11 05:05:27,151 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=598616.6666666666, ans=0.125 2023-10-11 05:05:31,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=598663.3333333334, ans=0.125 2023-10-11 05:05:51,856 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=598756.6666666666, ans=0.0 2023-10-11 05:05:54,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=598756.6666666666, ans=0.0 2023-10-11 05:06:03,770 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.45 vs. 
limit=15.0 2023-10-11 05:06:09,152 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.240e+02 1.652e+02 1.774e+02 1.971e+02 2.737e+02, threshold=3.548e+02, percent-clipped=0.0 2023-10-11 05:06:11,331 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=598850.0, ans=0.1 2023-10-11 05:06:30,459 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=598896.6666666666, ans=0.125 2023-10-11 05:06:36,914 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=598943.3333333334, ans=0.125 2023-10-11 05:06:51,825 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=598990.0, ans=0.125 2023-10-11 05:07:03,339 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=599036.6666666666, ans=0.0 2023-10-11 05:07:06,589 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.41 vs. limit=15.0 2023-10-11 05:07:14,967 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=599083.3333333334, ans=0.125 2023-10-11 05:07:15,337 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=13.47 vs. limit=15.0 2023-10-11 05:07:18,610 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=599130.0, ans=0.125 2023-10-11 05:07:25,284 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=599130.0, ans=0.1 2023-10-11 05:07:28,088 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=599130.0, ans=0.0 2023-10-11 05:07:29,643 INFO [train.py:1031] (0/4) Epoch 10, batch 5500, loss[loss=0.2033, simple_loss=0.2978, pruned_loss=0.05442, over 16611.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2939, pruned_loss=0.05952, over 30699027.94 frames. ], batch size: 241, lr: 3.61e-03, grad_scale: 32.0 2023-10-11 05:07:46,166 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.93 vs. limit=22.5 2023-10-11 05:07:58,586 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.346e+02 1.643e+02 1.816e+02 2.015e+02 2.758e+02, threshold=3.632e+02, percent-clipped=0.0 2023-10-11 05:08:10,418 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=599363.3333333334, ans=0.125 2023-10-11 05:08:22,653 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.59 vs. 
limit=15.0 2023-10-11 05:08:28,289 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=599410.0, ans=0.125 2023-10-11 05:08:30,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=599410.0, ans=0.2 2023-10-11 05:08:53,218 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=599503.3333333334, ans=0.125 2023-10-11 05:08:54,125 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=599503.3333333334, ans=0.125 2023-10-11 05:09:00,069 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=599550.0, ans=0.125 2023-10-11 05:09:02,774 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=599550.0, ans=0.0 2023-10-11 05:09:04,783 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=599550.0, ans=0.125 2023-10-11 05:09:05,575 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=599550.0, ans=0.1 2023-10-11 05:09:09,430 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=599596.6666666666, ans=10.0 2023-10-11 05:09:15,982 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 05:09:20,554 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.58 vs. limit=15.0 2023-10-11 05:09:25,905 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=599643.3333333334, ans=0.1 2023-10-11 05:09:26,790 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=599643.3333333334, ans=0.125 2023-10-11 05:09:49,792 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 1.718e+02 1.858e+02 2.090e+02 2.948e+02, threshold=3.716e+02, percent-clipped=0.0 2023-10-11 05:09:55,599 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=599783.3333333334, ans=0.1 2023-10-11 05:10:07,221 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.48 vs. limit=15.0 2023-10-11 05:10:20,053 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=599876.6666666666, ans=0.125 2023-10-11 05:10:23,565 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=599876.6666666666, ans=0.0 2023-10-11 05:10:26,279 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=599923.3333333334, ans=0.1 2023-10-11 05:10:36,535 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.55 vs. 
limit=15.0 2023-10-11 05:11:02,825 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.01 vs. limit=15.0 2023-10-11 05:11:15,994 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=600110.0, ans=0.0 2023-10-11 05:11:18,008 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.22 vs. limit=12.0 2023-10-11 05:11:22,892 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 05:11:29,571 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=600156.6666666666, ans=0.1 2023-10-11 05:11:33,648 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=600156.6666666666, ans=0.0 2023-10-11 05:11:44,050 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=600203.3333333334, ans=0.125 2023-10-11 05:11:47,419 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.66 vs. limit=10.0 2023-10-11 05:11:47,904 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.784e+02 1.975e+02 2.201e+02 3.508e+02, threshold=3.950e+02, percent-clipped=0.0 2023-10-11 05:12:05,032 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.57 vs. limit=15.0 2023-10-11 05:12:12,469 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=600343.3333333334, ans=0.0 2023-10-11 05:12:35,243 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=600436.6666666666, ans=0.0 2023-10-11 05:12:37,180 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=600436.6666666666, ans=0.0 2023-10-11 05:12:42,587 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=600436.6666666666, ans=0.125 2023-10-11 05:13:00,491 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=600530.0, ans=15.0 2023-10-11 05:13:09,010 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=600530.0, ans=0.0 2023-10-11 05:13:21,391 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=600623.3333333334, ans=0.125 2023-10-11 05:13:22,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=600623.3333333334, ans=0.1 2023-10-11 05:13:30,891 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=600623.3333333334, ans=0.125 2023-10-11 05:13:42,928 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.307e+02 1.623e+02 1.798e+02 2.109e+02 3.115e+02, threshold=3.597e+02, percent-clipped=0.0 2023-10-11 05:13:48,653 INFO [scaling.py:979] (0/4) Whitening: 
name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.96 vs. limit=10.0 2023-10-11 05:14:04,173 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 05:14:24,480 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=600856.6666666666, ans=0.1 2023-10-11 05:14:40,782 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=600950.0, ans=0.0 2023-10-11 05:14:44,844 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=600950.0, ans=0.125 2023-10-11 05:14:53,821 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=600996.6666666666, ans=0.2 2023-10-11 05:15:24,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=601136.6666666666, ans=0.125 2023-10-11 05:15:35,257 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.403e+02 1.674e+02 1.862e+02 2.039e+02 2.917e+02, threshold=3.723e+02, percent-clipped=0.0 2023-10-11 05:15:53,439 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=601230.0, ans=0.0 2023-10-11 05:15:58,616 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.45 vs. limit=22.5 2023-10-11 05:16:32,683 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=601416.6666666666, ans=0.0 2023-10-11 05:16:33,741 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=601416.6666666666, ans=0.1 2023-10-11 05:16:36,297 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=601416.6666666666, ans=0.125 2023-10-11 05:16:36,325 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=601416.6666666666, ans=0.125 2023-10-11 05:16:36,764 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.88 vs. limit=12.0 2023-10-11 05:16:55,386 INFO [train.py:1031] (0/4) Epoch 10, batch 6000, loss[loss=0.2029, simple_loss=0.2885, pruned_loss=0.0587, over 16898.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2941, pruned_loss=0.05963, over 31179871.83 frames. ], batch size: 116, lr: 3.60e-03, grad_scale: 32.0 2023-10-11 05:16:57,668 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.35 vs. 
limit=10.0 2023-10-11 05:17:24,140 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 05:17:28,726 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.372e+02 1.689e+02 1.881e+02 2.114e+02 2.953e+02, threshold=3.762e+02, percent-clipped=0.0 2023-10-11 05:17:37,037 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=601650.0, ans=0.0 2023-10-11 05:17:37,048 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=601650.0, ans=0.125 2023-10-11 05:17:47,771 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=601696.6666666666, ans=0.2 2023-10-11 05:17:51,822 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=601696.6666666666, ans=0.0 2023-10-11 05:18:05,239 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.49 vs. limit=22.5 2023-10-11 05:18:05,899 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=601790.0, ans=0.04949747468305833 2023-10-11 05:18:10,675 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=601790.0, ans=0.2 2023-10-11 05:18:30,953 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=601883.3333333334, ans=0.125 2023-10-11 05:18:32,721 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=601883.3333333334, ans=0.0 2023-10-11 05:18:40,428 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=601930.0, ans=0.125 2023-10-11 05:18:42,867 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.45 vs. limit=15.0 2023-10-11 05:18:43,404 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=601930.0, ans=0.125 2023-10-11 05:18:51,636 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.18 vs. limit=15.0 2023-10-11 05:19:17,042 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.15 vs. 
limit=22.5 2023-10-11 05:19:18,340 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.445e+02 1.680e+02 1.916e+02 2.120e+02 3.341e+02, threshold=3.832e+02, percent-clipped=0.0 2023-10-11 05:19:23,011 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=602116.6666666666, ans=0.2 2023-10-11 05:19:29,585 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=602116.6666666666, ans=0.05 2023-10-11 05:19:30,776 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=602163.3333333334, ans=0.1 2023-10-11 05:19:49,144 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=602210.0, ans=0.2 2023-10-11 05:20:17,653 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.62 vs. limit=15.0 2023-10-11 05:20:19,802 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.31 vs. limit=22.5 2023-10-11 05:20:39,509 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=602443.3333333334, ans=0.0 2023-10-11 05:20:50,482 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.42 vs. limit=15.0 2023-10-11 05:20:51,582 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=602490.0, ans=0.0 2023-10-11 05:20:51,940 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.43 vs. limit=15.0 2023-10-11 05:21:09,084 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=602536.6666666666, ans=0.125 2023-10-11 05:21:10,519 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.364e+02 1.725e+02 1.903e+02 2.051e+02 3.480e+02, threshold=3.805e+02, percent-clipped=0.0 2023-10-11 05:21:40,949 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=602676.6666666666, ans=0.1 2023-10-11 05:21:49,338 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.21 vs. limit=15.0 2023-10-11 05:22:13,303 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.38 vs. limit=15.0 2023-10-11 05:22:37,349 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=602910.0, ans=0.1 2023-10-11 05:22:39,782 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.83 vs. limit=15.0 2023-10-11 05:22:55,255 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=603003.3333333334, ans=0.1 2023-10-11 05:23:02,030 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.14 vs. 
limit=15.0 2023-10-11 05:23:05,991 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.455e+02 1.781e+02 1.993e+02 2.269e+02 3.668e+02, threshold=3.985e+02, percent-clipped=0.0 2023-10-11 05:23:21,792 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=603096.6666666666, ans=0.0 2023-10-11 05:23:22,013 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.75 vs. limit=10.0 2023-10-11 05:23:30,634 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=603096.6666666666, ans=0.1 2023-10-11 05:23:38,961 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=603143.3333333334, ans=0.0 2023-10-11 05:23:42,258 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=603143.3333333334, ans=0.125 2023-10-11 05:23:57,753 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=603190.0, ans=0.1 2023-10-11 05:24:07,265 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.69 vs. limit=5.0 2023-10-11 05:24:10,682 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.14 vs. limit=15.0 2023-10-11 05:24:33,578 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=603330.0, ans=0.0 2023-10-11 05:24:43,388 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=603376.6666666666, ans=0.125 2023-10-11 05:24:54,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=603423.3333333334, ans=0.125 2023-10-11 05:25:07,611 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=603470.0, ans=0.1 2023-10-11 05:25:10,815 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.667e+02 1.902e+02 2.282e+02 4.065e+02, threshold=3.804e+02, percent-clipped=1.0 2023-10-11 05:25:26,297 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=603563.3333333334, ans=0.125 2023-10-11 05:26:31,212 INFO [train.py:1031] (0/4) Epoch 10, batch 6500, loss[loss=0.1958, simple_loss=0.2773, pruned_loss=0.05714, over 15500.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2947, pruned_loss=0.05983, over 31532143.05 frames. 
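], batch size: 35, lr: 3.60e-03, grad_scale: 32.0

The loss fields in these train.py lines decompose as configured at startup (simple_loss_scale: 0.5, am_scale: 0.0, lm_scale: 0.25): past warm-up, the reported loss is the scaled simple (linear-boundary) loss plus the pruned RNN-T loss. A minimal sketch of that bookkeeping, assuming only the post-warm-up combination (train.py's warm-up interpolation of the two scales is ignored here):

    # Sketch: relation between the logged loss fields, assuming
    # loss = simple_loss_scale * simple_loss + pruned_loss after warm-up
    # (simple_loss_scale = 0.5 in this run's config).
    SIMPLE_LOSS_SCALE = 0.5

    def combined_loss(simple_loss: float, pruned_loss: float) -> float:
        return SIMPLE_LOSS_SCALE * simple_loss + pruned_loss

    # The "batch 6500" record above: 0.5 * 0.2947 + 0.05983 = 0.20718,
    # which rounds to the reported tot_loss of 0.2072.
    assert abs(combined_loss(0.2947, 0.05983) - 0.2072) < 5e-4

The same identity holds for the other epoch-10 records in this section (e.g. 0.5 * 0.2941 + 0.05963 = 0.20668, against the reported 0.2067 at batch 6000), so the decomposition is consistent throughout.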
2023-10-11 05:26:35,126 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=603843.3333333334, ans=0.125 2023-10-11 05:26:36,283 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=603843.3333333334, ans=0.125 2023-10-11 05:26:42,826 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=603843.3333333334, ans=0.0 2023-10-11 05:26:47,833 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=603890.0, ans=0.125 2023-10-11 05:27:10,108 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.402e+02 1.719e+02 1.863e+02 2.138e+02 2.711e+02, threshold=3.726e+02, percent-clipped=0.0 2023-10-11 05:27:32,809 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.63 vs. limit=10.0 2023-10-11 05:27:41,690 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=604076.6666666666, ans=0.0 2023-10-11 05:27:59,883 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=604123.3333333334, ans=0.125 2023-10-11 05:28:00,839 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=604123.3333333334, ans=0.125 2023-10-11 05:28:03,540 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=604170.0, ans=0.1 2023-10-11 05:28:11,242 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=604170.0, ans=0.125 2023-10-11 05:28:30,236 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.36 vs. limit=15.0 2023-10-11 05:28:33,015 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.20 vs. limit=12.0 2023-10-11 05:28:52,439 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=604356.6666666666, ans=0.0 2023-10-11 05:29:06,954 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.322e+02 1.788e+02 1.944e+02 2.280e+02 3.318e+02, threshold=3.889e+02, percent-clipped=0.0 2023-10-11 05:29:08,155 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=604450.0, ans=0.125 2023-10-11 05:29:28,202 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.46 vs.
limit=15.0 2023-10-11 05:29:54,888 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=604636.6666666666, ans=0.125 2023-10-11 05:29:56,731 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=604636.6666666666, ans=0.1 2023-10-11 05:30:02,637 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=604636.6666666666, ans=0.1 2023-10-11 05:30:14,360 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.64 vs. limit=10.0 2023-10-11 05:30:50,878 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.05 vs. limit=15.0 2023-10-11 05:30:57,133 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=604870.0, ans=0.125 2023-10-11 05:30:58,581 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.330e+02 1.688e+02 1.822e+02 2.015e+02 2.853e+02, threshold=3.644e+02, percent-clipped=0.0 2023-10-11 05:30:58,840 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 05:31:16,136 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.70 vs. limit=22.5 2023-10-11 05:31:27,160 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=605010.0, ans=0.125 2023-10-11 05:31:33,393 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.77 vs. limit=15.0 2023-10-11 05:31:37,288 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=605056.6666666666, ans=0.95 2023-10-11 05:31:46,350 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=605103.3333333334, ans=0.0 2023-10-11 05:32:01,455 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.72 vs. limit=12.0 2023-10-11 05:32:02,703 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=605150.0, ans=0.125 2023-10-11 05:32:42,621 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=605290.0, ans=0.125 2023-10-11 05:33:09,440 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.351e+02 1.603e+02 1.742e+02 1.942e+02 3.490e+02, threshold=3.485e+02, percent-clipped=0.0 2023-10-11 05:33:14,535 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.78 vs. 
limit=15.0 2023-10-11 05:33:31,384 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=605430.0, ans=0.0 2023-10-11 05:33:40,766 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=605476.6666666666, ans=0.1 2023-10-11 05:33:45,044 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=605523.3333333334, ans=0.125 2023-10-11 05:33:54,557 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=605523.3333333334, ans=0.2 2023-10-11 05:34:04,537 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=605570.0, ans=0.2 2023-10-11 05:34:04,545 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=605570.0, ans=0.125 2023-10-11 05:34:18,476 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=605616.6666666666, ans=0.1 2023-10-11 05:34:20,732 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.71 vs. limit=15.0 2023-10-11 05:34:52,132 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=605803.3333333334, ans=0.2 2023-10-11 05:34:54,999 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=605803.3333333334, ans=0.0 2023-10-11 05:35:01,818 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.272e+02 1.657e+02 1.848e+02 2.233e+02 3.231e+02, threshold=3.697e+02, percent-clipped=0.0 2023-10-11 05:35:15,784 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.88 vs. limit=15.0 2023-10-11 05:35:26,872 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=605943.3333333334, ans=0.07 2023-10-11 05:35:27,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=605943.3333333334, ans=0.2 2023-10-11 05:35:27,889 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.42 vs. limit=15.0 2023-10-11 05:35:36,233 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=605990.0, ans=0.0 2023-10-11 05:35:57,542 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=606083.3333333334, ans=0.0 2023-10-11 05:36:01,590 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.85 vs. limit=15.0 2023-10-11 05:36:16,551 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=606176.6666666666, ans=0.125 2023-10-11 05:36:17,222 INFO [train.py:1031] (0/4) Epoch 10, batch 7000, loss[loss=0.2233, simple_loss=0.313, pruned_loss=0.06678, over 16570.00 frames. 
], tot_loss[loss=0.2072, simple_loss=0.2952, pruned_loss=0.05965, over 31834069.82 frames. ], batch size: 219, lr: 3.59e-03, grad_scale: 32.0 2023-10-11 05:36:18,387 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=606176.6666666666, ans=0.125 2023-10-11 05:36:29,270 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=606223.3333333334, ans=0.125 2023-10-11 05:36:29,338 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=606223.3333333334, ans=0.125 2023-10-11 05:36:37,723 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=606223.3333333334, ans=0.0 2023-10-11 05:36:48,054 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.70 vs. limit=22.5 2023-10-11 05:36:54,203 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.377e+02 1.722e+02 1.898e+02 2.060e+02 2.695e+02, threshold=3.796e+02, percent-clipped=0.0 2023-10-11 05:36:58,910 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=606316.6666666666, ans=0.0 2023-10-11 05:37:03,184 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.95 vs. limit=15.0 2023-10-11 05:37:15,579 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=606363.3333333334, ans=0.125 2023-10-11 05:37:23,096 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.37 vs. limit=15.0 2023-10-11 05:37:24,430 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=606410.0, ans=0.0 2023-10-11 05:37:31,610 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=606456.6666666666, ans=0.125 2023-10-11 05:37:38,171 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.75 vs. limit=15.0 2023-10-11 05:37:57,485 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=606550.0, ans=0.125 2023-10-11 05:38:05,807 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=606596.6666666666, ans=0.125 2023-10-11 05:38:14,853 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=606643.3333333334, ans=0.05 2023-10-11 05:38:18,348 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.44 vs. 
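limit=10.0

The scaling.py:979 lines report, per Whiten module, a measure of how far the (grouped) feature covariance is from isotropic, against the limit above which the module is expected to start correcting gradients toward whiter activations. The exact metric is defined in scaling.py; the sketch below uses an assumed stand-in, the eigenvalue-dispersion ratio mean(eig^2)/mean(eig)^2, which is 1.0 for a perfectly white covariance and grows as variance concentrates in a few directions:

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int) -> float:
        # x: (num_frames, num_channels), grouped as in the log lines
        # (e.g. num_groups=1, num_channels=144 in the record just above).
        # Assumed metric: mean(eig^2) / mean(eig)^2 of the group covariance,
        # computed via trace identities so no eigendecomposition is needed.
        num_frames, num_channels = x.shape
        c = num_channels // num_groups
        xg = x.reshape(num_frames, num_groups, c).transpose(0, 1)  # (g, n, c)
        cov = xg.transpose(1, 2) @ xg / num_frames                 # (g, c, c)
        mean_sq_eig = (cov * cov).sum(dim=(1, 2)) / c              # trace(cov^2) / c
        sq_mean_eig = (cov.diagonal(dim1=1, dim2=2).sum(dim=1) / c) ** 2
        return (mean_sq_eig / sq_mean_eig).mean().item()

A metric of 4.44 against a limit of 10.0, as in the record just completed above, indicates a moderately anisotropic covariance that is still inside the band where the module leaves gradients alone.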
2023-10-11 05:38:19,703 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=606643.3333333334, ans=10.0 2023-10-11 05:38:26,797 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=606690.0, ans=0.0 2023-10-11 05:38:33,218 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=606736.6666666666, ans=0.2 2023-10-11 05:38:43,572 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.749e+02 2.024e+02 2.406e+02 3.274e+02, threshold=4.047e+02, percent-clipped=0.0 2023-10-11 05:38:58,867 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.31 vs. limit=10.0 2023-10-11 05:39:08,425 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=606876.6666666666, ans=0.125 2023-10-11 05:39:15,767 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=606876.6666666666, ans=0.0 2023-10-11 05:39:20,247 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=606923.3333333334, ans=0.2 2023-10-11 05:39:43,714 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=607016.6666666666, ans=0.025 2023-10-11 05:39:58,641 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.62 vs. limit=15.0 2023-10-11 05:40:48,767 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.763e+02 1.918e+02 2.092e+02 3.391e+02, threshold=3.836e+02, percent-clipped=0.0 2023-10-11 05:41:19,776 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=607343.3333333334, ans=0.0 2023-10-11 05:41:30,154 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=607390.0, ans=0.2 2023-10-11 05:41:38,393 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=607436.6666666666, ans=0.1 2023-10-11 05:42:11,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=607576.6666666666, ans=0.125 2023-10-11 05:42:26,903 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.56 vs.
limit=15.0 2023-10-11 05:42:47,303 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.417e+02 1.668e+02 1.796e+02 2.002e+02 2.838e+02, threshold=3.592e+02, percent-clipped=0.0 2023-10-11 05:43:06,475 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_na.min_abs, batch_count=607763.3333333334, ans=0.02 2023-10-11 05:43:14,107 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=607810.0, ans=0.1 2023-10-11 05:43:15,350 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=607810.0, ans=0.0 2023-10-11 05:43:35,237 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=607903.3333333334, ans=0.125 2023-10-11 05:44:13,121 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=608043.3333333334, ans=0.125 2023-10-11 05:44:20,911 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=608090.0, ans=0.0 2023-10-11 05:44:23,466 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 05:44:38,868 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.316e+02 1.745e+02 2.046e+02 2.441e+02 3.789e+02, threshold=4.092e+02, percent-clipped=3.0 2023-10-11 05:44:46,698 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=608183.3333333334, ans=0.0 2023-10-11 05:44:53,495 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=608230.0, ans=0.0 2023-10-11 05:44:53,544 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=608230.0, ans=0.1 2023-10-11 05:44:54,377 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=608230.0, ans=0.2 2023-10-11 05:45:00,584 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.41 vs. limit=15.0 2023-10-11 05:45:20,146 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=608323.3333333334, ans=0.125 2023-10-11 05:45:27,736 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=608370.0, ans=0.125 2023-10-11 05:45:31,222 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=608370.0, ans=0.125 2023-10-11 05:45:38,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=608416.6666666666, ans=0.125 2023-10-11 05:45:39,325 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=608416.6666666666, ans=0.125 2023-10-11 05:45:55,060 INFO [train.py:1031] (0/4) Epoch 10, batch 7500, loss[loss=0.2222, simple_loss=0.3083, pruned_loss=0.0681, over 16964.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2951, pruned_loss=0.05974, over 32049705.03 frames. 
], batch size: 77, lr: 3.58e-03, grad_scale: 32.0 2023-10-11 05:45:58,728 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=608510.0, ans=0.0 2023-10-11 05:46:05,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=608556.6666666666, ans=0.2 2023-10-11 05:46:19,941 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=608603.3333333334, ans=0.125 2023-10-11 05:46:27,588 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=608603.3333333334, ans=0.125 2023-10-11 05:46:28,275 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.424e+02 1.745e+02 1.943e+02 2.282e+02 3.984e+02, threshold=3.887e+02, percent-clipped=0.0 2023-10-11 05:46:28,522 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=608650.0, ans=0.125 2023-10-11 05:46:46,404 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=608696.6666666666, ans=0.0 2023-10-11 05:47:07,397 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=608790.0, ans=0.0 2023-10-11 05:47:12,510 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=608790.0, ans=0.2 2023-10-11 05:47:21,182 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=608836.6666666666, ans=0.1 2023-10-11 05:47:32,651 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=608883.3333333334, ans=0.2 2023-10-11 05:47:39,799 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.49 vs. limit=12.0 2023-10-11 05:47:49,563 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.34 vs. 
limit=15.0 2023-10-11 05:47:51,173 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=608976.6666666666, ans=0.125 2023-10-11 05:47:59,951 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=609023.3333333334, ans=0.125 2023-10-11 05:48:13,527 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=609070.0, ans=0.125 2023-10-11 05:48:16,823 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=609070.0, ans=10.0 2023-10-11 05:48:23,230 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=609070.0, ans=0.0 2023-10-11 05:48:23,978 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.402e+02 1.676e+02 1.857e+02 2.112e+02 2.881e+02, threshold=3.714e+02, percent-clipped=0.0 2023-10-11 05:49:44,451 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=609396.6666666666, ans=0.0 2023-10-11 05:49:50,548 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.54 vs. limit=15.0 2023-10-11 05:50:24,688 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=609536.6666666666, ans=0.125 2023-10-11 05:50:27,391 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.307e+02 1.636e+02 1.793e+02 2.043e+02 3.357e+02, threshold=3.586e+02, percent-clipped=0.0 2023-10-11 05:51:02,224 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=609723.3333333334, ans=0.0 2023-10-11 05:51:25,725 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=609816.6666666666, ans=0.04949747468305833 2023-10-11 05:51:41,100 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.71 vs. limit=6.0 2023-10-11 05:51:50,942 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=609910.0, ans=0.0 2023-10-11 05:51:54,451 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=609956.6666666666, ans=0.125 2023-10-11 05:52:20,217 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.343e+02 1.692e+02 1.850e+02 2.078e+02 3.142e+02, threshold=3.700e+02, percent-clipped=0.0 2023-10-11 05:52:29,818 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=11.57 vs. 
limit=22.5 2023-10-11 05:52:51,101 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=610143.3333333334, ans=0.0 2023-10-11 05:52:55,692 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=610190.0, ans=0.1 2023-10-11 05:52:57,566 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=610190.0, ans=0.125 2023-10-11 05:53:03,558 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=610190.0, ans=0.1 2023-10-11 05:53:04,572 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=610190.0, ans=0.0 2023-10-11 05:53:40,331 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.50 vs. limit=6.0 2023-10-11 05:54:15,612 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.740e+02 1.948e+02 2.213e+02 3.755e+02, threshold=3.896e+02, percent-clipped=1.0 2023-10-11 05:54:47,723 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=610610.0, ans=0.0 2023-10-11 05:55:01,452 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=610656.6666666666, ans=0.0 2023-10-11 05:55:07,412 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=610703.3333333334, ans=0.1 2023-10-11 05:55:37,074 INFO [train.py:1031] (0/4) Epoch 10, batch 8000, loss[loss=0.1765, simple_loss=0.2759, pruned_loss=0.03859, over 16975.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.2944, pruned_loss=0.05923, over 32196267.52 frames. ], batch size: 93, lr: 3.58e-03, grad_scale: 32.0 2023-10-11 05:55:41,067 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=610843.3333333334, ans=0.1 2023-10-11 05:55:50,167 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.20 vs. limit=22.5 2023-10-11 05:56:00,847 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=610936.6666666666, ans=0.125 2023-10-11 05:56:03,190 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.11 vs. limit=15.0 2023-10-11 05:56:11,119 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.354e+02 1.630e+02 1.875e+02 2.146e+02 3.389e+02, threshold=3.750e+02, percent-clipped=0.0 2023-10-11 05:56:11,702 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.16 vs. 
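limit=22.5

In the optim.py:471 lines, the clipping threshold is Clipping_scale times the running median of recent gradient norms: in the quartile record just above, 2.0 * 1.875e+02 = 3.750e+02, exactly the printed threshold, and percent-clipped reports how often a batch's norm exceeded it. A minimal sketch of that rule, assuming a plain sliding window of norms in place of ScaledAdam's actual internal buffering:

    from collections import deque

    import torch

    class MedianGradClipper:
        # Sketch only: threshold = clipping_scale * median(recent grad norms),
        # mirroring the quartile lines in this log; the fixed-size window
        # is an assumption, not ScaledAdam's real bookkeeping.
        def __init__(self, clipping_scale: float = 2.0, window: int = 200):
            self.scale = clipping_scale
            self.norms = deque(maxlen=window)

        def clip_(self, params: list) -> float:
            grads = [p.grad.reshape(-1) for p in params if p.grad is not None]
            norm = torch.cat(grads).norm().item()
            self.norms.append(norm)
            threshold = self.scale * torch.tensor(list(self.norms)).median().item()
            if norm > threshold:  # this batch would count toward percent-clipped
                for p in params:
                    if p.grad is not None:
                        p.grad.mul_(threshold / norm)
            return norm

With percent-clipped at 0.0 in almost every record of this section, the thresholds are loose enough that the clipper rarely touches training.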
2023-10-11 05:56:23,361 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=611030.0, ans=0.125 2023-10-11 05:56:29,712 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=611030.0, ans=0.025 2023-10-11 05:56:45,154 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=611123.3333333334, ans=0.1 2023-10-11 05:56:49,869 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=611123.3333333334, ans=0.1 2023-10-11 05:57:17,923 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=611263.3333333334, ans=0.07 2023-10-11 05:57:21,347 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=611263.3333333334, ans=0.04949747468305833 2023-10-11 05:57:24,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=611310.0, ans=0.1 2023-10-11 05:57:24,831 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.96 vs. limit=15.0 2023-10-11 05:57:44,618 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=611403.3333333334, ans=0.0 2023-10-11 05:57:56,306 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.380e+02 1.719e+02 1.872e+02 2.136e+02 2.984e+02, threshold=3.744e+02, percent-clipped=0.0 2023-10-11 05:57:59,783 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=611450.0, ans=0.0 2023-10-11 05:58:01,697 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=611450.0, ans=0.125 2023-10-11 05:58:05,623 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=611450.0, ans=0.0 2023-10-11 05:58:09,570 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=611496.6666666666, ans=0.125 2023-10-11 05:58:31,551 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.95 vs.
limit=15.0 2023-10-11 05:58:35,646 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=611543.3333333334, ans=0.07 2023-10-11 05:58:53,676 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=611636.6666666666, ans=0.025 2023-10-11 05:58:55,230 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=611636.6666666666, ans=0.125 2023-10-11 05:58:57,079 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=611636.6666666666, ans=0.0 2023-10-11 05:59:02,573 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=611636.6666666666, ans=0.125 2023-10-11 05:59:07,785 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.06 vs. limit=15.0 2023-10-11 05:59:15,009 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 05:59:27,377 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten.whitening_limit, batch_count=611730.0, ans=15.0 2023-10-11 05:59:29,371 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=611730.0, ans=0.125 2023-10-11 05:59:30,554 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=611776.6666666666, ans=0.1 2023-10-11 05:59:36,968 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=611776.6666666666, ans=0.125 2023-10-11 05:59:55,923 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=611870.0, ans=0.125 2023-10-11 06:00:00,235 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=611870.0, ans=0.1 2023-10-11 06:00:04,452 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.336e+02 1.712e+02 1.885e+02 2.119e+02 3.410e+02, threshold=3.770e+02, percent-clipped=0.0 2023-10-11 06:00:10,422 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.18 vs. limit=15.0 2023-10-11 06:00:23,789 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.14 vs. 
limit=22.5 2023-10-11 06:00:50,182 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=612103.3333333334, ans=0.125 2023-10-11 06:00:53,802 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=612103.3333333334, ans=0.1 2023-10-11 06:01:01,681 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=612150.0, ans=0.125 2023-10-11 06:01:01,714 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=612150.0, ans=0.125 2023-10-11 06:01:08,252 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=612196.6666666666, ans=0.0 2023-10-11 06:01:14,705 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=612196.6666666666, ans=0.0 2023-10-11 06:01:36,616 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.71 vs. limit=12.0 2023-10-11 06:01:39,071 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=612290.0, ans=0.2 2023-10-11 06:01:45,683 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=612290.0, ans=0.2 2023-10-11 06:01:59,347 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 1.618e+02 1.812e+02 2.014e+02 2.615e+02, threshold=3.624e+02, percent-clipped=0.0 2023-10-11 06:02:12,809 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=612430.0, ans=0.125 2023-10-11 06:02:18,993 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=612430.0, ans=0.1 2023-10-11 06:02:31,851 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=612523.3333333334, ans=0.125 2023-10-11 06:02:35,479 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=612523.3333333334, ans=0.125 2023-10-11 06:02:39,421 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=612523.3333333334, ans=0.2 2023-10-11 06:02:44,641 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.97 vs. limit=22.5 2023-10-11 06:03:37,249 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=612756.6666666666, ans=0.125 2023-10-11 06:03:57,213 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.382e+02 1.654e+02 1.751e+02 1.907e+02 2.922e+02, threshold=3.502e+02, percent-clipped=0.0 2023-10-11 06:04:01,206 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.33 vs. 
limit=22.5 2023-10-11 06:04:11,208 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=612896.6666666666, ans=0.1 2023-10-11 06:04:24,053 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.96 vs. limit=15.0 2023-10-11 06:04:29,000 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=612990.0, ans=0.125 2023-10-11 06:04:31,993 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.77 vs. limit=15.0 2023-10-11 06:04:33,559 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=612990.0, ans=0.125 2023-10-11 06:05:17,900 INFO [train.py:1031] (0/4) Epoch 10, batch 8500, loss[loss=0.1882, simple_loss=0.2775, pruned_loss=0.04939, over 16906.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.2947, pruned_loss=0.05923, over 32339509.38 frames. ], batch size: 77, lr: 3.57e-03, grad_scale: 32.0 2023-10-11 06:05:23,889 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=613176.6666666666, ans=0.125 2023-10-11 06:05:27,909 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=613176.6666666666, ans=0.125 2023-10-11 06:05:34,423 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=613223.3333333334, ans=0.1 2023-10-11 06:05:41,644 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=613270.0, ans=0.125 2023-10-11 06:05:53,711 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.whiten.whitening_limit, batch_count=613316.6666666666, ans=12.0 2023-10-11 06:05:54,627 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.462e+02 1.810e+02 2.011e+02 2.393e+02 3.434e+02, threshold=4.022e+02, percent-clipped=0.0 2023-10-11 06:05:55,104 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=613316.6666666666, ans=0.125 2023-10-11 06:05:56,665 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=613316.6666666666, ans=0.125 2023-10-11 06:06:12,802 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=613363.3333333334, ans=0.1 2023-10-11 06:06:16,006 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=613410.0, ans=0.1 2023-10-11 06:06:18,180 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=613410.0, ans=0.0 2023-10-11 06:06:22,962 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=613410.0, ans=0.125 2023-10-11 06:06:41,765 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=613503.3333333334, ans=0.0 2023-10-11 06:06:47,675 INFO [scaling.py:979] (0/4) 
Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.80 vs. limit=15.0 2023-10-11 06:06:49,577 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.06 vs. limit=22.5 2023-10-11 06:06:56,894 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=613550.0, ans=0.125 2023-10-11 06:07:10,983 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.66 vs. limit=15.0 2023-10-11 06:07:28,803 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=613643.3333333334, ans=0.125 2023-10-11 06:07:36,937 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=613690.0, ans=0.0 2023-10-11 06:07:59,395 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=613783.3333333334, ans=0.125 2023-10-11 06:08:00,067 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.374e+02 1.767e+02 1.981e+02 2.298e+02 3.299e+02, threshold=3.963e+02, percent-clipped=0.0 2023-10-11 06:08:02,135 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=613783.3333333334, ans=0.125 2023-10-11 06:08:04,115 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=613783.3333333334, ans=0.2 2023-10-11 06:08:06,512 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=613783.3333333334, ans=0.0 2023-10-11 06:08:17,093 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=613830.0, ans=0.0 2023-10-11 06:08:27,734 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=613876.6666666666, ans=0.05 2023-10-11 06:08:28,838 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=613876.6666666666, ans=0.2 2023-10-11 06:08:33,044 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=613923.3333333334, ans=0.5 2023-10-11 06:08:34,138 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=613923.3333333334, ans=0.125 2023-10-11 06:09:01,807 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=614016.6666666666, ans=0.0 2023-10-11 06:09:06,993 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=614016.6666666666, ans=0.0 2023-10-11 06:09:11,854 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.47 vs. 
limit=22.5 2023-10-11 06:09:12,541 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 06:09:15,277 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=614063.3333333334, ans=0.125 2023-10-11 06:09:21,718 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=614063.3333333334, ans=0.125 2023-10-11 06:09:42,372 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=614156.6666666666, ans=0.125 2023-10-11 06:09:44,376 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=614156.6666666666, ans=0.125 2023-10-11 06:09:47,449 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=614203.3333333334, ans=0.0 2023-10-11 06:09:55,186 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.67 vs. limit=6.0 2023-10-11 06:10:01,902 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.291e+02 1.620e+02 1.804e+02 2.266e+02 2.962e+02, threshold=3.607e+02, percent-clipped=0.0 2023-10-11 06:10:04,904 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=614250.0, ans=0.0 2023-10-11 06:10:07,376 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=614250.0, ans=0.125 2023-10-11 06:10:08,232 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=614250.0, ans=0.0 2023-10-11 06:10:20,734 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.32 vs. limit=15.0 2023-10-11 06:10:22,199 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=614343.3333333334, ans=0.125 2023-10-11 06:10:38,998 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=614390.0, ans=0.125 2023-10-11 06:10:43,764 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=614390.0, ans=0.2 2023-10-11 06:11:02,713 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=614483.3333333334, ans=0.0 2023-10-11 06:11:04,513 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=614483.3333333334, ans=0.1 2023-10-11 06:11:04,565 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=614483.3333333334, ans=0.2 2023-10-11 06:11:06,604 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.89 vs. 
limit=15.0 2023-10-11 06:11:11,714 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=614530.0, ans=0.125 2023-10-11 06:11:48,944 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=614670.0, ans=0.0 2023-10-11 06:11:59,558 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.331e+02 1.593e+02 1.785e+02 2.016e+02 2.752e+02, threshold=3.570e+02, percent-clipped=0.0 2023-10-11 06:11:59,823 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=614716.6666666666, ans=0.125 2023-10-11 06:12:07,180 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.64 vs. limit=12.0 2023-10-11 06:12:12,314 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=614763.3333333334, ans=0.1 2023-10-11 06:12:34,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=614856.6666666666, ans=0.125 2023-10-11 06:12:46,459 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=614903.3333333334, ans=0.0 2023-10-11 06:12:51,234 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=614950.0, ans=0.125 2023-10-11 06:12:56,435 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.69 vs. limit=6.0 2023-10-11 06:13:00,098 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.571e-03 2023-10-11 06:13:03,613 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=614996.6666666666, ans=0.0 2023-10-11 06:13:04,920 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.96 vs. limit=10.0 2023-10-11 06:13:24,150 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=615043.3333333334, ans=0.0 2023-10-11 06:13:29,212 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.94 vs. limit=10.0 2023-10-11 06:13:41,031 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 06:13:41,077 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=615136.6666666666, ans=0.07 2023-10-11 06:13:49,430 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.342e+02 1.677e+02 1.899e+02 2.179e+02 3.053e+02, threshold=3.799e+02, percent-clipped=0.0 2023-10-11 06:13:55,219 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=615183.3333333334, ans=0.95 2023-10-11 06:14:03,214 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.24 vs. 
limit=22.5 2023-10-11 06:14:04,992 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=615230.0, ans=0.0 2023-10-11 06:14:06,757 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=615230.0, ans=0.125 2023-10-11 06:14:19,071 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.34 vs. limit=6.0 2023-10-11 06:14:20,776 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=615323.3333333334, ans=0.125 2023-10-11 06:14:25,170 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=615323.3333333334, ans=0.0 2023-10-11 06:14:43,857 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=615416.6666666666, ans=0.125 2023-10-11 06:14:54,943 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.56 vs. limit=6.0 2023-10-11 06:15:05,571 INFO [train.py:1031] (0/4) Epoch 10, batch 9000, loss[loss=0.1996, simple_loss=0.2937, pruned_loss=0.05271, over 16840.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.294, pruned_loss=0.05888, over 32443963.35 frames. ], batch size: 72, lr: 3.56e-03, grad_scale: 32.0 2023-10-11 06:15:22,846 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=615556.6666666666, ans=0.125 2023-10-11 06:15:29,873 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=615603.3333333334, ans=0.5 2023-10-11 06:15:40,260 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=615650.0, ans=0.0 2023-10-11 06:15:40,781 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 1.700e+02 1.912e+02 2.198e+02 4.599e+02, threshold=3.824e+02, percent-clipped=1.0 2023-10-11 06:16:01,502 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=615743.3333333334, ans=0.125 2023-10-11 06:16:08,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=615743.3333333334, ans=0.125 2023-10-11 06:16:08,764 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.77 vs. 
limit=12.0 2023-10-11 06:16:09,335 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=615743.3333333334, ans=0.125 2023-10-11 06:16:11,194 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=615790.0, ans=0.125 2023-10-11 06:16:21,217 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=615836.6666666666, ans=0.04949747468305833 2023-10-11 06:16:46,817 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=615930.0, ans=0.0 2023-10-11 06:16:59,778 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.27 vs. limit=15.0 2023-10-11 06:17:01,553 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=615976.6666666666, ans=0.125 2023-10-11 06:17:27,045 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=616116.6666666666, ans=0.0 2023-10-11 06:17:29,213 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.663e+02 1.884e+02 2.085e+02 2.906e+02, threshold=3.768e+02, percent-clipped=0.0 2023-10-11 06:17:30,482 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=616116.6666666666, ans=0.2 2023-10-11 06:17:44,367 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=616163.3333333334, ans=0.0 2023-10-11 06:17:45,263 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=616163.3333333334, ans=0.1 2023-10-11 06:17:54,673 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=616210.0, ans=0.125 2023-10-11 06:17:55,725 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=616210.0, ans=0.0 2023-10-11 06:18:05,955 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=616256.6666666666, ans=0.0 2023-10-11 06:18:35,914 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=616396.6666666666, ans=0.0 2023-10-11 06:19:04,522 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=616536.6666666666, ans=0.07 2023-10-11 06:19:13,788 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=616583.3333333334, ans=0.125 2023-10-11 06:19:16,568 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.705e+02 1.934e+02 2.142e+02 3.613e+02, threshold=3.869e+02, percent-clipped=0.0 2023-10-11 06:19:43,925 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=616676.6666666666, ans=0.2 2023-10-11 06:19:55,681 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=616770.0, ans=0.04949747468305833 2023-10-11 06:20:07,507 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, 
num_groups=4, num_channels=128, metric=3.88 vs. limit=6.0 2023-10-11 06:20:11,767 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.81 vs. limit=5.0 2023-10-11 06:20:22,447 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.91 vs. limit=10.0 2023-10-11 06:20:26,854 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=616863.3333333334, ans=0.125 2023-10-11 06:20:36,170 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.59 vs. limit=15.0 2023-10-11 06:20:52,621 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.52 vs. limit=15.0 2023-10-11 06:21:06,606 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.374e+02 1.712e+02 1.920e+02 2.200e+02 3.268e+02, threshold=3.841e+02, percent-clipped=0.0 2023-10-11 06:21:31,439 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=617143.3333333334, ans=0.125 2023-10-11 06:21:32,587 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=617143.3333333334, ans=0.0 2023-10-11 06:21:41,832 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=617190.0, ans=0.0 2023-10-11 06:21:54,314 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.13 vs. limit=15.0 2023-10-11 06:22:40,172 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=617376.6666666666, ans=0.125 2023-10-11 06:22:51,329 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=617423.3333333334, ans=0.125 2023-10-11 06:22:57,097 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.23 vs. limit=15.0 2023-10-11 06:23:00,871 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=617470.0, ans=0.09899494936611666 2023-10-11 06:23:10,562 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.684e+02 1.913e+02 2.210e+02 3.109e+02, threshold=3.826e+02, percent-clipped=0.0 2023-10-11 06:24:00,703 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0 2023-10-11 06:24:06,642 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=617703.3333333334, ans=0.0 2023-10-11 06:24:13,083 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=617750.0, ans=15.0 2023-10-11 06:24:32,684 INFO [train.py:1031] (0/4) Epoch 10, batch 9500, loss[loss=0.1894, simple_loss=0.2827, pruned_loss=0.04802, over 16902.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.2946, pruned_loss=0.05914, over 32517890.99 frames. 
], batch size: 72, lr: 3.56e-03, grad_scale: 32.0 2023-10-11 06:24:33,941 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 06:24:41,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=617843.3333333334, ans=0.0 2023-10-11 06:24:45,724 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=617890.0, ans=0.1 2023-10-11 06:24:47,561 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=617890.0, ans=0.1 2023-10-11 06:25:05,621 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=617983.3333333334, ans=0.125 2023-10-11 06:25:08,689 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.716e+02 1.931e+02 2.238e+02 3.502e+02, threshold=3.862e+02, percent-clipped=0.0 2023-10-11 06:25:09,353 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.91 vs. limit=22.5 2023-10-11 06:25:09,935 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=617983.3333333334, ans=0.125 2023-10-11 06:25:22,987 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=618030.0, ans=0.125 2023-10-11 06:25:45,971 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=618123.3333333334, ans=0.125 2023-10-11 06:25:59,115 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=618170.0, ans=0.125 2023-10-11 06:26:00,071 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=618170.0, ans=0.1 2023-10-11 06:26:03,017 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=618216.6666666666, ans=0.0 2023-10-11 06:26:19,527 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=618263.3333333334, ans=0.07 2023-10-11 06:26:44,857 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=618356.6666666666, ans=0.2 2023-10-11 06:26:49,138 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=618356.6666666666, ans=0.125 2023-10-11 06:27:03,410 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.745e+02 2.009e+02 2.244e+02 2.944e+02, threshold=4.018e+02, percent-clipped=0.0 2023-10-11 06:27:03,668 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=618450.0, ans=0.1 2023-10-11 06:27:19,546 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=618496.6666666666, ans=0.2 2023-10-11 06:27:46,034 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=618590.0, ans=0.035 2023-10-11 06:28:02,801 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=618683.3333333334, ans=0.125 2023-10-11 06:28:23,173 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=618776.6666666666, ans=0.125 2023-10-11 06:28:25,379 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=618776.6666666666, ans=0.125 2023-10-11 06:28:25,442 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=618776.6666666666, ans=0.125 2023-10-11 06:28:34,099 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=618823.3333333334, ans=0.125 2023-10-11 06:28:35,141 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=618823.3333333334, ans=0.09899494936611666 2023-10-11 06:28:43,236 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.13 vs. limit=22.5 2023-10-11 06:28:56,271 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.314e+02 1.705e+02 1.908e+02 2.181e+02 3.387e+02, threshold=3.817e+02, percent-clipped=0.0 2023-10-11 06:29:01,899 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=618916.6666666666, ans=0.0 2023-10-11 06:29:11,046 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=618963.3333333334, ans=0.0 2023-10-11 06:29:14,854 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.60 vs. limit=15.0 2023-10-11 06:29:22,882 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=619010.0, ans=10.0 2023-10-11 06:29:23,049 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=619010.0, ans=0.125 2023-10-11 06:29:27,094 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=619056.6666666666, ans=0.125 2023-10-11 06:29:27,231 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.05 vs. limit=15.0 2023-10-11 06:29:53,758 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=619150.0, ans=0.125 2023-10-11 06:29:56,401 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=619150.0, ans=0.125 2023-10-11 06:30:42,735 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.65 vs. 
limit=15.0 2023-10-11 06:30:50,872 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.270e+02 1.622e+02 1.773e+02 1.984e+02 2.645e+02, threshold=3.547e+02, percent-clipped=0.0 2023-10-11 06:30:54,966 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=619383.3333333334, ans=0.0 2023-10-11 06:30:58,685 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.70 vs. limit=6.0 2023-10-11 06:30:59,376 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 06:31:07,288 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.14 vs. limit=15.0 2023-10-11 06:31:16,965 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=619476.6666666666, ans=0.125 2023-10-11 06:31:37,382 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=619570.0, ans=0.125 2023-10-11 06:32:18,709 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=619710.0, ans=0.2 2023-10-11 06:32:32,877 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=619803.3333333334, ans=0.125 2023-10-11 06:32:44,363 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.379e+02 1.704e+02 1.851e+02 2.093e+02 2.869e+02, threshold=3.702e+02, percent-clipped=0.0 2023-10-11 06:32:59,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=619896.6666666666, ans=0.125 2023-10-11 06:33:06,166 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.61 vs. limit=6.0 2023-10-11 06:33:21,658 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=619990.0, ans=0.125 2023-10-11 06:33:30,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=620036.6666666666, ans=0.125 2023-10-11 06:33:38,811 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=620083.3333333334, ans=0.125 2023-10-11 06:33:41,643 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=620083.3333333334, ans=0.125 2023-10-11 06:33:47,884 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=620130.0, ans=0.125 2023-10-11 06:33:56,657 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=620130.0, ans=0.125 2023-10-11 06:33:56,764 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=620130.0, ans=0.125 2023-10-11 06:33:58,153 INFO [train.py:1031] (0/4) Epoch 10, batch 10000, loss[loss=0.2067, simple_loss=0.2934, pruned_loss=0.05998, over 16927.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2936, pruned_loss=0.05875, over 32555850.55 frames. 
], batch size: 138, lr: 3.55e-03, grad_scale: 32.0 2023-10-11 06:34:00,192 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=620176.6666666666, ans=0.2 2023-10-11 06:34:05,515 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=620176.6666666666, ans=0.0 2023-10-11 06:34:18,577 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=620223.3333333334, ans=0.1 2023-10-11 06:34:21,278 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=620270.0, ans=0.125 2023-10-11 06:34:23,777 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=620270.0, ans=0.125 2023-10-11 06:34:29,392 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=620270.0, ans=0.125 2023-10-11 06:34:32,889 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.380e+02 1.637e+02 1.878e+02 2.151e+02 3.015e+02, threshold=3.756e+02, percent-clipped=0.0 2023-10-11 06:34:37,256 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=620316.6666666666, ans=0.0 2023-10-11 06:34:38,680 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=620316.6666666666, ans=0.125 2023-10-11 06:34:38,735 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=620316.6666666666, ans=0.2 2023-10-11 06:35:37,745 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=620550.0, ans=0.0 2023-10-11 06:36:15,632 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.75 vs. 
limit=22.5 2023-10-11 06:36:31,824 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.739e+02 1.927e+02 2.255e+02 3.167e+02, threshold=3.854e+02, percent-clipped=0.0 2023-10-11 06:36:34,472 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=620783.3333333334, ans=0.125 2023-10-11 06:36:57,800 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=620876.6666666666, ans=0.125 2023-10-11 06:37:06,654 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=620923.3333333334, ans=0.125 2023-10-11 06:37:12,434 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=620970.0, ans=0.2 2023-10-11 06:37:34,367 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=621063.3333333334, ans=0.125 2023-10-11 06:37:47,097 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=621110.0, ans=0.1 2023-10-11 06:38:18,450 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=621203.3333333334, ans=0.1 2023-10-11 06:38:18,640 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=7.915e-03 2023-10-11 06:38:27,213 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.336e+02 1.640e+02 1.811e+02 2.026e+02 2.639e+02, threshold=3.623e+02, percent-clipped=0.0 2023-10-11 06:38:31,016 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=621250.0, ans=0.125 2023-10-11 06:38:31,209 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=621250.0, ans=0.125 2023-10-11 06:38:35,731 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=621296.6666666666, ans=0.125 2023-10-11 06:38:39,795 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=621296.6666666666, ans=0.125 2023-10-11 06:38:43,258 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=621296.6666666666, ans=0.125 2023-10-11 06:39:34,444 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=621530.0, ans=10.0 2023-10-11 06:39:44,610 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.01 vs. limit=15.0 2023-10-11 06:40:06,571 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=621623.3333333334, ans=0.125 2023-10-11 06:40:10,593 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=621670.0, ans=0.125 2023-10-11 06:40:22,630 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.27 vs. 
limit=12.0 2023-10-11 06:40:24,034 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.310e+02 1.676e+02 1.865e+02 2.145e+02 3.086e+02, threshold=3.731e+02, percent-clipped=0.0 2023-10-11 06:40:27,902 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=621716.6666666666, ans=0.125 2023-10-11 06:40:29,808 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=621716.6666666666, ans=0.0 2023-10-11 06:40:42,622 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=621763.3333333334, ans=0.0 2023-10-11 06:41:15,018 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=621903.3333333334, ans=0.1 2023-10-11 06:41:16,191 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 06:41:19,843 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=621903.3333333334, ans=0.0 2023-10-11 06:41:21,675 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=621950.0, ans=0.125 2023-10-11 06:41:55,280 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=622043.3333333334, ans=0.125 2023-10-11 06:41:55,423 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=622043.3333333334, ans=0.0 2023-10-11 06:42:22,122 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.376e+02 1.744e+02 1.959e+02 2.147e+02 3.117e+02, threshold=3.918e+02, percent-clipped=0.0 2023-10-11 06:42:25,950 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=622183.3333333334, ans=0.125 2023-10-11 06:42:28,722 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=622183.3333333334, ans=0.2 2023-10-11 06:42:49,929 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.78 vs. limit=10.0 2023-10-11 06:42:54,809 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.28 vs. limit=10.0 2023-10-11 06:43:05,933 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.28 vs. limit=15.0 2023-10-11 06:43:22,006 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=622416.6666666666, ans=0.125 2023-10-11 06:43:40,676 INFO [train.py:1031] (0/4) Epoch 10, batch 10500, loss[loss=0.202, simple_loss=0.2914, pruned_loss=0.05633, over 16496.00 frames. ], tot_loss[loss=0.206, simple_loss=0.2942, pruned_loss=0.05892, over 32613175.32 frames. ], batch size: 266, lr: 3.54e-03, grad_scale: 32.0 2023-10-11 06:43:43,244 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.72 vs. 
limit=8.0 2023-10-11 06:43:55,043 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=622556.6666666666, ans=0.1 2023-10-11 06:44:13,909 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.658e+02 1.819e+02 2.084e+02 2.855e+02, threshold=3.638e+02, percent-clipped=0.0 2023-10-11 06:44:24,102 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.95 vs. limit=15.0 2023-10-11 06:44:30,387 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=622696.6666666666, ans=0.0 2023-10-11 06:45:10,812 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=622836.6666666666, ans=0.1 2023-10-11 06:45:16,499 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=622883.3333333334, ans=0.125 2023-10-11 06:45:47,121 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=622976.6666666666, ans=0.1 2023-10-11 06:46:09,223 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=623070.0, ans=0.07 2023-10-11 06:46:11,269 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.68 vs. limit=22.5 2023-10-11 06:46:16,005 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=623116.6666666666, ans=0.2 2023-10-11 06:46:16,017 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=623116.6666666666, ans=0.0 2023-10-11 06:46:16,973 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.645e+02 1.826e+02 2.026e+02 2.781e+02, threshold=3.652e+02, percent-clipped=0.0 2023-10-11 06:46:31,773 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=623163.3333333334, ans=0.2 2023-10-11 06:46:45,235 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=623210.0, ans=0.125 2023-10-11 06:46:49,780 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=623256.6666666666, ans=0.1 2023-10-11 06:47:32,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=623396.6666666666, ans=0.125 2023-10-11 06:48:01,473 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=623536.6666666666, ans=0.125 2023-10-11 06:48:03,842 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=623536.6666666666, ans=0.125 2023-10-11 06:48:09,863 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=623583.3333333334, ans=0.125 2023-10-11 06:48:09,944 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=623583.3333333334, ans=0.125 2023-10-11 06:48:13,149 INFO [optim.py:471] (0/4) 
Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.685e+02 1.855e+02 2.215e+02 3.002e+02, threshold=3.709e+02, percent-clipped=0.0 2023-10-11 06:48:18,932 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=623583.3333333334, ans=0.125 2023-10-11 06:48:35,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=623676.6666666666, ans=0.2 2023-10-11 06:48:38,687 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 06:49:13,373 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=623816.6666666666, ans=0.015 2023-10-11 06:49:22,075 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=623863.3333333334, ans=0.125 2023-10-11 06:49:31,360 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=623910.0, ans=0.0 2023-10-11 06:49:40,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=623956.6666666666, ans=0.125 2023-10-11 06:49:44,871 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=623956.6666666666, ans=0.125 2023-10-11 06:49:51,690 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.37 vs. limit=10.0 2023-10-11 06:50:03,666 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=624050.0, ans=0.2 2023-10-11 06:50:07,141 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.537e+02 1.826e+02 2.002e+02 2.400e+02 3.195e+02, threshold=4.005e+02, percent-clipped=0.0 2023-10-11 06:50:19,317 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.31 vs. 
limit=15.0 2023-10-11 06:50:36,122 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=624190.0, ans=0.07 2023-10-11 06:51:00,224 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=624283.3333333334, ans=0.0 2023-10-11 06:51:03,256 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=624283.3333333334, ans=0.125 2023-10-11 06:51:11,695 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=624330.0, ans=0.2 2023-10-11 06:51:14,494 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=624330.0, ans=0.125 2023-10-11 06:51:33,008 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=624376.6666666666, ans=0.125 2023-10-11 06:52:01,231 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.320e+02 1.574e+02 1.743e+02 1.911e+02 2.805e+02, threshold=3.487e+02, percent-clipped=0.0 2023-10-11 06:52:19,600 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=624610.0, ans=0.0 2023-10-11 06:52:54,345 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 06:53:08,652 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=624796.6666666666, ans=0.1 2023-10-11 06:53:17,806 INFO [train.py:1031] (0/4) Epoch 10, batch 11000, loss[loss=0.2081, simple_loss=0.3081, pruned_loss=0.05406, over 16822.00 frames. ], tot_loss[loss=0.206, simple_loss=0.2942, pruned_loss=0.05888, over 32651403.02 frames. ], batch size: 188, lr: 3.54e-03, grad_scale: 32.0 2023-10-11 06:53:18,274 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.08 vs. limit=12.0 2023-10-11 06:53:19,938 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=624843.3333333334, ans=0.0 2023-10-11 06:53:22,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=624843.3333333334, ans=0.1 2023-10-11 06:53:32,425 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=624890.0, ans=0.125 2023-10-11 06:53:37,184 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=624936.6666666666, ans=0.125 2023-10-11 06:53:48,667 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=624983.3333333334, ans=0.125 2023-10-11 06:53:52,231 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.357e+02 1.789e+02 2.013e+02 2.272e+02 3.229e+02, threshold=4.026e+02, percent-clipped=0.0 2023-10-11 06:54:00,223 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.71 vs. 
limit=6.0 2023-10-11 06:54:10,197 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=625030.0, ans=0.0 2023-10-11 06:54:20,729 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=625076.6666666666, ans=0.1 2023-10-11 06:54:27,423 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=625123.3333333334, ans=0.125 2023-10-11 06:54:29,815 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=625123.3333333334, ans=0.125 2023-10-11 06:54:41,397 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=625170.0, ans=0.0 2023-10-11 06:54:42,631 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.46 vs. limit=10.0 2023-10-11 06:54:49,061 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=625216.6666666666, ans=0.125 2023-10-11 06:55:05,029 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=625263.3333333334, ans=0.125 2023-10-11 06:55:19,976 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.76 vs. limit=15.0 2023-10-11 06:55:35,363 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=625356.6666666666, ans=0.0 2023-10-11 06:55:42,118 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.15 vs. limit=15.0 2023-10-11 06:55:44,571 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=625403.3333333334, ans=0.125 2023-10-11 06:55:48,820 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=625450.0, ans=0.0 2023-10-11 06:55:52,937 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.331e+02 1.596e+02 1.788e+02 1.923e+02 2.481e+02, threshold=3.575e+02, percent-clipped=0.0 2023-10-11 06:56:20,722 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.02 vs. limit=6.0 2023-10-11 06:56:20,844 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.86 vs. 
limit=22.5 2023-10-11 06:56:29,722 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=625590.0, ans=0.0 2023-10-11 06:56:48,205 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=625683.3333333334, ans=0.1 2023-10-11 06:56:55,334 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=625683.3333333334, ans=0.0 2023-10-11 06:56:55,414 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=625683.3333333334, ans=0.0 2023-10-11 06:57:05,755 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. limit=6.0 2023-10-11 06:57:08,074 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=625730.0, ans=0.125 2023-10-11 06:57:10,088 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=625776.6666666666, ans=0.125 2023-10-11 06:57:15,971 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=625776.6666666666, ans=0.125 2023-10-11 06:57:34,022 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=625870.0, ans=0.1 2023-10-11 06:57:45,576 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.656e+02 1.855e+02 2.127e+02 2.897e+02, threshold=3.710e+02, percent-clipped=0.0 2023-10-11 06:57:48,486 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.69 vs. limit=15.0 2023-10-11 06:57:49,747 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=625916.6666666666, ans=0.125 2023-10-11 06:58:07,797 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=626010.0, ans=0.1 2023-10-11 06:58:33,355 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=626103.3333333334, ans=0.125 2023-10-11 06:58:37,182 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 06:58:51,500 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=626150.0, ans=0.125 2023-10-11 06:59:00,893 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.85 vs. limit=6.0 2023-10-11 06:59:11,041 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.73 vs. 
limit=15.0 2023-10-11 06:59:13,138 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=626243.3333333334, ans=0.0 2023-10-11 06:59:18,026 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=626290.0, ans=0.125 2023-10-11 06:59:18,097 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=626290.0, ans=0.1 2023-10-11 06:59:29,400 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=626336.6666666666, ans=0.0 2023-10-11 06:59:39,183 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=626383.3333333334, ans=0.125 2023-10-11 06:59:43,143 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.350e+02 1.635e+02 1.829e+02 2.119e+02 3.020e+02, threshold=3.659e+02, percent-clipped=0.0 2023-10-11 06:59:50,949 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=626430.0, ans=0.125 2023-10-11 06:59:51,247 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.42 vs. limit=12.0 2023-10-11 06:59:57,676 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=626430.0, ans=0.125 2023-10-11 06:59:59,170 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.07 vs. limit=15.0 2023-10-11 07:00:18,666 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=626523.3333333334, ans=0.125 2023-10-11 07:00:30,444 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.75 vs. limit=15.0 2023-10-11 07:00:31,738 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=626570.0, ans=0.125 2023-10-11 07:00:31,746 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=626570.0, ans=0.125 2023-10-11 07:00:48,386 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=626663.3333333334, ans=0.1 2023-10-11 07:00:58,392 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.88 vs. 
limit=6.0 2023-10-11 07:01:00,023 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=626710.0, ans=0.125 2023-10-11 07:01:07,246 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=626710.0, ans=0.125 2023-10-11 07:01:08,970 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=626756.6666666666, ans=0.0 2023-10-11 07:01:36,973 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.727e+02 1.987e+02 2.208e+02 2.705e+02, threshold=3.974e+02, percent-clipped=0.0 2023-10-11 07:01:37,913 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.09 vs. limit=22.5 2023-10-11 07:01:39,951 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.88 vs. limit=12.0 2023-10-11 07:01:54,055 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.28 vs. limit=15.0 2023-10-11 07:02:00,373 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.01 vs. limit=15.0 2023-10-11 07:02:09,475 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=626990.0, ans=0.07 2023-10-11 07:02:42,527 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=627083.3333333334, ans=0.2 2023-10-11 07:02:48,377 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.68 vs. limit=15.0 2023-10-11 07:02:53,935 INFO [train.py:1031] (0/4) Epoch 10, batch 11500, loss[loss=0.2022, simple_loss=0.2958, pruned_loss=0.0543, over 16831.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2936, pruned_loss=0.05871, over 32658795.25 frames. ], batch size: 116, lr: 3.53e-03, grad_scale: 32.0 2023-10-11 07:03:09,420 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=627223.3333333334, ans=10.0 2023-10-11 07:03:10,384 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=627223.3333333334, ans=0.125 2023-10-11 07:03:13,925 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=627223.3333333334, ans=0.125 2023-10-11 07:03:22,136 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=627270.0, ans=0.125 2023-10-11 07:03:27,580 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=627316.6666666666, ans=0.125 2023-10-11 07:03:30,360 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.441e+02 1.740e+02 1.876e+02 2.059e+02 2.578e+02, threshold=3.752e+02, percent-clipped=0.0 2023-10-11 07:03:34,088 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.47 vs. 
limit=22.5 2023-10-11 07:03:43,512 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 07:04:04,171 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=627456.6666666666, ans=0.0 2023-10-11 07:04:34,348 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=627550.0, ans=0.125 2023-10-11 07:04:51,305 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=627596.6666666666, ans=0.125 2023-10-11 07:04:52,647 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=627596.6666666666, ans=0.0 2023-10-11 07:05:16,398 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=627736.6666666666, ans=0.125 2023-10-11 07:05:31,179 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.314e+02 1.655e+02 1.789e+02 1.958e+02 3.253e+02, threshold=3.578e+02, percent-clipped=0.0 2023-10-11 07:05:35,149 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=627783.3333333334, ans=0.2 2023-10-11 07:05:36,991 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=627830.0, ans=0.2 2023-10-11 07:05:38,744 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=627830.0, ans=0.125 2023-10-11 07:05:39,542 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=627830.0, ans=0.125 2023-10-11 07:05:39,628 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=627830.0, ans=0.125 2023-10-11 07:05:50,947 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=627876.6666666666, ans=0.1 2023-10-11 07:05:56,238 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=4.78 vs. limit=12.0 2023-10-11 07:06:00,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=627923.3333333334, ans=0.125 2023-10-11 07:06:20,795 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=628016.6666666666, ans=0.2 2023-10-11 07:06:28,217 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=628016.6666666666, ans=0.125 2023-10-11 07:07:02,186 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.14 vs. 
limit=6.0 2023-10-11 07:07:16,408 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.364e+02 1.721e+02 1.918e+02 2.241e+02 3.517e+02, threshold=3.835e+02, percent-clipped=0.0 2023-10-11 07:07:17,672 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=628250.0, ans=0.0 2023-10-11 07:07:23,729 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=628296.6666666666, ans=0.125 2023-10-11 07:07:45,088 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=628343.3333333334, ans=0.2 2023-10-11 07:07:47,001 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=628343.3333333334, ans=0.0 2023-10-11 07:08:08,345 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 07:08:38,275 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.40 vs. limit=15.0 2023-10-11 07:08:45,410 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.91 vs. limit=15.0 2023-10-11 07:08:52,634 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=628576.6666666666, ans=0.1 2023-10-11 07:08:54,599 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=628623.3333333334, ans=0.0 2023-10-11 07:09:23,600 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.365e+02 1.641e+02 1.826e+02 2.042e+02 3.089e+02, threshold=3.652e+02, percent-clipped=0.0 2023-10-11 07:09:42,682 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=628810.0, ans=0.0 2023-10-11 07:09:47,586 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.07 vs. limit=22.5 2023-10-11 07:09:51,590 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.max_abs, batch_count=628810.0, ans=10.0 2023-10-11 07:09:55,947 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=628856.6666666666, ans=0.125 2023-10-11 07:10:12,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=628903.3333333334, ans=0.2 2023-10-11 07:10:55,771 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.52 vs. 
limit=12.0 2023-10-11 07:10:58,405 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=629090.0, ans=0.1 2023-10-11 07:11:02,071 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=629090.0, ans=0.0 2023-10-11 07:11:22,359 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.422e+02 1.835e+02 2.048e+02 2.340e+02 3.571e+02, threshold=4.096e+02, percent-clipped=0.0 2023-10-11 07:11:25,888 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-11 07:11:27,551 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.41 vs. limit=6.0 2023-10-11 07:12:01,306 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.10 vs. limit=6.0 2023-10-11 07:12:23,067 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=629416.6666666666, ans=0.0 2023-10-11 07:12:32,773 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=629463.3333333334, ans=0.5 2023-10-11 07:12:38,949 INFO [train.py:1031] (0/4) Epoch 10, batch 12000, loss[loss=0.1918, simple_loss=0.2903, pruned_loss=0.04663, over 16895.00 frames. ], tot_loss[loss=0.2053, simple_loss=0.2937, pruned_loss=0.05848, over 32707633.46 frames. ], batch size: 93, lr: 3.52e-03, grad_scale: 32.0 2023-10-11 07:12:55,215 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=629556.6666666666, ans=0.125 2023-10-11 07:13:01,934 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=629603.3333333334, ans=0.0 2023-10-11 07:13:17,498 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.383e+02 1.727e+02 1.897e+02 2.112e+02 3.010e+02, threshold=3.795e+02, percent-clipped=0.0 2023-10-11 07:13:29,074 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=629696.6666666666, ans=0.2 2023-10-11 07:13:35,606 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=629743.3333333334, ans=0.2 2023-10-11 07:13:50,130 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=629790.0, ans=0.125 2023-10-11 07:14:02,619 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=629836.6666666666, ans=0.0 2023-10-11 07:14:09,239 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=629836.6666666666, ans=0.09899494936611666 2023-10-11 07:14:12,447 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=629883.3333333334, ans=0.125 2023-10-11 07:14:20,933 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=629883.3333333334, ans=0.125 2023-10-11 07:14:28,988 INFO [scaling.py:979] (0/4) Whitening: 
name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.00 vs. limit=12.0 2023-10-11 07:14:29,165 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.50 vs. limit=15.0 2023-10-11 07:14:33,930 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=629976.6666666666, ans=0.1 2023-10-11 07:14:41,685 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=629976.6666666666, ans=0.125 2023-10-11 07:14:50,306 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=630023.3333333334, ans=0.0 2023-10-11 07:14:50,345 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=630023.3333333334, ans=0.2 2023-10-11 07:15:01,578 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=630070.0, ans=0.1 2023-10-11 07:15:08,499 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=630116.6666666666, ans=0.0 2023-10-11 07:15:09,023 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.326e+02 1.632e+02 1.789e+02 2.011e+02 3.021e+02, threshold=3.578e+02, percent-clipped=0.0 2023-10-11 07:15:09,384 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=630116.6666666666, ans=0.2 2023-10-11 07:15:16,994 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=630163.3333333334, ans=0.125 2023-10-11 07:15:19,548 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=630163.3333333334, ans=0.125 2023-10-11 07:15:22,858 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=630163.3333333334, ans=0.04949747468305833 2023-10-11 07:15:30,338 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=630210.0, ans=0.1 2023-10-11 07:15:46,065 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.12 vs. limit=15.0 2023-10-11 07:15:49,861 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=630303.3333333334, ans=0.125 2023-10-11 07:15:50,018 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.25 vs. 
limit=15.0 2023-10-11 07:15:52,824 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=630303.3333333334, ans=0.1 2023-10-11 07:16:00,005 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=630350.0, ans=0.07 2023-10-11 07:16:18,159 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=630396.6666666666, ans=0.2 2023-10-11 07:16:49,618 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.89 vs. limit=10.0 2023-10-11 07:16:59,492 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.705e+02 1.896e+02 2.246e+02 4.002e+02, threshold=3.792e+02, percent-clipped=2.0 2023-10-11 07:17:24,439 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=630676.6666666666, ans=0.125 2023-10-11 07:17:30,194 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=630723.3333333334, ans=0.0 2023-10-11 07:17:33,490 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=630723.3333333334, ans=0.2 2023-10-11 07:17:40,227 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.76 vs. limit=15.0 2023-10-11 07:18:01,531 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=630863.3333333334, ans=0.125 2023-10-11 07:18:20,604 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=630910.0, ans=0.0 2023-10-11 07:18:34,327 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=631003.3333333334, ans=0.2 2023-10-11 07:18:35,548 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=631003.3333333334, ans=0.125 2023-10-11 07:18:35,961 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.12 vs. limit=22.5 2023-10-11 07:18:46,510 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=631050.0, ans=0.07 2023-10-11 07:18:52,369 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.501e+02 1.716e+02 1.875e+02 2.192e+02 2.863e+02, threshold=3.749e+02, percent-clipped=0.0 2023-10-11 07:19:01,471 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=631096.6666666666, ans=0.04949747468305833 2023-10-11 07:19:04,897 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.68 vs. 
limit=6.0 2023-10-11 07:19:11,966 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=631143.3333333334, ans=0.125 2023-10-11 07:19:11,968 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=631143.3333333334, ans=0.07 2023-10-11 07:19:28,404 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=631190.0, ans=0.125 2023-10-11 07:19:45,929 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.81 vs. limit=15.0 2023-10-11 07:19:58,916 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.13 vs. limit=15.0 2023-10-11 07:20:01,586 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=631330.0, ans=0.1 2023-10-11 07:20:12,353 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=631376.6666666666, ans=0.125 2023-10-11 07:20:24,512 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=631423.3333333334, ans=0.125 2023-10-11 07:20:26,522 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=631423.3333333334, ans=0.0 2023-10-11 07:20:46,983 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.371e+02 1.681e+02 1.875e+02 2.101e+02 3.163e+02, threshold=3.750e+02, percent-clipped=0.0 2023-10-11 07:20:47,712 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=631516.6666666666, ans=0.2 2023-10-11 07:21:12,616 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=631610.0, ans=0.0 2023-10-11 07:21:16,164 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=631610.0, ans=0.125 2023-10-11 07:21:21,771 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=631656.6666666666, ans=0.1 2023-10-11 07:21:28,991 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=631703.3333333334, ans=0.125 2023-10-11 07:21:30,513 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.59 vs. limit=22.5 2023-10-11 07:21:45,071 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.90 vs. limit=15.0 2023-10-11 07:21:48,767 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.99 vs. 
limit=15.0 2023-10-11 07:21:55,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=631796.6666666666, ans=0.0 2023-10-11 07:21:56,629 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=631796.6666666666, ans=0.0 2023-10-11 07:22:04,500 INFO [train.py:1031] (0/4) Epoch 10, batch 12500, loss[loss=0.1927, simple_loss=0.2889, pruned_loss=0.04825, over 15423.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.2934, pruned_loss=0.05839, over 32726261.94 frames. ], batch size: 35, lr: 3.52e-03, grad_scale: 32.0 2023-10-11 07:22:14,617 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=631843.3333333334, ans=0.125 2023-10-11 07:22:42,037 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=631983.3333333334, ans=0.5 2023-10-11 07:22:42,783 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.669e+02 1.891e+02 2.113e+02 2.917e+02, threshold=3.782e+02, percent-clipped=0.0 2023-10-11 07:22:53,564 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=632030.0, ans=0.125 2023-10-11 07:22:56,904 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 07:23:09,340 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.56 vs. limit=22.5 2023-10-11 07:23:17,962 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=632123.3333333334, ans=0.0 2023-10-11 07:23:22,304 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=632170.0, ans=0.125 2023-10-11 07:23:46,711 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=632263.3333333334, ans=0.125 2023-10-11 07:23:47,874 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=632263.3333333334, ans=15.0 2023-10-11 07:24:10,684 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.50 vs. limit=6.0 2023-10-11 07:24:29,302 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=632403.3333333334, ans=0.0 2023-10-11 07:24:30,336 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=632450.0, ans=0.0 2023-10-11 07:24:31,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=632450.0, ans=0.125 2023-10-11 07:24:31,603 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.36 vs. 
limit=15.0 2023-10-11 07:24:33,906 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=632450.0, ans=0.0 2023-10-11 07:24:35,681 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.302e+02 1.751e+02 1.971e+02 2.298e+02 3.391e+02, threshold=3.943e+02, percent-clipped=0.0 2023-10-11 07:24:38,547 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=632450.0, ans=0.125 2023-10-11 07:24:40,598 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=632450.0, ans=0.0 2023-10-11 07:25:08,372 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=632590.0, ans=0.125 2023-10-11 07:25:18,258 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=632636.6666666666, ans=0.2 2023-10-11 07:25:37,329 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=632683.3333333334, ans=0.0 2023-10-11 07:26:05,702 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=632823.3333333334, ans=0.0 2023-10-11 07:26:24,372 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=632916.6666666666, ans=0.1 2023-10-11 07:26:26,827 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.388e+02 1.622e+02 1.794e+02 2.072e+02 3.230e+02, threshold=3.587e+02, percent-clipped=0.0 2023-10-11 07:27:10,527 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.32 vs. limit=15.0 2023-10-11 07:27:42,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=633243.3333333334, ans=0.125 2023-10-11 07:27:53,306 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.24 vs. limit=15.0 2023-10-11 07:28:04,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=633336.6666666666, ans=0.125 2023-10-11 07:28:15,625 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.357e+02 1.731e+02 1.950e+02 2.312e+02 3.360e+02, threshold=3.900e+02, percent-clipped=0.0 2023-10-11 07:28:20,289 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=633383.3333333334, ans=0.125 2023-10-11 07:28:24,812 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=633430.0, ans=10.0 2023-10-11 07:28:33,580 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=633430.0, ans=0.035 2023-10-11 07:28:41,600 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=633476.6666666666, ans=0.1 2023-10-11 07:28:51,975 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.27 vs. 
limit=15.0 2023-10-11 07:29:20,468 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=633663.3333333334, ans=0.125 2023-10-11 07:29:36,492 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=633710.0, ans=0.0 2023-10-11 07:29:39,347 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=633710.0, ans=0.2 2023-10-11 07:29:43,591 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=633756.6666666666, ans=0.125 2023-10-11 07:29:51,590 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=633803.3333333334, ans=15.0 2023-10-11 07:30:00,006 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=633803.3333333334, ans=0.125 2023-10-11 07:30:00,779 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=633803.3333333334, ans=0.125 2023-10-11 07:30:00,933 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=633803.3333333334, ans=0.2 2023-10-11 07:30:05,591 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.52 vs. limit=15.0 2023-10-11 07:30:07,923 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 1.640e+02 1.798e+02 1.977e+02 2.764e+02, threshold=3.597e+02, percent-clipped=0.0 2023-10-11 07:30:26,297 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=633943.3333333334, ans=0.0 2023-10-11 07:30:34,484 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=633943.3333333334, ans=0.125 2023-10-11 07:30:40,198 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=633990.0, ans=0.125 2023-10-11 07:30:49,231 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=634036.6666666666, ans=0.015 2023-10-11 07:30:54,394 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=634036.6666666666, ans=0.125 2023-10-11 07:31:04,511 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=634083.3333333334, ans=0.125 2023-10-11 07:31:17,292 INFO [train.py:1031] (0/4) Epoch 10, batch 13000, loss[loss=0.2008, simple_loss=0.2908, pruned_loss=0.05545, over 17020.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2939, pruned_loss=0.05842, over 32735067.35 frames. ], batch size: 117, lr: 3.51e-03, grad_scale: 32.0 2023-10-11 07:31:17,817 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.67 vs. 
limit=22.5 2023-10-11 07:31:41,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=634223.3333333334, ans=0.125 2023-10-11 07:31:47,975 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=634270.0, ans=0.05 2023-10-11 07:32:01,128 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.375e+02 1.644e+02 1.813e+02 1.965e+02 2.537e+02, threshold=3.627e+02, percent-clipped=0.0 2023-10-11 07:32:11,679 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.61 vs. limit=15.0 2023-10-11 07:32:16,022 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=634363.3333333334, ans=0.125 2023-10-11 07:32:39,125 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=634456.6666666666, ans=0.0 2023-10-11 07:32:40,378 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=634456.6666666666, ans=0.125 2023-10-11 07:32:44,609 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.66 vs. limit=6.0 2023-10-11 07:32:46,461 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=634503.3333333334, ans=0.2 2023-10-11 07:32:58,769 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=634550.0, ans=0.2 2023-10-11 07:33:23,082 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-136000.pt 2023-10-11 07:33:28,859 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=634643.3333333334, ans=0.0 2023-10-11 07:33:56,904 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=634783.3333333334, ans=0.1 2023-10-11 07:33:58,521 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=634783.3333333334, ans=0.125 2023-10-11 07:33:58,570 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=634783.3333333334, ans=0.1 2023-10-11 07:34:00,120 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.712e+02 1.881e+02 2.185e+02 3.275e+02, threshold=3.761e+02, percent-clipped=0.0 2023-10-11 07:34:25,395 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=634876.6666666666, ans=0.1 2023-10-11 07:34:31,340 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=634923.3333333334, ans=0.2 2023-10-11 07:34:32,600 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.57 vs. limit=15.0 2023-10-11 07:34:58,850 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.32 vs. 
limit=22.5 2023-10-11 07:35:19,588 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=635110.0, ans=0.1 2023-10-11 07:35:23,404 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=635110.0, ans=0.0 2023-10-11 07:35:28,190 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=635156.6666666666, ans=0.1 2023-10-11 07:35:31,572 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=635156.6666666666, ans=0.0 2023-10-11 07:35:53,683 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.380e+02 1.718e+02 1.978e+02 2.196e+02 2.909e+02, threshold=3.955e+02, percent-clipped=0.0 2023-10-11 07:36:02,769 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=635296.6666666666, ans=0.125 2023-10-11 07:36:21,809 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.09 vs. limit=15.0 2023-10-11 07:36:37,641 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=635436.6666666666, ans=0.125 2023-10-11 07:36:43,966 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=635436.6666666666, ans=15.0 2023-10-11 07:37:02,351 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=635530.0, ans=0.0 2023-10-11 07:37:07,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=635576.6666666666, ans=0.2 2023-10-11 07:37:17,588 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=635576.6666666666, ans=0.1 2023-10-11 07:37:24,948 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=635623.3333333334, ans=0.125 2023-10-11 07:37:31,823 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.56 vs. 
limit=10.0 2023-10-11 07:37:34,205 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=635670.0, ans=0.05 2023-10-11 07:37:34,309 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=635670.0, ans=0.125 2023-10-11 07:37:46,176 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.364e+02 1.802e+02 2.080e+02 2.362e+02 3.200e+02, threshold=4.160e+02, percent-clipped=0.0 2023-10-11 07:38:01,151 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=635763.3333333334, ans=0.1 2023-10-11 07:38:15,748 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=635856.6666666666, ans=0.125 2023-10-11 07:38:16,447 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=635856.6666666666, ans=0.0 2023-10-11 07:38:24,815 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=635903.3333333334, ans=0.07 2023-10-11 07:38:36,240 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=635950.0, ans=0.125 2023-10-11 07:38:43,583 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=635950.0, ans=15.0 2023-10-11 07:39:04,674 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=636043.3333333334, ans=0.1 2023-10-11 07:39:05,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=636043.3333333334, ans=0.0 2023-10-11 07:39:12,673 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.94 vs. limit=15.0 2023-10-11 07:39:36,887 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.656e+02 1.798e+02 1.912e+02 2.514e+02, threshold=3.596e+02, percent-clipped=0.0 2023-10-11 07:40:06,388 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=636323.3333333334, ans=0.1 2023-10-11 07:40:14,688 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=636323.3333333334, ans=10.0 2023-10-11 07:40:15,677 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=636370.0, ans=0.2 2023-10-11 07:40:17,894 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=9.37 vs. limit=12.0 2023-10-11 07:40:22,858 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=636370.0, ans=0.125 2023-10-11 07:40:46,290 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=636463.3333333334, ans=0.0 2023-10-11 07:40:48,730 INFO [train.py:1031] (0/4) Epoch 10, batch 13500, loss[loss=0.2147, simple_loss=0.313, pruned_loss=0.05817, over 16939.00 frames. 
], tot_loss[loss=0.2049, simple_loss=0.2933, pruned_loss=0.05823, over 32755656.75 frames. ], batch size: 138, lr: 3.50e-03, grad_scale: 32.0 2023-10-11 07:40:57,144 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=636510.0, ans=0.1 2023-10-11 07:41:01,359 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=636556.6666666666, ans=0.015 2023-10-11 07:41:05,645 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=636556.6666666666, ans=0.0 2023-10-11 07:41:06,986 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=636556.6666666666, ans=0.0 2023-10-11 07:41:22,405 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=636650.0, ans=0.1 2023-10-11 07:41:27,344 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.393e+02 1.650e+02 1.883e+02 2.109e+02 3.516e+02, threshold=3.767e+02, percent-clipped=0.0 2023-10-11 07:41:28,727 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=636650.0, ans=0.125 2023-10-11 07:41:45,097 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=636743.3333333334, ans=0.125 2023-10-11 07:41:53,163 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=636743.3333333334, ans=0.1 2023-10-11 07:42:00,189 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=636790.0, ans=0.0 2023-10-11 07:42:14,875 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=636836.6666666666, ans=0.125 2023-10-11 07:42:14,880 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=636836.6666666666, ans=0.125 2023-10-11 07:42:48,359 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=636976.6666666666, ans=0.0 2023-10-11 07:42:55,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=637023.3333333334, ans=0.125 2023-10-11 07:42:55,362 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=637023.3333333334, ans=10.0 2023-10-11 07:42:56,164 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=637023.3333333334, ans=0.0 2023-10-11 07:43:14,086 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.726e+02 1.894e+02 2.195e+02 3.015e+02, threshold=3.788e+02, percent-clipped=0.0 2023-10-11 07:43:20,620 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=637163.3333333334, ans=0.1 2023-10-11 07:43:26,958 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=637163.3333333334, ans=0.035 2023-10-11 07:43:34,186 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/epoch-10.pt 2023-10-11 07:44:05,803 INFO 
[train.py:1031] (0/4) Epoch 11, batch 0, loss[loss=0.1727, simple_loss=0.272, pruned_loss=0.0367, over 16824.00 frames. ], tot_loss[loss=0.1727, simple_loss=0.272, pruned_loss=0.0367, over 16824.00 frames. ], batch size: 98, lr: 3.32e-03, grad_scale: 64.0 2023-10-11 07:44:05,804 INFO [train.py:1054] (0/4) Computing validation loss 2023-10-11 07:44:14,259 INFO [train.py:1063] (0/4) Epoch 11, validation: loss=0.22, simple_loss=0.3069, pruned_loss=0.06655, over 1020973.00 frames. 2023-10-11 07:44:14,260 INFO [train.py:1064] (0/4) Maximum memory allocated so far is 17165MB 2023-10-11 07:44:18,766 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=637233.3333333334, ans=0.0 2023-10-11 07:44:42,527 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.63 vs. limit=15.0 2023-10-11 07:44:47,603 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=637373.3333333334, ans=0.125 2023-10-11 07:44:48,515 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=637373.3333333334, ans=0.125 2023-10-11 07:45:04,183 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=7.98 vs. limit=22.5 2023-10-11 07:45:09,268 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-11 07:45:15,626 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=637466.6666666666, ans=0.125 2023-10-11 07:45:17,345 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=637466.6666666666, ans=0.125 2023-10-11 07:45:17,379 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=637466.6666666666, ans=0.125 2023-10-11 07:45:23,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=637513.3333333334, ans=0.125 2023-10-11 07:45:23,915 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.54 vs. limit=15.0 2023-10-11 07:45:31,235 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=637513.3333333334, ans=0.1 2023-10-11 07:45:42,851 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.29 vs. 
limit=15.0 2023-10-11 07:45:43,098 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.436e+02 1.754e+02 2.011e+02 2.352e+02 3.000e+02, threshold=4.023e+02, percent-clipped=0.0 2023-10-11 07:45:55,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=637653.3333333334, ans=0.2 2023-10-11 07:45:56,015 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=637653.3333333334, ans=0.0 2023-10-11 07:46:11,490 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=637700.0, ans=0.2 2023-10-11 07:46:20,444 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.02 vs. limit=15.0 2023-10-11 07:46:30,317 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=637793.3333333334, ans=0.125 2023-10-11 07:47:00,091 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=637886.6666666666, ans=0.1 2023-10-11 07:47:35,279 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.321e+02 1.655e+02 1.855e+02 2.101e+02 3.603e+02, threshold=3.710e+02, percent-clipped=0.0 2023-10-11 07:47:38,829 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=638073.3333333334, ans=0.0 2023-10-11 07:47:39,614 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=638073.3333333334, ans=0.0 2023-10-11 07:47:44,347 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=638073.3333333334, ans=0.0 2023-10-11 07:47:56,575 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=638166.6666666666, ans=0.125 2023-10-11 07:48:30,132 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=638306.6666666666, ans=0.0 2023-10-11 07:48:31,528 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=638306.6666666666, ans=0.125 2023-10-11 07:48:38,893 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=638353.3333333334, ans=0.125 2023-10-11 07:48:40,443 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=638353.3333333334, ans=0.125 2023-10-11 07:49:18,607 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=638493.3333333334, ans=10.0 2023-10-11 07:49:20,416 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 07:49:21,337 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=638493.3333333334, ans=0.125 2023-10-11 07:49:27,421 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.454e+02 1.748e+02 1.867e+02 2.110e+02 3.204e+02, threshold=3.735e+02, percent-clipped=0.0 2023-10-11 07:49:31,641 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=638540.0, ans=0.0 2023-10-11 07:49:46,983 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.70 vs. limit=15.0 2023-10-11 07:49:53,351 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=638633.3333333334, ans=0.1 2023-10-11 07:49:56,288 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=638633.3333333334, ans=0.125 2023-10-11 07:50:12,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=638726.6666666666, ans=0.0 2023-10-11 07:50:17,056 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=638726.6666666666, ans=0.1 2023-10-11 07:51:01,176 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=638913.3333333334, ans=0.2 2023-10-11 07:51:13,826 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.712e+02 1.931e+02 2.200e+02 2.947e+02, threshold=3.862e+02, percent-clipped=0.0 2023-10-11 07:51:15,432 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=639006.6666666666, ans=0.125 2023-10-11 07:51:23,250 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=639006.6666666666, ans=0.125 2023-10-11 07:51:37,794 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=639100.0, ans=0.0 2023-10-11 07:51:47,759 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=639100.0, ans=0.0 2023-10-11 07:51:54,178 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=639146.6666666666, ans=0.0 2023-10-11 07:51:59,284 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=639193.3333333334, ans=0.1 2023-10-11 07:52:07,664 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=639193.3333333334, ans=0.2 2023-10-11 07:52:21,839 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=639286.6666666666, ans=0.125 2023-10-11 07:52:26,765 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=639286.6666666666, ans=0.125 2023-10-11 07:52:55,045 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.02 vs. limit=15.0 2023-10-11 07:53:08,354 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.744e+02 1.959e+02 2.289e+02 3.637e+02, threshold=3.919e+02, percent-clipped=0.0 2023-10-11 07:53:21,314 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.26 vs. 
limit=15.0 2023-10-11 07:53:27,921 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=639520.0, ans=15.0 2023-10-11 07:53:31,703 INFO [train.py:1031] (0/4) Epoch 11, batch 500, loss[loss=0.1837, simple_loss=0.2805, pruned_loss=0.04343, over 16904.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.293, pruned_loss=0.05776, over 7299562.77 frames. ], batch size: 72, lr: 3.32e-03, grad_scale: 16.0 2023-10-11 07:53:36,986 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.42 vs. limit=15.0 2023-10-11 07:53:43,156 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.76 vs. limit=6.0 2023-10-11 07:53:59,680 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=639660.0, ans=0.09899494936611666 2023-10-11 07:54:02,540 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=639706.6666666666, ans=0.125 2023-10-11 07:54:25,468 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=639800.0, ans=0.0 2023-10-11 07:54:32,400 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=639800.0, ans=0.125 2023-10-11 07:54:37,430 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=639846.6666666666, ans=0.0 2023-10-11 07:54:47,703 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=639893.3333333334, ans=0.125 2023-10-11 07:54:51,130 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=639893.3333333334, ans=0.0 2023-10-11 07:54:53,473 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=639893.3333333334, ans=0.1 2023-10-11 07:54:56,443 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=639893.3333333334, ans=0.125 2023-10-11 07:54:58,946 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.761e+02 1.926e+02 2.143e+02 2.774e+02, threshold=3.852e+02, percent-clipped=0.0 2023-10-11 07:55:02,806 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.90 vs. limit=15.0 2023-10-11 07:55:04,614 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.43 vs. 
limit=10.0 2023-10-11 07:55:14,573 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=639986.6666666666, ans=0.125 2023-10-11 07:55:26,560 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=640033.3333333334, ans=0.125 2023-10-11 07:55:45,564 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=640126.6666666666, ans=0.1 2023-10-11 07:56:00,698 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=640173.3333333334, ans=0.125 2023-10-11 07:56:08,175 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.38 vs. limit=22.5 2023-10-11 07:56:18,322 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=640266.6666666666, ans=0.0 2023-10-11 07:56:18,432 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=640266.6666666666, ans=0.125 2023-10-11 07:56:31,757 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.18 vs. limit=15.0 2023-10-11 07:56:33,360 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.97 vs. limit=15.0 2023-10-11 07:56:42,301 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=640360.0, ans=0.2 2023-10-11 07:56:47,479 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.708e+02 1.873e+02 2.125e+02 3.173e+02, threshold=3.746e+02, percent-clipped=0.0 2023-10-11 07:57:05,769 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.16 vs. limit=15.0 2023-10-11 07:57:36,293 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=640593.3333333334, ans=0.09899494936611666 2023-10-11 07:57:40,493 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=640640.0, ans=0.0 2023-10-11 07:57:51,388 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=640686.6666666666, ans=0.0 2023-10-11 07:57:55,589 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=640686.6666666666, ans=0.125 2023-10-11 07:58:04,799 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.42 vs. 
limit=15.0 2023-10-11 07:58:06,210 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=640733.3333333334, ans=0.0 2023-10-11 07:58:15,416 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=640733.3333333334, ans=0.2 2023-10-11 07:58:21,973 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=640780.0, ans=0.2 2023-10-11 07:58:39,343 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.403e+02 1.748e+02 1.934e+02 2.275e+02 4.012e+02, threshold=3.867e+02, percent-clipped=1.0 2023-10-11 07:58:59,152 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=640920.0, ans=0.0 2023-10-11 07:59:06,563 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 07:59:27,340 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=641060.0, ans=0.125 2023-10-11 07:59:45,925 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.30 vs. limit=15.0 2023-10-11 07:59:47,684 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.28 vs. limit=15.0 2023-10-11 07:59:48,666 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=641106.6666666666, ans=0.125 2023-10-11 08:00:02,218 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=641200.0, ans=0.2 2023-10-11 08:00:37,148 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.309e+02 1.673e+02 1.803e+02 2.035e+02 2.555e+02, threshold=3.607e+02, percent-clipped=0.0 2023-10-11 08:00:40,565 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.41 vs. limit=10.0 2023-10-11 08:00:40,622 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.27 vs. 
limit=12.0 2023-10-11 08:00:51,268 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=641386.6666666666, ans=0.0 2023-10-11 08:00:52,142 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=641386.6666666666, ans=0.125 2023-10-11 08:00:53,021 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=641386.6666666666, ans=0.025 2023-10-11 08:01:02,244 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=641433.3333333334, ans=0.0 2023-10-11 08:01:08,563 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=641480.0, ans=0.05 2023-10-11 08:01:25,208 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=641526.6666666666, ans=0.125 2023-10-11 08:01:28,001 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=641526.6666666666, ans=0.125 2023-10-11 08:01:53,850 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=641620.0, ans=0.2 2023-10-11 08:02:06,863 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.54 vs. limit=10.0 2023-10-11 08:02:09,028 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=641713.3333333334, ans=0.125 2023-10-11 08:02:09,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=641713.3333333334, ans=0.0 2023-10-11 08:02:12,338 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=641713.3333333334, ans=0.0 2023-10-11 08:02:14,934 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=641713.3333333334, ans=0.1 2023-10-11 08:02:28,566 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.99 vs. limit=15.0 2023-10-11 08:02:28,937 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.309e+02 1.545e+02 1.759e+02 1.878e+02 3.047e+02, threshold=3.518e+02, percent-clipped=0.0 2023-10-11 08:02:37,456 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=641853.3333333334, ans=0.0 2023-10-11 08:02:38,509 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.30 vs. limit=22.5 2023-10-11 08:02:48,738 INFO [train.py:1031] (0/4) Epoch 11, batch 1000, loss[loss=0.1883, simple_loss=0.2802, pruned_loss=0.0482, over 16930.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2939, pruned_loss=0.05858, over 12924279.36 frames. 
], batch size: 77, lr: 3.31e-03, grad_scale: 32.0 2023-10-11 08:03:15,460 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=641993.3333333334, ans=0.125 2023-10-11 08:03:29,708 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=642086.6666666666, ans=0.0 2023-10-11 08:03:33,078 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=642086.6666666666, ans=0.125 2023-10-11 08:03:50,609 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=642180.0, ans=0.125 2023-10-11 08:03:55,554 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=642180.0, ans=0.2 2023-10-11 08:03:56,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=642180.0, ans=0.0 2023-10-11 08:04:12,742 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.77 vs. limit=22.5 2023-10-11 08:04:13,930 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.410e+02 1.693e+02 1.909e+02 2.260e+02 3.030e+02, threshold=3.819e+02, percent-clipped=0.0 2023-10-11 08:04:15,320 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=642273.3333333334, ans=0.2 2023-10-11 08:04:22,007 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=642273.3333333334, ans=0.0 2023-10-11 08:04:24,747 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=642320.0, ans=0.04949747468305833 2023-10-11 08:04:27,208 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=642320.0, ans=0.1 2023-10-11 08:04:27,244 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=642320.0, ans=0.1 2023-10-11 08:04:31,624 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=642320.0, ans=0.125 2023-10-11 08:04:33,447 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=642366.6666666666, ans=0.09899494936611666 2023-10-11 08:04:34,504 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=642366.6666666666, ans=0.125 2023-10-11 08:04:36,356 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=642366.6666666666, ans=0.125 2023-10-11 08:05:17,290 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=642506.6666666666, ans=0.125 2023-10-11 08:05:29,527 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=642553.3333333334, ans=0.0 2023-10-11 08:05:30,491 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=642553.3333333334, ans=0.04949747468305833 2023-10-11 
08:05:30,625 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=642553.3333333334, ans=0.0 2023-10-11 08:05:43,613 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=642600.0, ans=0.0 2023-10-11 08:05:51,650 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=642646.6666666666, ans=0.0 2023-10-11 08:05:53,360 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=642646.6666666666, ans=0.04949747468305833 2023-10-11 08:05:59,903 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=642646.6666666666, ans=0.1 2023-10-11 08:06:17,524 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.359e+02 1.690e+02 1.863e+02 2.133e+02 2.919e+02, threshold=3.727e+02, percent-clipped=0.0 2023-10-11 08:06:19,555 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=642740.0, ans=0.125 2023-10-11 08:06:35,360 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=642786.6666666666, ans=0.0 2023-10-11 08:06:38,259 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=642833.3333333334, ans=0.125 2023-10-11 08:06:38,430 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.97 vs. limit=15.0 2023-10-11 08:06:38,458 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.49 vs. 
limit=10.0 2023-10-11 08:07:02,051 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=642926.6666666666, ans=0.125 2023-10-11 08:07:05,014 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=642926.6666666666, ans=0.125 2023-10-11 08:07:06,953 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=642926.6666666666, ans=0.0 2023-10-11 08:07:23,652 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=643020.0, ans=0.125 2023-10-11 08:07:35,090 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=643066.6666666666, ans=0.125 2023-10-11 08:07:45,630 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=643113.3333333334, ans=0.09899494936611666 2023-10-11 08:08:06,536 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.289e+02 1.633e+02 1.809e+02 2.052e+02 3.009e+02, threshold=3.618e+02, percent-clipped=0.0 2023-10-11 08:08:21,854 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=643253.3333333334, ans=0.0 2023-10-11 08:08:27,756 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=643300.0, ans=0.1 2023-10-11 08:08:57,209 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.47 vs. limit=10.0 2023-10-11 08:09:03,649 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=643440.0, ans=0.09899494936611666 2023-10-11 08:09:27,401 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=643533.3333333334, ans=0.0 2023-10-11 08:09:32,449 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.74 vs. limit=15.0 2023-10-11 08:09:32,900 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=643580.0, ans=0.0 2023-10-11 08:09:34,931 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 08:09:51,706 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.72 vs. 
limit=15.0 2023-10-11 08:09:54,789 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.365e+02 1.635e+02 1.801e+02 2.048e+02 2.753e+02, threshold=3.602e+02, percent-clipped=0.0 2023-10-11 08:09:58,891 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=643673.3333333334, ans=0.1 2023-10-11 08:10:03,303 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=643673.3333333334, ans=0.125 2023-10-11 08:10:04,244 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=643720.0, ans=0.125 2023-10-11 08:10:18,443 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.71 vs. limit=15.0 2023-10-11 08:10:19,574 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=643766.6666666666, ans=0.04949747468305833 2023-10-11 08:10:29,960 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=643813.3333333334, ans=0.2 2023-10-11 08:10:30,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=643813.3333333334, ans=10.0 2023-10-11 08:10:43,700 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.92 vs. limit=22.5 2023-10-11 08:10:44,445 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.32 vs. limit=22.5 2023-10-11 08:10:46,181 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=643860.0, ans=0.125 2023-10-11 08:10:48,866 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=643906.6666666666, ans=0.125 2023-10-11 08:10:51,486 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=643906.6666666666, ans=0.0 2023-10-11 08:10:55,462 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=643906.6666666666, ans=0.2 2023-10-11 08:11:03,550 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=643953.3333333334, ans=0.07 2023-10-11 08:11:25,805 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.62 vs. limit=15.0 2023-10-11 08:11:37,465 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=644093.3333333334, ans=0.125 2023-10-11 08:11:46,638 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.322e+02 1.695e+02 1.901e+02 2.189e+02 3.215e+02, threshold=3.801e+02, percent-clipped=0.0 2023-10-11 08:12:11,766 INFO [train.py:1031] (0/4) Epoch 11, batch 1500, loss[loss=0.2105, simple_loss=0.2983, pruned_loss=0.06138, over 16487.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2922, pruned_loss=0.05801, over 17293160.03 frames. 
], batch size: 266, lr: 3.31e-03, grad_scale: 32.0 2023-10-11 08:12:30,706 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.17 vs. limit=10.0 2023-10-11 08:12:46,834 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=644373.3333333334, ans=0.05 2023-10-11 08:12:57,287 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=644420.0, ans=0.125 2023-10-11 08:13:21,950 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=644513.3333333334, ans=0.1 2023-10-11 08:13:34,853 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 08:13:42,519 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.302e+02 1.743e+02 1.906e+02 2.177e+02 3.390e+02, threshold=3.812e+02, percent-clipped=0.0 2023-10-11 08:13:44,577 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.86 vs. limit=15.0 2023-10-11 08:14:10,745 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=644700.0, ans=0.95 2023-10-11 08:14:16,078 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=644746.6666666666, ans=0.125 2023-10-11 08:14:32,979 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-10-11 08:15:16,230 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=644933.3333333334, ans=0.125 2023-10-11 08:15:41,297 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.370e+02 1.672e+02 1.834e+02 1.966e+02 3.140e+02, threshold=3.669e+02, percent-clipped=0.0 2023-10-11 08:15:58,001 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=645120.0, ans=0.0 2023-10-11 08:15:58,817 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=645120.0, ans=0.125 2023-10-11 08:16:01,735 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.53 vs. limit=15.0 2023-10-11 08:16:12,059 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=645213.3333333334, ans=0.07 2023-10-11 08:16:18,484 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=645213.3333333334, ans=0.0 2023-10-11 08:16:44,102 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.75 vs. 
limit=12.0 2023-10-11 08:16:51,050 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=645353.3333333334, ans=0.0 2023-10-11 08:16:55,791 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=645400.0, ans=0.0 2023-10-11 08:17:04,605 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.39 vs. limit=15.0 2023-10-11 08:17:14,325 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=645446.6666666666, ans=0.1 2023-10-11 08:17:28,676 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=645493.3333333334, ans=0.125 2023-10-11 08:17:31,316 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.315e+02 1.697e+02 1.914e+02 2.087e+02 2.965e+02, threshold=3.827e+02, percent-clipped=0.0 2023-10-11 08:17:33,298 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=645540.0, ans=0.1 2023-10-11 08:17:43,759 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.37 vs. limit=15.0 2023-10-11 08:18:10,917 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=645680.0, ans=0.125 2023-10-11 08:18:29,766 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=645726.6666666666, ans=0.2 2023-10-11 08:18:56,891 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=645866.6666666666, ans=0.1 2023-10-11 08:18:58,183 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=645866.6666666666, ans=0.2 2023-10-11 08:19:10,592 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.91 vs. 
limit=15.0 2023-10-11 08:19:14,406 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=645913.3333333334, ans=0.125 2023-10-11 08:19:17,317 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=645960.0, ans=0.0 2023-10-11 08:19:24,027 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=645960.0, ans=0.2 2023-10-11 08:19:27,634 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.719e+02 1.956e+02 2.144e+02 3.870e+02, threshold=3.911e+02, percent-clipped=1.0 2023-10-11 08:19:33,054 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=646006.6666666666, ans=0.125 2023-10-11 08:19:47,249 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=646053.3333333334, ans=0.125 2023-10-11 08:19:51,280 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=646100.0, ans=0.1 2023-10-11 08:20:10,772 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.46 vs. limit=15.0 2023-10-11 08:20:23,097 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=646240.0, ans=0.125 2023-10-11 08:20:34,535 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=646240.0, ans=0.09899494936611666 2023-10-11 08:20:34,833 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.02 vs. limit=15.0 2023-10-11 08:21:31,976 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=646426.6666666666, ans=0.0 2023-10-11 08:21:32,053 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=646426.6666666666, ans=0.1 2023-10-11 08:21:35,539 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 1.660e+02 1.812e+02 2.162e+02 3.366e+02, threshold=3.625e+02, percent-clipped=0.0 2023-10-11 08:21:39,089 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.78 vs. limit=15.0 2023-10-11 08:21:40,450 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=646473.3333333334, ans=0.02 2023-10-11 08:21:48,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=646520.0, ans=0.2 2023-10-11 08:21:52,059 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=646520.0, ans=0.125 2023-10-11 08:21:57,671 INFO [train.py:1031] (0/4) Epoch 11, batch 2000, loss[loss=0.2299, simple_loss=0.3162, pruned_loss=0.07184, over 16018.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2928, pruned_loss=0.05803, over 20739340.54 frames. 
], batch size: 296, lr: 3.30e-03, grad_scale: 32.0 2023-10-11 08:22:56,287 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=646753.3333333334, ans=0.07 2023-10-11 08:22:56,628 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.81 vs. limit=15.0 2023-10-11 08:23:02,681 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.91 vs. limit=15.0 2023-10-11 08:23:03,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=646753.3333333334, ans=0.1 2023-10-11 08:23:14,343 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.16 vs. limit=22.5 2023-10-11 08:23:43,338 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 1.659e+02 1.847e+02 2.128e+02 3.565e+02, threshold=3.694e+02, percent-clipped=0.0 2023-10-11 08:23:45,642 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=646940.0, ans=0.1 2023-10-11 08:23:45,860 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.89 vs. limit=12.0 2023-10-11 08:23:48,901 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=646940.0, ans=0.0 2023-10-11 08:23:59,308 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=646986.6666666666, ans=0.0 2023-10-11 08:24:16,015 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=647033.3333333334, ans=0.0 2023-10-11 08:24:22,461 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.73 vs. 
limit=22.5 2023-10-11 08:24:51,838 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=647126.6666666666, ans=0.2 2023-10-11 08:24:52,825 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=647126.6666666666, ans=0.0 2023-10-11 08:24:53,894 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=647126.6666666666, ans=0.0 2023-10-11 08:25:20,237 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.569e-03 2023-10-11 08:25:23,764 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=647266.6666666666, ans=0.07 2023-10-11 08:25:23,896 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=647266.6666666666, ans=0.1 2023-10-11 08:25:24,731 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=647266.6666666666, ans=0.125 2023-10-11 08:25:24,838 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=647266.6666666666, ans=0.125 2023-10-11 08:25:43,781 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=647313.3333333334, ans=0.0 2023-10-11 08:25:57,736 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 08:26:00,846 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.452e+02 1.769e+02 1.954e+02 2.323e+02 3.601e+02, threshold=3.907e+02, percent-clipped=0.0 2023-10-11 08:26:03,506 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=647406.6666666666, ans=0.0 2023-10-11 08:26:13,279 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.81 vs. limit=22.5 2023-10-11 08:26:16,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=647453.3333333334, ans=0.125 2023-10-11 08:26:24,855 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=647500.0, ans=0.125 2023-10-11 08:26:45,104 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.51 vs. 
limit=22.5 2023-10-11 08:27:12,294 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 08:27:16,552 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=647733.3333333334, ans=0.1 2023-10-11 08:27:25,336 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=647780.0, ans=0.1 2023-10-11 08:27:36,721 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=647826.6666666666, ans=0.0 2023-10-11 08:27:42,686 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=647826.6666666666, ans=0.0 2023-10-11 08:27:49,844 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.434e+02 1.699e+02 1.899e+02 2.245e+02 2.904e+02, threshold=3.798e+02, percent-clipped=0.0 2023-10-11 08:28:02,546 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=647920.0, ans=0.125 2023-10-11 08:28:08,163 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=647966.6666666666, ans=0.1 2023-10-11 08:28:09,200 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.11 vs. limit=15.0 2023-10-11 08:28:22,944 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=648013.3333333334, ans=0.125 2023-10-11 08:28:26,714 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=648013.3333333334, ans=0.125 2023-10-11 08:29:06,974 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=648200.0, ans=0.1 2023-10-11 08:29:16,254 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=648246.6666666666, ans=0.125 2023-10-11 08:29:26,488 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=648246.6666666666, ans=0.2 2023-10-11 08:29:27,434 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=648293.3333333334, ans=0.125 2023-10-11 08:29:41,669 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 1.698e+02 1.826e+02 1.983e+02 2.729e+02, threshold=3.651e+02, percent-clipped=0.0 2023-10-11 08:29:52,985 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=648386.6666666666, ans=0.1 2023-10-11 08:29:56,752 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=648386.6666666666, ans=0.125 2023-10-11 08:30:20,764 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=1.086e-01 2023-10-11 08:30:24,452 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 08:30:34,260 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=648573.3333333334, ans=0.05 2023-10-11 08:30:36,383 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=648573.3333333334, ans=0.125 2023-10-11 08:30:42,386 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.33 vs. limit=15.0 2023-10-11 08:30:46,693 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=648620.0, ans=0.5 2023-10-11 08:30:49,857 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.85 vs. limit=22.5 2023-10-11 08:30:53,797 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.29 vs. limit=22.5 2023-10-11 08:31:01,372 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=648666.6666666666, ans=0.07 2023-10-11 08:31:13,287 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=5.23 vs. limit=12.0 2023-10-11 08:31:22,530 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=648760.0, ans=0.07 2023-10-11 08:31:22,543 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=648760.0, ans=0.2 2023-10-11 08:31:27,247 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.414e+02 1.685e+02 1.871e+02 2.097e+02 2.995e+02, threshold=3.742e+02, percent-clipped=0.0 2023-10-11 08:31:27,645 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=648806.6666666666, ans=0.2 2023-10-11 08:31:29,763 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=648806.6666666666, ans=0.125 2023-10-11 08:31:33,860 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=648806.6666666666, ans=0.125 2023-10-11 08:31:36,381 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.37 vs. limit=15.0 2023-10-11 08:31:47,649 INFO [train.py:1031] (0/4) Epoch 11, batch 2500, loss[loss=0.206, simple_loss=0.2965, pruned_loss=0.05779, over 16883.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2932, pruned_loss=0.05829, over 23400251.56 frames. 
], batch size: 110, lr: 3.29e-03, grad_scale: 32.0 2023-10-11 08:31:49,132 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=648900.0, ans=0.1 2023-10-11 08:31:54,339 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=648900.0, ans=0.125 2023-10-11 08:31:56,800 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=648900.0, ans=0.1 2023-10-11 08:31:58,836 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=648946.6666666666, ans=0.1 2023-10-11 08:32:14,451 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.34 vs. limit=6.0 2023-10-11 08:32:28,183 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=649040.0, ans=0.0 2023-10-11 08:32:34,206 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=649086.6666666666, ans=0.04949747468305833 2023-10-11 08:32:36,053 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=649086.6666666666, ans=0.09899494936611666 2023-10-11 08:33:13,289 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.16 vs. limit=22.5 2023-10-11 08:33:17,025 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.508e+02 1.733e+02 1.893e+02 2.172e+02 3.079e+02, threshold=3.787e+02, percent-clipped=0.0 2023-10-11 08:33:35,416 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 08:33:45,643 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=649366.6666666666, ans=0.125 2023-10-11 08:33:56,134 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=11.63 vs. limit=22.5 2023-10-11 08:34:09,925 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=10.62 vs. limit=15.0 2023-10-11 08:34:20,536 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=649506.6666666666, ans=0.125 2023-10-11 08:34:21,577 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=649506.6666666666, ans=0.5 2023-10-11 08:34:27,517 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=649553.3333333334, ans=0.125 2023-10-11 08:34:33,760 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.08 vs. limit=15.0 2023-10-11 08:34:44,063 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.87 vs. 
limit=15.0 2023-10-11 08:34:48,434 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=649646.6666666666, ans=0.5 2023-10-11 08:35:01,903 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.09 vs. limit=12.0 2023-10-11 08:35:06,386 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=649740.0, ans=0.125 2023-10-11 08:35:08,022 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.331e+02 1.715e+02 1.917e+02 2.322e+02 3.352e+02, threshold=3.835e+02, percent-clipped=0.0 2023-10-11 08:35:20,622 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.43 vs. limit=15.0 2023-10-11 08:35:25,082 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=649786.6666666666, ans=0.125 2023-10-11 08:35:26,316 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=649833.3333333334, ans=0.0 2023-10-11 08:35:53,941 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=649926.6666666666, ans=0.09899494936611666 2023-10-11 08:35:58,255 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=649926.6666666666, ans=0.2 2023-10-11 08:36:07,544 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=649973.3333333334, ans=0.125 2023-10-11 08:37:08,300 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.687e+02 1.856e+02 2.162e+02 3.629e+02, threshold=3.712e+02, percent-clipped=0.0 2023-10-11 08:37:13,274 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 08:37:20,498 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=650253.3333333334, ans=0.2 2023-10-11 08:37:45,222 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=650346.6666666666, ans=0.125 2023-10-11 08:37:58,231 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=650393.3333333334, ans=0.125 2023-10-11 08:38:04,950 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.48 vs. limit=22.5 2023-10-11 08:38:15,859 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=650440.0, ans=0.0 2023-10-11 08:38:24,878 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=650486.6666666666, ans=0.2 2023-10-11 08:38:27,986 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=650486.6666666666, ans=0.0 2023-10-11 08:38:30,033 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.27 vs. 
limit=10.0 2023-10-11 08:38:31,095 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=650533.3333333334, ans=0.0 2023-10-11 08:38:34,223 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.34 vs. limit=6.0 2023-10-11 08:38:41,585 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=650580.0, ans=0.2 2023-10-11 08:38:57,622 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=650626.6666666666, ans=0.2 2023-10-11 08:39:00,212 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=650626.6666666666, ans=0.125 2023-10-11 08:39:11,053 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.283e+02 1.735e+02 1.922e+02 2.122e+02 3.124e+02, threshold=3.844e+02, percent-clipped=0.0 2023-10-11 08:39:42,546 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=650766.6666666666, ans=0.125 2023-10-11 08:39:42,622 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=650766.6666666666, ans=0.05 2023-10-11 08:39:48,876 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=650813.3333333334, ans=0.0 2023-10-11 08:39:52,640 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=650813.3333333334, ans=0.125 2023-10-11 08:40:00,156 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=650860.0, ans=0.125 2023-10-11 08:40:03,714 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=650860.0, ans=0.1 2023-10-11 08:40:21,348 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=650953.3333333334, ans=0.2 2023-10-11 08:40:29,811 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=651000.0, ans=0.2 2023-10-11 08:40:38,770 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 08:40:46,571 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=651046.6666666666, ans=0.125 2023-10-11 08:41:04,189 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.377e+02 1.658e+02 1.870e+02 2.054e+02 2.597e+02, threshold=3.740e+02, percent-clipped=0.0 2023-10-11 08:41:05,448 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=651140.0, ans=0.125 2023-10-11 08:41:12,513 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=651186.6666666666, ans=0.1 2023-10-11 08:41:23,479 INFO [train.py:1031] (0/4) Epoch 11, batch 3000, loss[loss=0.2361, simple_loss=0.3163, pruned_loss=0.07793, over 16633.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2923, pruned_loss=0.05824, over 25472576.30 frames. 
], batch size: 241, lr: 3.29e-03, grad_scale: 32.0 2023-10-11 08:41:40,900 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=651280.0, ans=0.1 2023-10-11 08:41:40,930 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=651280.0, ans=0.125 2023-10-11 08:41:45,023 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=651326.6666666666, ans=0.2 2023-10-11 08:41:49,718 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.43 vs. limit=10.0 2023-10-11 08:41:53,224 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=651326.6666666666, ans=0.125 2023-10-11 08:41:53,598 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.48 vs. limit=22.5 2023-10-11 08:41:55,283 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=651373.3333333334, ans=0.1 2023-10-11 08:41:59,785 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=651373.3333333334, ans=0.0 2023-10-11 08:42:00,706 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=651373.3333333334, ans=0.125 2023-10-11 08:42:09,868 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=651420.0, ans=15.0 2023-10-11 08:42:23,076 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=651466.6666666666, ans=0.2 2023-10-11 08:42:23,086 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=651466.6666666666, ans=0.2 2023-10-11 08:42:24,011 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=651466.6666666666, ans=0.125 2023-10-11 08:42:31,370 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=651513.3333333334, ans=0.125 2023-10-11 08:42:32,441 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=651513.3333333334, ans=0.0 2023-10-11 08:42:42,321 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=651560.0, ans=0.2 2023-10-11 08:42:52,800 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.347e+02 1.771e+02 1.903e+02 2.092e+02 3.111e+02, threshold=3.806e+02, percent-clipped=0.0 2023-10-11 08:42:56,914 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=651606.6666666666, ans=0.0 2023-10-11 08:43:04,083 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=651653.3333333334, ans=0.1 2023-10-11 08:43:13,094 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=651700.0, ans=0.1 2023-10-11 08:43:35,754 INFO 
[scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.57 vs. limit=15.0 2023-10-11 08:43:36,858 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=651746.6666666666, ans=0.1 2023-10-11 08:43:41,006 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=651793.3333333334, ans=0.0 2023-10-11 08:43:56,035 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 08:44:08,427 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=651886.6666666666, ans=0.1 2023-10-11 08:44:12,252 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=651933.3333333334, ans=0.1 2023-10-11 08:44:14,611 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=651933.3333333334, ans=0.125 2023-10-11 08:44:26,290 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=651980.0, ans=0.125 2023-10-11 08:44:41,981 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=652026.6666666666, ans=0.025 2023-10-11 08:44:45,452 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.483e+02 1.678e+02 1.871e+02 2.178e+02 3.093e+02, threshold=3.741e+02, percent-clipped=0.0 2023-10-11 08:44:54,443 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=652120.0, ans=0.0 2023-10-11 08:45:05,838 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=652166.6666666666, ans=0.125 2023-10-11 08:45:19,090 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=652213.3333333334, ans=0.125 2023-10-11 08:45:19,170 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=652213.3333333334, ans=0.125 2023-10-11 08:45:22,996 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=652213.3333333334, ans=0.125 2023-10-11 08:45:46,740 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=652306.6666666666, ans=0.04949747468305833 2023-10-11 08:45:54,944 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=652306.6666666666, ans=0.125 2023-10-11 08:46:27,528 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=652446.6666666666, ans=0.125 2023-10-11 08:46:45,733 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=652540.0, ans=0.0 2023-10-11 08:46:47,523 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.661e+02 1.872e+02 2.041e+02 3.337e+02, threshold=3.745e+02, percent-clipped=0.0 2023-10-11 08:46:53,636 INFO [scaling.py:979] (0/4) Whitening: 
name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.05 vs. limit=15.0 2023-10-11 08:47:05,223 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=652586.6666666666, ans=0.0 2023-10-11 08:47:06,321 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.76 vs. limit=15.0 2023-10-11 08:47:11,806 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=652633.3333333334, ans=0.125 2023-10-11 08:47:18,068 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.28 vs. limit=15.0 2023-10-11 08:47:20,502 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=652680.0, ans=0.125 2023-10-11 08:47:54,122 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=652820.0, ans=0.1 2023-10-11 08:47:55,499 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.49 vs. limit=6.0 2023-10-11 08:48:00,489 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=652866.6666666666, ans=0.0 2023-10-11 08:48:25,657 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.48 vs. limit=22.5 2023-10-11 08:48:29,689 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=652960.0, ans=0.1 2023-10-11 08:48:34,153 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=653006.6666666666, ans=0.1 2023-10-11 08:48:34,239 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 08:48:40,167 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.390e+02 1.672e+02 1.857e+02 2.103e+02 3.532e+02, threshold=3.715e+02, percent-clipped=0.0 2023-10-11 08:48:40,856 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=653006.6666666666, ans=0.125 2023-10-11 08:48:56,163 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=653053.3333333334, ans=0.0 2023-10-11 08:49:20,216 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=653146.6666666666, ans=0.125 2023-10-11 08:49:24,393 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=653193.3333333334, ans=0.1 2023-10-11 08:49:58,327 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=653333.3333333334, ans=0.0 2023-10-11 08:50:07,023 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=653380.0, ans=0.2 2023-10-11 08:50:22,338 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=653426.6666666666, ans=0.125 2023-10-11 
08:50:29,372 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.373e+02 1.732e+02 1.893e+02 2.224e+02 3.725e+02, threshold=3.786e+02, percent-clipped=1.0 2023-10-11 08:50:50,821 INFO [train.py:1031] (0/4) Epoch 11, batch 3500, loss[loss=0.1891, simple_loss=0.2757, pruned_loss=0.05124, over 16646.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2921, pruned_loss=0.05804, over 27103222.64 frames. ], batch size: 56, lr: 3.28e-03, grad_scale: 16.0 2023-10-11 08:50:53,022 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=653566.6666666666, ans=0.0 2023-10-11 08:50:53,769 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=653566.6666666666, ans=0.125 2023-10-11 08:51:11,805 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=653660.0, ans=0.0 2023-10-11 08:51:35,932 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=653753.3333333334, ans=0.125 2023-10-11 08:51:36,029 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=653753.3333333334, ans=0.0 2023-10-11 08:51:37,914 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 08:51:38,128 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.02 vs. limit=22.5 2023-10-11 08:51:42,492 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=653753.3333333334, ans=0.5 2023-10-11 08:51:59,443 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=653846.6666666666, ans=0.1 2023-10-11 08:51:59,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=653846.6666666666, ans=0.09899494936611666 2023-10-11 08:52:26,135 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=653940.0, ans=0.1 2023-10-11 08:52:28,528 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.821e+02 1.990e+02 2.253e+02 3.874e+02, threshold=3.980e+02, percent-clipped=1.0 2023-10-11 08:52:47,383 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=654033.3333333334, ans=0.125 2023-10-11 08:52:57,065 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=654033.3333333334, ans=0.125 2023-10-11 08:52:58,037 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=654033.3333333334, ans=0.0 2023-10-11 08:53:09,578 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=654080.0, ans=0.125 2023-10-11 08:53:10,598 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=654126.6666666666, ans=0.09899494936611666 2023-10-11 08:53:24,358 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, 
batch_count=654173.3333333334, ans=0.1 2023-10-11 08:53:44,383 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=654220.0, ans=0.125 2023-10-11 08:54:05,253 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=654313.3333333334, ans=0.125 2023-10-11 08:54:25,196 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.361e+02 1.659e+02 1.908e+02 2.167e+02 3.702e+02, threshold=3.816e+02, percent-clipped=0.0 2023-10-11 08:54:27,531 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=654406.6666666666, ans=0.0 2023-10-11 08:54:43,843 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=654453.3333333334, ans=0.125 2023-10-11 08:54:57,141 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=654546.6666666666, ans=0.125 2023-10-11 08:55:11,298 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=654593.3333333334, ans=0.95 2023-10-11 08:55:24,523 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=654640.0, ans=0.0 2023-10-11 08:55:56,769 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.17 vs. limit=15.0 2023-10-11 08:56:02,184 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=654780.0, ans=0.125 2023-10-11 08:56:26,654 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=654873.3333333334, ans=0.0 2023-10-11 08:56:27,983 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 08:56:28,171 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.58 vs. limit=15.0 2023-10-11 08:56:28,585 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.359e+02 1.676e+02 1.839e+02 2.080e+02 2.698e+02, threshold=3.678e+02, percent-clipped=0.0 2023-10-11 08:56:46,648 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=654966.6666666666, ans=0.0 2023-10-11 08:56:53,988 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=654966.6666666666, ans=0.125 2023-10-11 08:56:58,921 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=655013.3333333334, ans=0.125 2023-10-11 08:57:03,006 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=655013.3333333334, ans=0.125 2023-10-11 08:57:08,320 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.78 vs. 
limit=22.5
2023-10-11 08:57:17,069 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=655060.0, ans=0.1
2023-10-11 08:57:19,050 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=655060.0, ans=0.125
2023-10-11 08:57:23,825 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=655106.6666666666, ans=0.125
2023-10-11 08:57:36,690 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.53 vs. limit=22.5
2023-10-11 08:58:17,480 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.602e+02 1.791e+02 1.912e+02 2.975e+02, threshold=3.581e+02, percent-clipped=0.0
2023-10-11 08:58:18,531 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=655340.0, ans=0.0
2023-10-11 08:58:22,773 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.01 vs. limit=22.5
2023-10-11 08:58:27,865 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=655386.6666666666, ans=0.07
2023-10-11 08:58:39,736 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.71 vs. limit=6.0
2023-10-11 08:59:24,149 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0
2023-10-11 08:59:25,846 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-11 08:59:32,133 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=655666.6666666666, ans=0.125
2023-10-11 08:59:49,102 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=655713.3333333334, ans=0.0
2023-10-11 08:59:50,563 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=655713.3333333334, ans=0.2
2023-10-11 08:59:58,732 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.39 vs. limit=15.0
2023-10-11 09:00:08,447 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.393e+02 1.651e+02 1.830e+02 2.035e+02 2.608e+02, threshold=3.659e+02, percent-clipped=0.0
2023-10-11 09:00:08,723 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=655806.6666666666, ans=0.035
2023-10-11 09:00:18,339 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=655853.3333333334, ans=0.125
2023-10-11 09:00:26,792 INFO [train.py:1031] (0/4) Epoch 11, batch 4000, loss[loss=0.1903, simple_loss=0.2789, pruned_loss=0.05088, over 16623.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2917, pruned_loss=0.05797, over 28370752.97 frames. ], batch size: 56, lr: 3.28e-03, grad_scale: 32.0
2023-10-11 09:00:38,334 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=25.44 vs. limit=22.5
2023-10-11 09:01:03,760 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=656040.0, ans=0.0
2023-10-11 09:01:20,096 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.73 vs. limit=15.0
2023-10-11 09:01:24,902 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=656133.3333333334, ans=0.125
2023-10-11 09:01:30,969 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=656133.3333333334, ans=0.1
2023-10-11 09:01:36,308 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-11 09:02:00,520 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.747e+02 1.882e+02 2.164e+02 3.100e+02, threshold=3.764e+02, percent-clipped=0.0
2023-10-11 09:02:02,233 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.06 vs. limit=22.5
2023-10-11 09:02:08,808 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.30 vs. limit=15.0
2023-10-11 09:02:11,735 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=656320.0, ans=0.125
2023-10-11 09:02:11,750 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=656320.0, ans=0.1
2023-10-11 09:02:22,754 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.40 vs. limit=15.0
2023-10-11 09:02:22,800 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.27 vs. limit=22.5
2023-10-11 09:02:26,489 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=656366.6666666666, ans=0.1
2023-10-11 09:02:30,279 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=656413.3333333334, ans=0.1
2023-10-11 09:02:46,492 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=656460.0, ans=0.0
2023-10-11 09:02:46,619 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=656460.0, ans=0.2
2023-10-11 09:03:01,527 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.01 vs. limit=15.0
2023-10-11 09:03:08,783 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.06 vs.
limit=15.0 2023-10-11 09:03:23,072 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=656600.0, ans=0.0 2023-10-11 09:03:24,990 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=656600.0, ans=0.2 2023-10-11 09:03:54,624 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=656693.3333333334, ans=0.035 2023-10-11 09:04:04,651 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=656740.0, ans=0.125 2023-10-11 09:04:06,669 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.421e+02 1.736e+02 1.997e+02 2.260e+02 4.004e+02, threshold=3.993e+02, percent-clipped=2.0 2023-10-11 09:04:16,852 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=656786.6666666666, ans=0.0 2023-10-11 09:04:40,300 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=656880.0, ans=0.0 2023-10-11 09:04:55,504 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=656926.6666666666, ans=0.2 2023-10-11 09:05:03,253 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 09:05:06,233 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.34 vs. limit=15.0 2023-10-11 09:05:20,141 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=657020.0, ans=0.125 2023-10-11 09:05:30,368 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=657066.6666666666, ans=0.125 2023-10-11 09:05:31,221 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=657066.6666666666, ans=0.125 2023-10-11 09:05:42,782 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.14 vs. 
limit=15.0 2023-10-11 09:05:54,664 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=657206.6666666666, ans=0.1 2023-10-11 09:05:58,986 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.645e+02 1.848e+02 2.109e+02 3.413e+02, threshold=3.696e+02, percent-clipped=0.0 2023-10-11 09:06:02,227 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=657206.6666666666, ans=0.125 2023-10-11 09:06:04,163 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=657253.3333333334, ans=0.0 2023-10-11 09:06:42,095 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=657393.3333333334, ans=0.95 2023-10-11 09:06:42,126 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=657393.3333333334, ans=0.2 2023-10-11 09:06:46,104 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=657393.3333333334, ans=0.2 2023-10-11 09:06:53,986 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.60 vs. limit=15.0 2023-10-11 09:07:05,248 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.34 vs. limit=22.5 2023-10-11 09:07:11,091 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.26 vs. limit=15.0 2023-10-11 09:07:11,682 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=657533.3333333334, ans=0.1 2023-10-11 09:07:27,611 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=657580.0, ans=0.125 2023-10-11 09:07:52,079 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.484e+02 1.793e+02 2.052e+02 2.305e+02 3.332e+02, threshold=4.104e+02, percent-clipped=0.0 2023-10-11 09:07:55,059 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=657673.3333333334, ans=0.125 2023-10-11 09:07:56,044 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=657673.3333333334, ans=0.125 2023-10-11 09:07:58,269 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.52 vs. 
limit=15.0 2023-10-11 09:08:15,765 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=657766.6666666666, ans=0.1 2023-10-11 09:08:21,146 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=657813.3333333334, ans=0.125 2023-10-11 09:08:25,464 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=657813.3333333334, ans=0.125 2023-10-11 09:08:28,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=657813.3333333334, ans=0.125 2023-10-11 09:08:31,898 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.90 vs. limit=15.0 2023-10-11 09:08:41,555 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=657860.0, ans=0.125 2023-10-11 09:08:57,206 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.41 vs. limit=22.5 2023-10-11 09:09:06,828 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=657953.3333333334, ans=0.0 2023-10-11 09:09:08,431 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=657953.3333333334, ans=0.1 2023-10-11 09:09:19,476 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=658000.0, ans=0.2 2023-10-11 09:09:31,761 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.14 vs. limit=10.0 2023-10-11 09:09:51,417 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.322e+02 1.680e+02 1.845e+02 2.079e+02 3.675e+02, threshold=3.690e+02, percent-clipped=0.0 2023-10-11 09:10:08,285 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.12 vs. limit=15.0 2023-10-11 09:10:10,624 INFO [train.py:1031] (0/4) Epoch 11, batch 4500, loss[loss=0.1918, simple_loss=0.2849, pruned_loss=0.0494, over 16647.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2922, pruned_loss=0.05782, over 29381682.09 frames. ], batch size: 202, lr: 3.27e-03, grad_scale: 32.0 2023-10-11 09:10:26,170 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.35 vs. limit=10.0 2023-10-11 09:10:37,706 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.92 vs. 
limit=10.0 2023-10-11 09:10:51,179 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=658373.3333333334, ans=0.125 2023-10-11 09:11:10,128 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=658466.6666666666, ans=0.125 2023-10-11 09:11:40,990 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.685e+02 1.846e+02 2.006e+02 3.057e+02, threshold=3.693e+02, percent-clipped=0.0 2023-10-11 09:11:58,890 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.65 vs. limit=22.5 2023-10-11 09:12:10,352 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.18 vs. limit=15.0 2023-10-11 09:13:04,429 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=658980.0, ans=0.2 2023-10-11 09:13:10,151 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=658980.0, ans=0.2 2023-10-11 09:13:19,276 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=659026.6666666666, ans=0.0 2023-10-11 09:13:27,176 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=659026.6666666666, ans=0.1 2023-10-11 09:13:27,450 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.73 vs. limit=15.0 2023-10-11 09:13:34,336 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.380e+02 1.738e+02 1.927e+02 2.249e+02 3.392e+02, threshold=3.854e+02, percent-clipped=0.0 2023-10-11 09:13:34,734 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=659073.3333333334, ans=0.125 2023-10-11 09:13:36,699 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=659073.3333333334, ans=0.125 2023-10-11 09:13:48,797 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=659120.0, ans=0.2 2023-10-11 09:14:55,125 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=659400.0, ans=0.125 2023-10-11 09:14:57,834 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=659400.0, ans=0.1 2023-10-11 09:15:02,201 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.60 vs. 
limit=15.0 2023-10-11 09:15:23,060 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=659540.0, ans=0.0 2023-10-11 09:15:24,019 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=659540.0, ans=0.1 2023-10-11 09:15:25,738 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.401e+02 1.692e+02 1.899e+02 2.139e+02 3.137e+02, threshold=3.798e+02, percent-clipped=0.0 2023-10-11 09:16:26,742 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=659773.3333333334, ans=0.125 2023-10-11 09:16:36,864 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=659820.0, ans=0.2 2023-10-11 09:16:49,072 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.65 vs. limit=6.0 2023-10-11 09:16:56,817 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.78 vs. limit=12.0 2023-10-11 09:17:07,900 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=659960.0, ans=0.1 2023-10-11 09:17:09,154 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.10 vs. limit=15.0 2023-10-11 09:17:15,135 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=659960.0, ans=0.0 2023-10-11 09:17:19,088 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=660006.6666666666, ans=0.0 2023-10-11 09:17:21,566 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.347e+02 1.737e+02 1.877e+02 2.122e+02 2.625e+02, threshold=3.753e+02, percent-clipped=0.0 2023-10-11 09:17:21,883 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=660006.6666666666, ans=0.2 2023-10-11 09:17:30,436 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.23 vs. limit=22.5 2023-10-11 09:17:46,420 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=660100.0, ans=0.1 2023-10-11 09:17:48,612 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.37 vs. limit=15.0 2023-10-11 09:17:56,022 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=660146.6666666666, ans=0.1 2023-10-11 09:18:03,906 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=660193.3333333334, ans=10.0 2023-10-11 09:18:05,854 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=660193.3333333334, ans=0.2 2023-10-11 09:18:09,427 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.04 vs. 
limit=15.0 2023-10-11 09:18:21,320 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=660240.0, ans=0.125 2023-10-11 09:18:51,291 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=660333.3333333334, ans=0.0 2023-10-11 09:18:52,501 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.34 vs. limit=15.0 2023-10-11 09:19:01,756 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=660380.0, ans=0.0 2023-10-11 09:19:06,894 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=660426.6666666666, ans=15.0 2023-10-11 09:19:21,207 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.384e+02 1.740e+02 1.897e+02 2.093e+02 2.860e+02, threshold=3.794e+02, percent-clipped=0.0 2023-10-11 09:19:32,739 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=660520.0, ans=0.0 2023-10-11 09:19:38,894 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=660566.6666666666, ans=0.2 2023-10-11 09:19:39,516 INFO [train.py:1031] (0/4) Epoch 11, batch 5000, loss[loss=0.217, simple_loss=0.3018, pruned_loss=0.06615, over 16591.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2919, pruned_loss=0.05787, over 30132503.89 frames. ], batch size: 66, lr: 3.27e-03, grad_scale: 32.0 2023-10-11 09:20:14,237 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.70 vs. limit=6.0 2023-10-11 09:20:25,551 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=660753.3333333334, ans=0.125 2023-10-11 09:20:28,609 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=660753.3333333334, ans=0.1 2023-10-11 09:20:41,477 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=660800.0, ans=0.125 2023-10-11 09:20:48,041 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.36 vs. limit=22.5 2023-10-11 09:20:57,170 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.38 vs. limit=6.0 2023-10-11 09:21:02,308 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=660893.3333333334, ans=0.04949747468305833 2023-10-11 09:21:11,357 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.703e+02 1.873e+02 2.084e+02 3.090e+02, threshold=3.745e+02, percent-clipped=0.0 2023-10-11 09:21:22,570 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.67 vs. 
limit=15.0 2023-10-11 09:21:27,015 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=660986.6666666666, ans=0.0 2023-10-11 09:21:38,058 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=661033.3333333334, ans=0.125 2023-10-11 09:21:54,953 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=661126.6666666666, ans=0.0 2023-10-11 09:22:02,850 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=661126.6666666666, ans=0.1 2023-10-11 09:22:10,198 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, batch_count=661173.3333333334, ans=0.05 2023-10-11 09:22:38,607 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=661313.3333333334, ans=0.125 2023-10-11 09:22:57,026 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 09:23:03,817 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.357e+02 1.747e+02 1.923e+02 2.253e+02 3.251e+02, threshold=3.846e+02, percent-clipped=0.0 2023-10-11 09:23:17,516 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.03 vs. limit=15.0 2023-10-11 09:23:19,432 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.15 vs. limit=15.0 2023-10-11 09:23:37,206 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=661546.6666666666, ans=0.1 2023-10-11 09:23:40,622 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=661546.6666666666, ans=0.0 2023-10-11 09:23:43,409 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 09:23:45,270 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=661593.3333333334, ans=0.0 2023-10-11 09:23:49,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=661593.3333333334, ans=0.125 2023-10-11 09:24:07,771 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 09:24:10,441 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=661686.6666666666, ans=0.125 2023-10-11 09:24:22,558 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=661733.3333333334, ans=0.04949747468305833 2023-10-11 09:24:25,045 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=661733.3333333334, ans=0.2 2023-10-11 09:24:43,164 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.75 vs. 
limit=15.0 2023-10-11 09:24:54,821 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.378e+02 1.670e+02 1.796e+02 1.975e+02 2.752e+02, threshold=3.592e+02, percent-clipped=0.0 2023-10-11 09:25:06,515 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=661920.0, ans=0.125 2023-10-11 09:25:22,677 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=662013.3333333334, ans=0.0 2023-10-11 09:25:30,962 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=662013.3333333334, ans=0.1 2023-10-11 09:26:11,150 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=662200.0, ans=0.1 2023-10-11 09:26:11,153 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=662200.0, ans=0.0 2023-10-11 09:26:11,408 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.58 vs. limit=15.0 2023-10-11 09:26:29,695 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=662246.6666666666, ans=0.07 2023-10-11 09:26:38,195 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 09:26:44,358 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=662340.0, ans=0.0 2023-10-11 09:26:44,493 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 09:26:45,235 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=662340.0, ans=0.0 2023-10-11 09:26:47,120 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=662340.0, ans=0.0 2023-10-11 09:26:47,877 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.402e+02 1.694e+02 1.904e+02 2.332e+02 3.854e+02, threshold=3.808e+02, percent-clipped=1.0 2023-10-11 09:26:59,677 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=662386.6666666666, ans=0.2 2023-10-11 09:27:02,542 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=662386.6666666666, ans=0.125 2023-10-11 09:27:04,758 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.98 vs. limit=15.0 2023-10-11 09:27:09,123 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=662433.3333333334, ans=0.125 2023-10-11 09:27:52,676 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=662620.0, ans=0.1 2023-10-11 09:28:01,449 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.20 vs. 
limit=12.0 2023-10-11 09:28:07,454 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=662666.6666666666, ans=0.0 2023-10-11 09:28:11,953 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.77 vs. limit=12.0 2023-10-11 09:28:23,005 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=662760.0, ans=0.125 2023-10-11 09:28:34,836 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=662806.6666666666, ans=0.2 2023-10-11 09:28:36,375 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.10 vs. limit=15.0 2023-10-11 09:28:37,481 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.386e+02 1.718e+02 1.902e+02 2.149e+02 3.083e+02, threshold=3.805e+02, percent-clipped=0.0 2023-10-11 09:28:37,825 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=662806.6666666666, ans=0.125 2023-10-11 09:28:40,733 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=662806.6666666666, ans=0.125 2023-10-11 09:28:46,073 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=662853.3333333334, ans=0.2 2023-10-11 09:28:54,011 INFO [train.py:1031] (0/4) Epoch 11, batch 5500, loss[loss=0.1847, simple_loss=0.2726, pruned_loss=0.04842, over 16394.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2917, pruned_loss=0.0578, over 30703714.86 frames. ], batch size: 50, lr: 3.26e-03, grad_scale: 32.0 2023-10-11 09:28:57,156 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.82 vs. 
limit=15.0 2023-10-11 09:29:08,381 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=662946.6666666666, ans=0.0 2023-10-11 09:29:09,327 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=662946.6666666666, ans=0.2 2023-10-11 09:29:10,401 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=662946.6666666666, ans=0.125 2023-10-11 09:29:13,361 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=662993.3333333334, ans=0.125 2023-10-11 09:29:17,208 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=662993.3333333334, ans=0.1 2023-10-11 09:29:26,505 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=663040.0, ans=6.0 2023-10-11 09:29:27,989 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=663040.0, ans=0.125 2023-10-11 09:29:38,735 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=663086.6666666666, ans=0.125 2023-10-11 09:29:52,843 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=663133.3333333334, ans=0.0 2023-10-11 09:29:59,373 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=663180.0, ans=0.05 2023-10-11 09:30:00,235 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=663180.0, ans=0.0 2023-10-11 09:30:01,896 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.84 vs. limit=22.5 2023-10-11 09:30:13,962 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=663226.6666666666, ans=0.07 2023-10-11 09:30:19,522 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=663273.3333333334, ans=0.0 2023-10-11 09:30:22,138 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.385e+02 1.708e+02 1.946e+02 2.384e+02 4.404e+02, threshold=3.893e+02, percent-clipped=1.0 2023-10-11 09:30:23,771 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=663273.3333333334, ans=0.125 2023-10-11 09:30:28,744 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=663320.0, ans=0.125 2023-10-11 09:30:51,825 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=663413.3333333334, ans=0.125 2023-10-11 09:30:52,011 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.87 vs. 
limit=15.0 2023-10-11 09:31:02,154 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=663460.0, ans=0.2 2023-10-11 09:31:05,908 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 09:31:08,685 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=663460.0, ans=0.0 2023-10-11 09:31:12,323 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=663506.6666666666, ans=0.09899494936611666 2023-10-11 09:31:16,911 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=663506.6666666666, ans=0.125 2023-10-11 09:31:19,120 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=663506.6666666666, ans=0.0 2023-10-11 09:31:21,260 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=663506.6666666666, ans=0.0 2023-10-11 09:31:22,826 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=663553.3333333334, ans=0.125 2023-10-11 09:31:37,902 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=663600.0, ans=0.0 2023-10-11 09:31:50,667 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=663646.6666666666, ans=0.0 2023-10-11 09:32:07,604 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=663740.0, ans=0.0 2023-10-11 09:32:12,839 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.744e+02 1.949e+02 2.169e+02 3.030e+02, threshold=3.897e+02, percent-clipped=0.0 2023-10-11 09:32:38,451 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=663833.3333333334, ans=0.1 2023-10-11 09:32:49,335 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.49 vs. limit=6.0 2023-10-11 09:32:49,877 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=663880.0, ans=0.1 2023-10-11 09:32:49,936 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=663880.0, ans=0.0 2023-10-11 09:32:55,924 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=663880.0, ans=0.05 2023-10-11 09:32:58,656 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.85 vs. 
limit=15.0 2023-10-11 09:33:00,317 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=663926.6666666666, ans=10.0 2023-10-11 09:33:05,910 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=663926.6666666666, ans=0.125 2023-10-11 09:33:26,812 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=664020.0, ans=0.0 2023-10-11 09:33:40,557 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=664113.3333333334, ans=0.125 2023-10-11 09:33:41,836 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.79 vs. limit=22.5 2023-10-11 09:33:51,419 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=664160.0, ans=0.0 2023-10-11 09:34:09,707 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.688e+02 1.867e+02 2.091e+02 3.041e+02, threshold=3.735e+02, percent-clipped=0.0 2023-10-11 09:34:13,367 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=664206.6666666666, ans=0.125 2023-10-11 09:34:14,483 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=664206.6666666666, ans=0.0 2023-10-11 09:34:27,273 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=664300.0, ans=0.1 2023-10-11 09:34:32,031 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=664300.0, ans=0.2 2023-10-11 09:34:54,791 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=664393.3333333334, ans=0.125 2023-10-11 09:35:02,421 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=664440.0, ans=0.125 2023-10-11 09:35:06,756 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=664440.0, ans=0.0 2023-10-11 09:35:18,185 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.72 vs. limit=15.0 2023-10-11 09:35:18,868 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=664486.6666666666, ans=0.2 2023-10-11 09:35:29,575 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=664533.3333333334, ans=0.125 2023-10-11 09:35:37,184 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=664580.0, ans=0.125 2023-10-11 09:35:50,681 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.11 vs. 
limit=15.0 2023-10-11 09:36:05,323 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.770e+02 2.003e+02 2.348e+02 3.357e+02, threshold=4.007e+02, percent-clipped=0.0 2023-10-11 09:36:08,734 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=664673.3333333334, ans=0.2 2023-10-11 09:36:18,315 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=664720.0, ans=0.0 2023-10-11 09:36:25,740 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 09:36:37,378 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=664813.3333333334, ans=0.125 2023-10-11 09:37:08,632 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.90 vs. limit=6.0 2023-10-11 09:37:10,294 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=664953.3333333334, ans=0.0 2023-10-11 09:37:14,119 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=664953.3333333334, ans=0.125 2023-10-11 09:37:35,024 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=665046.6666666666, ans=0.1 2023-10-11 09:37:44,428 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.11 vs. limit=15.0 2023-10-11 09:37:47,767 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=665093.3333333334, ans=0.1 2023-10-11 09:37:48,977 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=665093.3333333334, ans=0.035 2023-10-11 09:37:54,885 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.375e+02 1.660e+02 1.851e+02 2.129e+02 3.305e+02, threshold=3.703e+02, percent-clipped=0.0 2023-10-11 09:38:04,724 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.03 vs. limit=15.0 2023-10-11 09:38:13,261 INFO [train.py:1031] (0/4) Epoch 11, batch 6000, loss[loss=0.2018, simple_loss=0.2931, pruned_loss=0.05525, over 16851.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2922, pruned_loss=0.05797, over 31190960.36 frames. ], batch size: 72, lr: 3.25e-03, grad_scale: 32.0 2023-10-11 09:38:36,691 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=665326.6666666666, ans=0.125 2023-10-11 09:38:51,177 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=665373.3333333334, ans=0.0 2023-10-11 09:38:55,407 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.21 vs. 
limit=15.0 2023-10-11 09:39:08,908 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=665466.6666666666, ans=0.0 2023-10-11 09:39:46,164 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.735e+02 1.878e+02 2.158e+02 3.132e+02, threshold=3.756e+02, percent-clipped=0.0 2023-10-11 09:39:48,906 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.01 vs. limit=15.0 2023-10-11 09:40:23,889 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=665793.3333333334, ans=0.95 2023-10-11 09:40:25,535 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=665793.3333333334, ans=0.125 2023-10-11 09:40:41,559 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=665840.0, ans=0.0 2023-10-11 09:40:48,680 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=665886.6666666666, ans=0.125 2023-10-11 09:40:56,866 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.73 vs. limit=15.0 2023-10-11 09:41:14,474 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=665980.0, ans=0.0 2023-10-11 09:41:27,089 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.34 vs. limit=15.0 2023-10-11 09:41:33,172 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.81 vs. limit=15.0 2023-10-11 09:41:34,809 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.34 vs. limit=15.0 2023-10-11 09:41:34,978 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.31 vs. 
limit=15.0 2023-10-11 09:41:35,748 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.378e+02 1.742e+02 1.917e+02 2.156e+02 2.972e+02, threshold=3.833e+02, percent-clipped=0.0 2023-10-11 09:42:06,109 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=666213.3333333334, ans=0.1 2023-10-11 09:42:06,956 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=666213.3333333334, ans=0.125 2023-10-11 09:42:29,377 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=666306.6666666666, ans=0.1 2023-10-11 09:42:35,220 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=666353.3333333334, ans=10.0 2023-10-11 09:42:44,293 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=666353.3333333334, ans=0.1 2023-10-11 09:42:52,393 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=666400.0, ans=0.125 2023-10-11 09:42:55,115 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.31 vs. limit=10.0 2023-10-11 09:42:56,094 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.17 vs. limit=15.0 2023-10-11 09:43:20,642 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=666493.3333333334, ans=0.125 2023-10-11 09:43:25,073 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=666540.0, ans=0.0 2023-10-11 09:43:27,281 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=10.74 vs. limit=12.0 2023-10-11 09:43:30,287 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.339e+02 1.702e+02 1.919e+02 2.149e+02 3.434e+02, threshold=3.839e+02, percent-clipped=0.0 2023-10-11 09:43:34,421 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=666540.0, ans=0.0 2023-10-11 09:43:39,674 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=666586.6666666666, ans=0.125 2023-10-11 09:43:53,154 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=666633.3333333334, ans=0.125 2023-10-11 09:44:12,021 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.82 vs. 
limit=15.0 2023-10-11 09:44:26,998 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=666773.3333333334, ans=0.125 2023-10-11 09:44:31,303 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=666773.3333333334, ans=0.0 2023-10-11 09:45:06,675 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=666913.3333333334, ans=15.0 2023-10-11 09:45:30,219 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.379e+02 1.696e+02 1.908e+02 2.183e+02 3.331e+02, threshold=3.815e+02, percent-clipped=0.0 2023-10-11 09:45:48,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=667100.0, ans=0.0 2023-10-11 09:45:59,565 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=667146.6666666666, ans=0.0 2023-10-11 09:46:33,404 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=667286.6666666666, ans=0.125 2023-10-11 09:46:46,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=667333.3333333334, ans=0.125 2023-10-11 09:46:47,769 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=667333.3333333334, ans=0.125 2023-10-11 09:47:20,355 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=667473.3333333334, ans=0.125 2023-10-11 09:47:24,259 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.447e+02 1.716e+02 1.900e+02 2.198e+02 3.623e+02, threshold=3.801e+02, percent-clipped=0.0 2023-10-11 09:47:28,017 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=667473.3333333334, ans=0.125 2023-10-11 09:47:42,521 INFO [train.py:1031] (0/4) Epoch 11, batch 6500, loss[loss=0.2093, simple_loss=0.3037, pruned_loss=0.05743, over 16581.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2928, pruned_loss=0.05831, over 31540952.42 frames. ], batch size: 241, lr: 3.25e-03, grad_scale: 32.0 2023-10-11 09:47:56,511 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=667613.3333333334, ans=0.1 2023-10-11 09:48:05,295 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=667613.3333333334, ans=0.125 2023-10-11 09:48:20,435 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=667660.0, ans=0.1 2023-10-11 09:48:33,614 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=667706.6666666666, ans=0.125 2023-10-11 09:48:36,061 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=667753.3333333334, ans=0.125 2023-10-11 09:48:40,271 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.03 vs. 
limit=15.0 2023-10-11 09:49:08,871 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=667846.6666666666, ans=0.0 2023-10-11 09:49:22,334 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=667893.3333333334, ans=0.2 2023-10-11 09:49:24,199 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=667893.3333333334, ans=0.2 2023-10-11 09:49:30,796 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=667940.0, ans=0.125 2023-10-11 09:49:31,235 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.382e+02 1.696e+02 1.869e+02 2.144e+02 2.746e+02, threshold=3.739e+02, percent-clipped=0.0 2023-10-11 09:49:46,140 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.96 vs. limit=22.5 2023-10-11 09:49:55,445 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=668033.3333333334, ans=0.125 2023-10-11 09:49:55,468 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=668033.3333333334, ans=0.125 2023-10-11 09:49:56,512 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=668033.3333333334, ans=0.125 2023-10-11 09:50:00,124 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=668080.0, ans=0.125 2023-10-11 09:50:03,080 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.36 vs. limit=15.0 2023-10-11 09:50:03,636 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=668080.0, ans=0.0 2023-10-11 09:50:05,353 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=668080.0, ans=0.0 2023-10-11 09:50:12,394 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.91 vs. 
limit=12.0 2023-10-11 09:50:13,490 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=668126.6666666666, ans=15.0 2023-10-11 09:50:23,543 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=668173.3333333334, ans=0.1 2023-10-11 09:50:31,303 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=668220.0, ans=0.125 2023-10-11 09:50:56,351 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=668313.3333333334, ans=0.2 2023-10-11 09:51:02,296 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=668313.3333333334, ans=0.0 2023-10-11 09:51:11,935 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=668360.0, ans=0.95 2023-10-11 09:51:20,517 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.379e+02 1.701e+02 1.831e+02 2.005e+02 2.995e+02, threshold=3.663e+02, percent-clipped=0.0 2023-10-11 09:51:20,769 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=668406.6666666666, ans=0.0 2023-10-11 09:51:22,984 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.98 vs. limit=12.0 2023-10-11 09:51:28,786 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.12 vs. limit=10.0 2023-10-11 09:51:32,479 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=668453.3333333334, ans=0.0 2023-10-11 09:52:13,947 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=668640.0, ans=0.125 2023-10-11 09:52:57,148 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.90 vs. limit=15.0 2023-10-11 09:53:16,926 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.297e+02 1.743e+02 1.915e+02 2.188e+02 3.013e+02, threshold=3.830e+02, percent-clipped=0.0 2023-10-11 09:53:18,922 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=668873.3333333334, ans=0.0 2023-10-11 09:53:19,364 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.96 vs. limit=8.0 2023-10-11 09:53:19,958 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=668873.3333333334, ans=0.125 2023-10-11 09:53:30,694 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=668920.0, ans=0.125 2023-10-11 09:54:31,812 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.40 vs. limit=12.0 2023-10-11 09:55:00,436 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.26 vs. 
limit=15.0 2023-10-11 09:55:12,710 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.04 vs. limit=15.0 2023-10-11 09:55:14,234 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=669293.3333333334, ans=0.125 2023-10-11 09:55:23,926 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.347e+02 1.633e+02 1.765e+02 2.073e+02 2.917e+02, threshold=3.530e+02, percent-clipped=0.0 2023-10-11 09:55:31,410 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=669386.6666666666, ans=0.0 2023-10-11 09:55:51,452 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.89 vs. limit=15.0 2023-10-11 09:55:55,722 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.53 vs. limit=15.0 2023-10-11 09:56:00,556 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=669480.0, ans=0.1 2023-10-11 09:56:12,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=669573.3333333334, ans=0.0 2023-10-11 09:56:13,156 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=669573.3333333334, ans=0.125 2023-10-11 09:56:15,113 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=669573.3333333334, ans=0.125 2023-10-11 09:56:47,493 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=669713.3333333334, ans=0.125 2023-10-11 09:56:52,165 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=669713.3333333334, ans=0.0 2023-10-11 09:56:55,372 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=669713.3333333334, ans=0.1 2023-10-11 09:57:05,088 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=669760.0, ans=0.125 2023-10-11 09:57:14,197 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.415e+02 1.745e+02 2.058e+02 2.383e+02 3.136e+02, threshold=4.116e+02, percent-clipped=0.0 2023-10-11 09:57:28,084 INFO [train.py:1031] (0/4) Epoch 11, batch 7000, loss[loss=0.1963, simple_loss=0.2925, pruned_loss=0.05006, over 16912.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2933, pruned_loss=0.05818, over 31849423.54 frames. 
], batch size: 104, lr: 3.24e-03, grad_scale: 16.0 2023-10-11 09:57:39,290 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=669946.6666666666, ans=0.0 2023-10-11 09:57:44,624 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=669946.6666666666, ans=0.125 2023-10-11 09:57:58,232 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.72 vs. limit=10.0 2023-10-11 09:58:04,329 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=670040.0, ans=0.125 2023-10-11 09:58:39,665 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=670180.0, ans=15.0 2023-10-11 09:58:59,498 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=670273.3333333334, ans=0.0 2023-10-11 09:59:03,026 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.421e+02 1.661e+02 1.822e+02 2.011e+02 2.684e+02, threshold=3.644e+02, percent-clipped=0.0 2023-10-11 09:59:04,254 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=670273.3333333334, ans=0.1 2023-10-11 09:59:08,477 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=670320.0, ans=0.125 2023-10-11 09:59:23,101 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.63 vs. limit=6.0 2023-10-11 10:00:06,432 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.43 vs. 
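
The per-batch loss[...] fields in the train.py lines decompose exactly: in every batch report in this section, loss equals half the simple (linear-joiner) loss plus the pruned loss, and tot_loss is the same combination tracked as a running average over the frame counts shown. A quick check against the batch 7000 numbers (the 0.5 weight is inferred from the printed values, not read out of the code):

# batch 7000: loss=0.1963, simple_loss=0.2925, pruned_loss=0.05006
simple_loss, pruned_loss = 0.2925, 0.05006
print(round(0.5 * simple_loss + pruned_loss, 4))   # 0.1963

# the running average behaves the same way:
tot_simple, tot_pruned = 0.2933, 0.05818
print(round(0.5 * tot_simple + tot_pruned, 4))     # 0.2048
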
limit=15.0 2023-10-11 10:00:07,046 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=670553.3333333334, ans=0.1 2023-10-11 10:00:12,530 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=670553.3333333334, ans=0.125 2023-10-11 10:00:17,009 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=670600.0, ans=0.07 2023-10-11 10:00:31,729 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=670646.6666666666, ans=0.125 2023-10-11 10:00:31,761 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=670646.6666666666, ans=0.0 2023-10-11 10:00:40,200 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=670693.3333333334, ans=0.0 2023-10-11 10:00:49,545 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=670740.0, ans=0.2 2023-10-11 10:00:51,503 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.700e+02 1.824e+02 1.999e+02 2.782e+02, threshold=3.649e+02, percent-clipped=0.0 2023-10-11 10:01:13,819 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=670833.3333333334, ans=0.125 2023-10-11 10:01:45,113 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.52 vs. limit=15.0 2023-10-11 10:02:14,377 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=671020.0, ans=0.0 2023-10-11 10:02:32,332 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 10:03:02,234 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.704e+02 1.836e+02 2.086e+02 3.425e+02, threshold=3.672e+02, percent-clipped=0.0 2023-10-11 10:03:06,061 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=671253.3333333334, ans=0.125 2023-10-11 10:03:10,887 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.99 vs. limit=15.0 2023-10-11 10:03:17,059 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 10:04:11,033 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=671486.6666666666, ans=0.2 2023-10-11 10:04:18,718 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=671533.3333333334, ans=0.0 2023-10-11 10:04:22,258 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.19 vs. 
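
Each optim.py line summarizes a window of recent gradient norms as five order statistics (min, 25%, median, 75%, max); the clipping threshold is then Clipping_scale times the median, e.g. 2.0 x 1.836e+02 = 3.672e+02 in the last such entry above, and percent-clipped reports how often the window exceeded it. A sketch of that bookkeeping (a hypothetical helper, not optim.py's actual code):

import numpy as np

def clipping_report(grad_norms, clipping_scale=2.0):
    # Five order statistics of recent gradient norms, a threshold at
    # clipping_scale times the median, and the share of steps clipped.
    q = np.quantile(grad_norms, [0.0, 0.25, 0.5, 0.75, 1.0])
    threshold = clipping_scale * q[2]
    pct = 100.0 * float(np.mean(np.asarray(grad_norms) > threshold))
    return q, threshold, pct

norms = [142.9, 170.4, 183.6, 208.6, 342.5]   # quartiles from the log
q, thr, pct = clipping_report(norms)
print(thr, pct)   # 367.2 0.0 -- matching threshold=3.672e+02, percent-clipped=0.0
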
limit=15.0 2023-10-11 10:04:23,084 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=671533.3333333334, ans=0.1 2023-10-11 10:04:26,341 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.28 vs. limit=12.0 2023-10-11 10:04:27,235 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.40 vs. limit=22.5 2023-10-11 10:04:59,822 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.383e+02 1.705e+02 1.901e+02 2.252e+02 3.323e+02, threshold=3.803e+02, percent-clipped=0.0 2023-10-11 10:05:14,293 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=671766.6666666666, ans=0.1 2023-10-11 10:05:34,352 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=671860.0, ans=0.0 2023-10-11 10:05:39,657 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=671860.0, ans=0.0 2023-10-11 10:05:43,643 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=671860.0, ans=0.125 2023-10-11 10:05:50,101 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=671906.6666666666, ans=0.125 2023-10-11 10:06:03,144 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=671953.3333333334, ans=0.2 2023-10-11 10:06:07,402 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-144000.pt 2023-10-11 10:06:16,737 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=672000.0, ans=0.125 2023-10-11 10:06:22,528 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 10:06:45,329 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.45 vs. limit=15.0 2023-10-11 10:06:51,447 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.451e+02 1.752e+02 2.003e+02 2.365e+02 3.601e+02, threshold=4.006e+02, percent-clipped=0.0 2023-10-11 10:07:00,082 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=672186.6666666666, ans=0.125 2023-10-11 10:07:00,148 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=672186.6666666666, ans=0.125 2023-10-11 10:07:00,173 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=672186.6666666666, ans=0.0 2023-10-11 10:07:06,795 INFO [train.py:1031] (0/4) Epoch 11, batch 7500, loss[loss=0.2076, simple_loss=0.2921, pruned_loss=0.0616, over 16596.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2931, pruned_loss=0.05821, over 32028848.04 frames. 
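
The checkpoint.py line above shows batch-indexed saving: at global batch 144000 the training state is written to the experiment directory, independently of epoch boundaries. A minimal sketch of that pattern (the helper and the saved fields are illustrative; the real train.py also stores sampler, scaler, and scheduler state):

from pathlib import Path
import torch

def maybe_save_checkpoint(model, optimizer, batch_idx, exp_dir, save_every_n):
    # Write checkpoint-<batch>.pt whenever the global batch counter
    # crosses a multiple of save_every_n.
    if batch_idx == 0 or batch_idx % save_every_n != 0:
        return None
    path = Path(exp_dir) / f"checkpoint-{batch_idx}.pt"
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "batch_idx_train": batch_idx,
    }, path)
    return path
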
], batch size: 66, lr: 3.24e-03, grad_scale: 16.0 2023-10-11 10:07:10,044 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=672233.3333333334, ans=0.125 2023-10-11 10:07:41,994 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 10:07:50,553 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=672420.0, ans=0.0 2023-10-11 10:07:56,290 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=672420.0, ans=0.125 2023-10-11 10:08:18,640 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=672513.3333333334, ans=0.2 2023-10-11 10:08:22,504 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=672513.3333333334, ans=0.2 2023-10-11 10:08:42,707 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=672606.6666666666, ans=0.125 2023-10-11 10:08:44,705 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.333e+02 1.734e+02 1.910e+02 2.143e+02 2.869e+02, threshold=3.820e+02, percent-clipped=0.0 2023-10-11 10:08:54,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff3.min_abs, batch_count=672653.3333333334, ans=0.2 2023-10-11 10:09:21,805 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.58 vs. limit=15.0 2023-10-11 10:09:23,100 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=672793.3333333334, ans=0.05 2023-10-11 10:10:01,317 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 10:10:03,134 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=672886.6666666666, ans=0.125 2023-10-11 10:10:31,479 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.43 vs. limit=15.0 2023-10-11 10:10:48,798 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.328e+02 1.632e+02 1.838e+02 2.146e+02 3.121e+02, threshold=3.676e+02, percent-clipped=0.0 2023-10-11 10:10:51,497 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=673120.0, ans=0.125 2023-10-11 10:11:01,045 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=673166.6666666666, ans=0.2 2023-10-11 10:11:09,208 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.62 vs. 
limit=15.0 2023-10-11 10:11:20,985 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 10:11:27,366 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=673260.0, ans=0.0 2023-10-11 10:11:40,874 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=673306.6666666666, ans=0.0 2023-10-11 10:12:03,399 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=673400.0, ans=0.1 2023-10-11 10:12:11,917 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=673446.6666666666, ans=0.125 2023-10-11 10:12:17,582 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=673493.3333333334, ans=0.1 2023-10-11 10:12:35,339 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.396e+02 1.684e+02 1.863e+02 2.093e+02 3.310e+02, threshold=3.727e+02, percent-clipped=0.0 2023-10-11 10:12:43,954 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=673586.6666666666, ans=0.2 2023-10-11 10:12:45,027 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.42 vs. limit=22.5 2023-10-11 10:13:07,196 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=673680.0, ans=0.0 2023-10-11 10:13:23,949 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=673726.6666666666, ans=0.95 2023-10-11 10:13:50,772 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.08 vs. 
limit=15.0 2023-10-11 10:14:21,124 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=674006.6666666666, ans=0.125 2023-10-11 10:14:21,193 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=674006.6666666666, ans=0.0 2023-10-11 10:14:30,829 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.314e+02 1.695e+02 1.922e+02 2.072e+02 2.924e+02, threshold=3.845e+02, percent-clipped=0.0 2023-10-11 10:14:46,485 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=674100.0, ans=0.125 2023-10-11 10:15:03,050 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=674146.6666666666, ans=0.125 2023-10-11 10:15:10,044 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=674193.3333333334, ans=0.2 2023-10-11 10:15:16,838 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=674193.3333333334, ans=0.125 2023-10-11 10:15:25,181 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=674240.0, ans=0.0 2023-10-11 10:15:36,282 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=674286.6666666666, ans=0.125 2023-10-11 10:15:57,974 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=674380.0, ans=0.1 2023-10-11 10:16:28,769 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.384e+02 1.591e+02 1.735e+02 1.970e+02 2.940e+02, threshold=3.470e+02, percent-clipped=0.0 2023-10-11 10:16:31,962 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 10:16:34,799 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=674520.0, ans=0.2 2023-10-11 10:16:40,568 INFO [train.py:1031] (0/4) Epoch 11, batch 8000, loss[loss=0.2203, simple_loss=0.3089, pruned_loss=0.06587, over 16629.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2924, pruned_loss=0.05768, over 32185157.95 frames. ], batch size: 219, lr: 3.23e-03, grad_scale: 32.0 2023-10-11 10:16:40,946 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=674566.6666666666, ans=0.125 2023-10-11 10:16:54,952 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.75 vs. limit=15.0 2023-10-11 10:16:59,770 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=674613.3333333334, ans=0.2 2023-10-11 10:17:03,618 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=674660.0, ans=0.0 2023-10-11 10:17:06,423 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=7.28 vs. 
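
grad_scale in the batch reports is the dynamic fp16 loss scale: it sat at 16.0 through batches 7000-7500 and reads 32.0 by batch 8000, consistent with the scaler doubling after a long run of overflow-free steps. With PyTorch AMP this is standard GradScaler behavior (the growth settings below are illustrative defaults, not read from the recipe):

import torch

scaler = torch.cuda.amp.GradScaler(
    init_scale=16.0,       # the scale seen at batch 7000
    growth_factor=2.0,     # doubles the scale ...
    growth_interval=2000,  # ... after this many overflow-free steps
    backoff_factor=0.5,    # halves it when an overflow is hit
)
print(scaler.get_scale())  # 16.0 on a CUDA machine, until growth kicks in
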
limit=15.0 2023-10-11 10:17:07,884 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=674660.0, ans=0.125 2023-10-11 10:17:08,936 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=674660.0, ans=0.1 2023-10-11 10:17:26,065 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=674753.3333333334, ans=0.2 2023-10-11 10:17:33,581 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=674800.0, ans=0.1 2023-10-11 10:17:36,372 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=674800.0, ans=0.0 2023-10-11 10:18:01,820 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=674893.3333333334, ans=0.0 2023-10-11 10:18:04,837 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.81 vs. limit=12.0 2023-10-11 10:18:13,949 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.266e+02 1.642e+02 1.794e+02 2.027e+02 3.386e+02, threshold=3.588e+02, percent-clipped=0.0 2023-10-11 10:18:21,566 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.75 vs. limit=15.0 2023-10-11 10:18:27,561 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=675033.3333333334, ans=0.125 2023-10-11 10:18:32,043 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.31 vs. limit=12.0 2023-10-11 10:18:43,349 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=675080.0, ans=0.125 2023-10-11 10:18:47,801 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=675126.6666666666, ans=0.125 2023-10-11 10:18:48,744 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=675126.6666666666, ans=0.125 2023-10-11 10:18:48,773 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=675126.6666666666, ans=0.125 2023-10-11 10:18:49,814 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 10:18:53,280 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=675126.6666666666, ans=0.125 2023-10-11 10:18:57,089 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.00 vs. 
limit=22.5 2023-10-11 10:18:57,667 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=675126.6666666666, ans=0.2 2023-10-11 10:19:22,713 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=675220.0, ans=0.1 2023-10-11 10:19:24,544 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=675220.0, ans=0.125 2023-10-11 10:19:58,578 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=675313.3333333334, ans=0.025 2023-10-11 10:20:08,179 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=675360.0, ans=0.125 2023-10-11 10:20:22,104 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 1.697e+02 1.814e+02 2.042e+02 2.908e+02, threshold=3.629e+02, percent-clipped=0.0 2023-10-11 10:20:33,538 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=675453.3333333334, ans=0.125 2023-10-11 10:20:39,884 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=675500.0, ans=0.125 2023-10-11 10:20:42,953 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=675500.0, ans=0.0 2023-10-11 10:20:53,015 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.87 vs. limit=15.0 2023-10-11 10:21:00,941 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=675593.3333333334, ans=10.0 2023-10-11 10:21:09,232 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=675593.3333333334, ans=0.0 2023-10-11 10:21:13,611 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=675640.0, ans=0.0 2023-10-11 10:21:18,041 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=675640.0, ans=0.1 2023-10-11 10:21:20,972 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 10:21:23,096 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.58 vs. limit=22.5 2023-10-11 10:21:46,370 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=675780.0, ans=0.0 2023-10-11 10:22:03,982 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 10:22:05,456 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.48 vs. limit=15.0 2023-10-11 10:22:07,904 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.75 vs. 
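
The WithLoss lines attach an auxiliary penalty to the attention-weight tensors and periodically print its accumulated value; loss-sum=0.000e+00 throughout this stretch means the monitored activations stayed inside their allowed range. One way to express such a penalty (an illustrative reading of these entries, not scaling.py's exact mechanism; the limit is hypothetical):

import torch

def out_of_range_penalty(attn_scores, limit=50.0):
    # Sum of how far the scores stray outside [-limit, limit]; added to
    # the loss with a small weight it discourages blow-ups, and its
    # running total is what a loss-sum=... line would report.
    excess = (attn_scores.abs() - limit).clamp(min=0.0)
    return excess.sum()

scores = torch.randn(4, 8, 100, 100) * 5.0
print(out_of_range_penalty(scores).item())   # 0.0 while scores stay in range
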
limit=15.0 2023-10-11 10:22:14,656 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.314e+02 1.687e+02 1.833e+02 2.058e+02 3.511e+02, threshold=3.666e+02, percent-clipped=0.0 2023-10-11 10:22:28,564 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=675920.0, ans=0.1 2023-10-11 10:22:36,651 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=675966.6666666666, ans=0.125 2023-10-11 10:22:49,355 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=676013.3333333334, ans=0.125 2023-10-11 10:23:04,947 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=676106.6666666666, ans=0.1 2023-10-11 10:23:14,304 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=676106.6666666666, ans=0.0 2023-10-11 10:23:15,507 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 10:23:20,244 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=676153.3333333334, ans=0.125 2023-10-11 10:23:31,111 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=676200.0, ans=0.1 2023-10-11 10:23:35,922 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=676200.0, ans=0.125 2023-10-11 10:24:06,835 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.681e+02 1.924e+02 2.224e+02 4.002e+02, threshold=3.848e+02, percent-clipped=3.0 2023-10-11 10:25:03,885 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-11 10:25:12,075 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=676620.0, ans=0.0 2023-10-11 10:25:12,305 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.76 vs. limit=22.5 2023-10-11 10:25:37,135 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=6.37 vs. limit=15.0 2023-10-11 10:25:56,906 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.27 vs. limit=15.0 2023-10-11 10:25:58,695 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=676806.6666666666, ans=0.1 2023-10-11 10:26:02,102 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.404e+02 1.669e+02 1.852e+02 2.064e+02 2.874e+02, threshold=3.704e+02, percent-clipped=0.0 2023-10-11 10:26:19,551 INFO [train.py:1031] (0/4) Epoch 11, batch 8500, loss[loss=0.2229, simple_loss=0.3055, pruned_loss=0.07011, over 16845.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2927, pruned_loss=0.05761, over 32342697.78 frames. ], batch size: 188, lr: 3.23e-03, grad_scale: 32.0 2023-10-11 10:26:28,910 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.98 vs. 
limit=12.0 2023-10-11 10:26:38,392 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=676946.6666666666, ans=0.125 2023-10-11 10:26:39,274 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=676946.6666666666, ans=0.0 2023-10-11 10:26:41,224 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=676993.3333333334, ans=0.2 2023-10-11 10:26:46,909 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=676993.3333333334, ans=0.5 2023-10-11 10:26:52,035 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=677040.0, ans=0.125 2023-10-11 10:27:06,636 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=677086.6666666666, ans=0.0 2023-10-11 10:27:17,602 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.09 vs. limit=15.0 2023-10-11 10:27:18,238 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=677133.3333333334, ans=0.2 2023-10-11 10:27:18,316 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=677133.3333333334, ans=0.125 2023-10-11 10:27:29,246 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=677180.0, ans=0.1 2023-10-11 10:27:41,340 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=677226.6666666666, ans=0.125 2023-10-11 10:27:59,185 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.441e+02 1.725e+02 1.888e+02 2.091e+02 2.656e+02, threshold=3.776e+02, percent-clipped=0.0 2023-10-11 10:28:32,037 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=677413.3333333334, ans=0.125 2023-10-11 10:28:37,531 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=677413.3333333334, ans=0.125 2023-10-11 10:29:02,582 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=677506.6666666666, ans=0.125 2023-10-11 10:29:06,980 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=677553.3333333334, ans=0.2 2023-10-11 10:29:08,961 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=677553.3333333334, ans=0.0 2023-10-11 10:29:19,370 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=677600.0, ans=0.125 2023-10-11 10:29:20,928 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=677600.0, ans=0.1 2023-10-11 10:29:27,794 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=677646.6666666666, ans=0.125 2023-10-11 10:29:47,522 INFO [scaling.py:199] (0/4) ScheduledFloat: 
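
The bypass.scale_min and bypass_mid.scale_min entries (ans=0.2 at this point) govern residual interpolation around each encoder block: the block output is blended with its input through a learned per-channel coefficient whose floor is the scheduled scale_min, while bypass.skip_rate is the probability of skipping the block outright. A sketch of the blending step under that reading (details differ from zipformer's actual BypassModule):

import torch
import torch.nn as nn

class Bypass(nn.Module):
    # out = x + c * (y - x): c == 1 uses the block fully, c == 0 bypasses it.
    def __init__(self, num_channels, scale_min=0.2, scale_max=1.0):
        super().__init__()
        self.scale = nn.Parameter(torch.full((num_channels,), 0.5))
        self.scale_min, self.scale_max = scale_min, scale_max

    def forward(self, x, y):
        # Clamp the learned coefficient to [scale_min, scale_max]; the
        # schedule seen in the log sets the floor as training proceeds.
        c = self.scale.clamp(self.scale_min, self.scale_max)
        return x + c * (y - x)
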
name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=677693.3333333334, ans=0.2 2023-10-11 10:30:03,231 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.312e+02 1.607e+02 1.790e+02 2.006e+02 2.754e+02, threshold=3.579e+02, percent-clipped=0.0 2023-10-11 10:30:19,658 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.54 vs. limit=6.0 2023-10-11 10:30:23,574 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=677833.3333333334, ans=0.0 2023-10-11 10:30:32,330 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=677880.0, ans=0.125 2023-10-11 10:30:34,956 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=677880.0, ans=0.125 2023-10-11 10:30:43,808 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=25.35 vs. limit=22.5 2023-10-11 10:30:51,802 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=677973.3333333334, ans=0.1 2023-10-11 10:31:02,015 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=677973.3333333334, ans=0.0 2023-10-11 10:31:03,065 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.88 vs. limit=15.0 2023-10-11 10:31:05,103 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.73 vs. limit=15.0 2023-10-11 10:31:33,133 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=678113.3333333334, ans=0.125 2023-10-11 10:31:46,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=678160.0, ans=0.1 2023-10-11 10:31:51,482 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=678160.0, ans=0.0 2023-10-11 10:31:55,095 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=678206.6666666666, ans=0.125 2023-10-11 10:31:56,653 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=678206.6666666666, ans=0.125 2023-10-11 10:31:57,613 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.15 vs. limit=6.0 2023-10-11 10:32:00,052 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=678206.6666666666, ans=0.2 2023-10-11 10:32:02,500 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.312e+02 1.546e+02 1.712e+02 1.966e+02 3.960e+02, threshold=3.425e+02, percent-clipped=1.0 2023-10-11 10:32:07,938 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.37 vs. 
limit=6.0 2023-10-11 10:32:15,955 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=678300.0, ans=0.1 2023-10-11 10:32:22,230 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=678300.0, ans=0.5 2023-10-11 10:32:39,869 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=678393.3333333334, ans=0.0 2023-10-11 10:32:49,014 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.50 vs. limit=10.0 2023-10-11 10:33:04,731 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=678486.6666666666, ans=0.1 2023-10-11 10:33:06,912 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=678533.3333333334, ans=0.125 2023-10-11 10:33:18,727 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.08 vs. limit=15.0 2023-10-11 10:33:19,541 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=678580.0, ans=0.1 2023-10-11 10:33:38,288 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.44 vs. limit=15.0 2023-10-11 10:33:43,121 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=678673.3333333334, ans=0.125 2023-10-11 10:33:49,012 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.346e+02 1.624e+02 1.784e+02 2.011e+02 2.558e+02, threshold=3.568e+02, percent-clipped=0.0 2023-10-11 10:34:09,346 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=678766.6666666666, ans=0.125 2023-10-11 10:34:12,375 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=678766.6666666666, ans=0.125 2023-10-11 10:34:32,182 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.36 vs. 
limit=15.0 2023-10-11 10:34:37,027 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=678906.6666666666, ans=0.07 2023-10-11 10:34:44,222 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=678906.6666666666, ans=0.125 2023-10-11 10:34:58,295 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=679000.0, ans=0.1 2023-10-11 10:35:02,126 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=679000.0, ans=0.1 2023-10-11 10:35:08,375 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=679046.6666666666, ans=0.0 2023-10-11 10:35:11,136 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=679046.6666666666, ans=0.09899494936611666 2023-10-11 10:35:37,603 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=679140.0, ans=0.2 2023-10-11 10:35:40,861 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.357e+02 1.726e+02 1.944e+02 2.100e+02 2.887e+02, threshold=3.888e+02, percent-clipped=0.0 2023-10-11 10:35:53,841 INFO [train.py:1031] (0/4) Epoch 11, batch 9000, loss[loss=0.2546, simple_loss=0.3253, pruned_loss=0.09192, over 15552.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.2918, pruned_loss=0.05729, over 32427692.05 frames. ], batch size: 350, lr: 3.22e-03, grad_scale: 32.0 2023-10-11 10:35:55,556 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=679233.3333333334, ans=0.1 2023-10-11 10:36:10,164 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=679280.0, ans=0.0 2023-10-11 10:36:15,196 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.18 vs. 
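
The learning rate in the batch reports creeps down smoothly (3.24e-03 at batch 7000, 3.23e-03 by 8000, 3.22e-03 by 9000) because it is a continuous function of both the batch and epoch counters rather than a stepwise schedule. The Eden schedule used by these recipes has roughly this shape (treat the exact exponents as an assumption here):

def eden_lr(base_lr, batch, epoch, lr_batches, lr_epochs):
    # Polynomial decay in both counters: near-flat early in training, and
    # by epoch 11 changing only in the third significant digit per few
    # thousand batches, as the log shows.
    batch_factor = ((batch**2 + lr_batches**2) / lr_batches**2) ** -0.25
    epoch_factor = ((epoch**2 + lr_epochs**2) / lr_epochs**2) ** -0.25
    return base_lr * batch_factor * epoch_factor
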
limit=22.5 2023-10-11 10:36:55,309 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 10:37:04,415 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=679513.3333333334, ans=0.125 2023-10-11 10:37:14,677 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=679560.0, ans=0.2 2023-10-11 10:37:28,597 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.595e+02 1.750e+02 2.007e+02 2.919e+02, threshold=3.500e+02, percent-clipped=0.0 2023-10-11 10:37:29,022 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=679606.6666666666, ans=0.1 2023-10-11 10:38:14,442 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=679840.0, ans=0.125 2023-10-11 10:38:16,325 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=679840.0, ans=0.2 2023-10-11 10:38:34,593 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=679886.6666666666, ans=0.125 2023-10-11 10:38:48,434 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=679980.0, ans=0.0 2023-10-11 10:38:54,128 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.10 vs. limit=15.0 2023-10-11 10:39:00,463 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=680026.6666666666, ans=0.2 2023-10-11 10:39:03,100 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.44 vs. limit=6.0 2023-10-11 10:39:10,901 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=680073.3333333334, ans=0.0 2023-10-11 10:39:12,604 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=680073.3333333334, ans=0.125 2023-10-11 10:39:15,759 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.496e+02 1.731e+02 1.942e+02 2.109e+02 3.154e+02, threshold=3.884e+02, percent-clipped=0.0 2023-10-11 10:39:29,240 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=680166.6666666666, ans=0.0 2023-10-11 10:39:51,473 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=680260.0, ans=0.125 2023-10-11 10:39:55,553 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=680260.0, ans=0.2 2023-10-11 10:40:00,552 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=680306.6666666666, ans=0.0 2023-10-11 10:40:11,597 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.00 vs. 
limit=15.0 2023-10-11 10:41:01,336 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.385e+02 1.729e+02 1.867e+02 2.056e+02 3.247e+02, threshold=3.733e+02, percent-clipped=0.0 2023-10-11 10:41:04,545 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=680586.6666666666, ans=0.125 2023-10-11 10:41:07,177 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.64 vs. limit=22.5 2023-10-11 10:41:36,575 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.15 vs. limit=15.0 2023-10-11 10:41:41,147 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=680726.6666666666, ans=0.0 2023-10-11 10:41:47,259 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=680773.3333333334, ans=0.09899494936611666 2023-10-11 10:42:07,864 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=680820.0, ans=0.1 2023-10-11 10:42:14,662 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=680866.6666666666, ans=0.125 2023-10-11 10:42:15,007 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.03 vs. limit=22.5 2023-10-11 10:42:17,811 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=680866.6666666666, ans=0.2 2023-10-11 10:42:20,778 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.80 vs. limit=22.5 2023-10-11 10:42:25,773 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=680913.3333333334, ans=0.0 2023-10-11 10:42:27,446 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=680913.3333333334, ans=0.125 2023-10-11 10:42:29,119 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=680913.3333333334, ans=0.0 2023-10-11 10:42:55,298 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=681006.6666666666, ans=0.125 2023-10-11 10:43:03,375 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.362e+02 1.724e+02 1.913e+02 2.140e+02 2.952e+02, threshold=3.827e+02, percent-clipped=0.0 2023-10-11 10:43:03,824 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=681053.3333333334, ans=0.0 2023-10-11 10:43:06,204 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=681053.3333333334, ans=0.125 2023-10-11 10:43:13,302 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=7.54 vs. 
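
The many balancer.prob, min_positive, max_positive, and min_abs/max_abs entries describe activation balancers: they measure, per channel, how often an activation is positive and how large it is, then nudge gradients when a channel drifts outside its allowed band; the scheduled prob (ans=0.125 here) is the chance the correction is applied on a given batch. The measurement side, sketched (an illustrative reading of these fields):

import torch

def balancer_violations(x, min_positive=0.05, max_positive=0.95):
    # x: (num_frames, num_channels).  Flags channels whose fraction of
    # positive values falls outside [min_positive, max_positive], the
    # bounds seen in the log's balancer entries.
    frac_pos = (x > 0).float().mean(dim=0)
    return (frac_pos < min_positive) | (frac_pos > max_positive)

x = torch.randn(1000, 256)
print(balancer_violations(x).sum().item())   # ~0 for roughly centered inputs
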
limit=15.0 2023-10-11 10:43:23,469 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=681100.0, ans=0.1 2023-10-11 10:43:26,781 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.85 vs. limit=10.0 2023-10-11 10:43:31,135 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=681146.6666666666, ans=0.125 2023-10-11 10:43:33,972 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=681146.6666666666, ans=0.125 2023-10-11 10:43:36,498 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 10:43:56,263 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=681240.0, ans=0.2 2023-10-11 10:44:18,725 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=681333.3333333334, ans=0.125 2023-10-11 10:44:23,488 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=681333.3333333334, ans=0.1 2023-10-11 10:44:28,421 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.15 vs. limit=10.0 2023-10-11 10:44:32,478 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=681380.0, ans=0.125 2023-10-11 10:44:32,661 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.07 vs. limit=6.0 2023-10-11 10:44:36,753 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=681380.0, ans=0.125 2023-10-11 10:44:40,015 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=681426.6666666666, ans=0.0 2023-10-11 10:44:56,343 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=681473.3333333334, ans=0.0 2023-10-11 10:45:04,490 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.358e+02 1.758e+02 1.909e+02 2.214e+02 3.477e+02, threshold=3.818e+02, percent-clipped=0.0 2023-10-11 10:45:10,233 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=681520.0, ans=0.05 2023-10-11 10:45:12,302 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.85 vs. limit=15.0 2023-10-11 10:45:16,589 INFO [train.py:1031] (0/4) Epoch 11, batch 9500, loss[loss=0.2033, simple_loss=0.3, pruned_loss=0.0533, over 16879.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2925, pruned_loss=0.05748, over 32496276.97 frames. 
], batch size: 138, lr: 3.21e-03, grad_scale: 32.0 2023-10-11 10:45:25,000 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=681566.6666666666, ans=0.07 2023-10-11 10:45:27,730 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=681613.3333333334, ans=0.125 2023-10-11 10:45:37,344 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=681613.3333333334, ans=0.125 2023-10-11 10:45:39,228 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 10:45:53,915 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=681706.6666666666, ans=0.125 2023-10-11 10:46:12,963 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=681800.0, ans=0.1 2023-10-11 10:46:16,127 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=681800.0, ans=0.125 2023-10-11 10:46:25,369 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=681846.6666666666, ans=0.125 2023-10-11 10:46:27,344 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=681846.6666666666, ans=0.125 2023-10-11 10:46:38,211 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=681893.3333333334, ans=0.1 2023-10-11 10:46:39,333 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.02 vs. limit=15.0 2023-10-11 10:46:43,974 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=681893.3333333334, ans=0.125 2023-10-11 10:46:56,797 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.727e+02 1.912e+02 2.111e+02 3.049e+02, threshold=3.824e+02, percent-clipped=0.0 2023-10-11 10:47:04,126 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=681986.6666666666, ans=0.125 2023-10-11 10:47:07,634 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.16 vs. 
limit=12.0 2023-10-11 10:47:23,551 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=682080.0, ans=0.125 2023-10-11 10:47:24,704 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=682080.0, ans=0.0 2023-10-11 10:47:52,059 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=682220.0, ans=0.015 2023-10-11 10:48:49,948 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.337e+02 1.657e+02 1.883e+02 2.135e+02 3.290e+02, threshold=3.766e+02, percent-clipped=0.0 2023-10-11 10:48:54,354 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=682453.3333333334, ans=0.2 2023-10-11 10:48:57,031 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=682453.3333333334, ans=0.125 2023-10-11 10:49:18,431 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=682546.6666666666, ans=0.1 2023-10-11 10:49:20,253 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=682546.6666666666, ans=0.02 2023-10-11 10:49:24,283 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.01 vs. limit=22.5 2023-10-11 10:49:40,984 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=682640.0, ans=0.2 2023-10-11 10:49:51,534 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=682686.6666666666, ans=0.125 2023-10-11 10:50:00,566 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=5.06 vs. limit=12.0 2023-10-11 10:50:02,347 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=682733.3333333334, ans=0.1 2023-10-11 10:50:14,379 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.97 vs. 
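
encoder_embed.convnext.layerdrop_rate (ans=0.015 above) reads as a stochastic-depth rate on the embedding's ConvNeXt block: with small probability the block is replaced by the identity for a training step. A sketch under that assumption:

import torch
import torch.nn as nn

class LayerDrop(nn.Module):
    # Skip the wrapped module with probability p, during training only.
    def __init__(self, module, p=0.015):
        super().__init__()
        self.module = module
        self.p = p

    def forward(self, x):
        if self.training and torch.rand(()).item() < self.p:
            return x              # identity: the block is dropped this step
        return self.module(x)
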
limit=10.0 2023-10-11 10:50:31,393 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=682873.3333333334, ans=0.125 2023-10-11 10:50:39,879 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=682873.3333333334, ans=0.125 2023-10-11 10:50:41,370 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.363e+02 1.613e+02 1.788e+02 2.044e+02 2.619e+02, threshold=3.575e+02, percent-clipped=0.0 2023-10-11 10:50:55,176 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=682966.6666666666, ans=0.125 2023-10-11 10:51:12,339 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=683013.3333333334, ans=15.0 2023-10-11 10:51:35,034 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=683153.3333333334, ans=0.0 2023-10-11 10:51:42,583 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=683153.3333333334, ans=0.2 2023-10-11 10:52:11,253 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=683293.3333333334, ans=0.04949747468305833 2023-10-11 10:52:12,847 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=683293.3333333334, ans=0.2 2023-10-11 10:52:14,770 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=683293.3333333334, ans=0.2 2023-10-11 10:52:15,633 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=683293.3333333334, ans=0.2 2023-10-11 10:52:34,172 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.386e+02 1.671e+02 1.881e+02 2.100e+02 2.917e+02, threshold=3.763e+02, percent-clipped=0.0 2023-10-11 10:52:35,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=683386.6666666666, ans=0.0 2023-10-11 10:52:35,849 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=683386.6666666666, ans=0.2 2023-10-11 10:52:37,068 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.49 vs. limit=15.0 2023-10-11 10:53:14,882 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=683526.6666666666, ans=0.0 2023-10-11 10:53:16,188 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.42 vs. 
limit=6.0 2023-10-11 10:53:25,749 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=683573.3333333334, ans=0.125 2023-10-11 10:53:32,201 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=683620.0, ans=0.125 2023-10-11 10:53:34,736 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=683620.0, ans=0.0 2023-10-11 10:53:53,539 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=683713.3333333334, ans=0.125 2023-10-11 10:54:00,215 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=683760.0, ans=0.125 2023-10-11 10:54:02,124 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=683760.0, ans=0.0 2023-10-11 10:54:06,439 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=683760.0, ans=0.2 2023-10-11 10:54:20,532 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.399e+02 1.687e+02 1.856e+02 2.075e+02 3.186e+02, threshold=3.712e+02, percent-clipped=0.0 2023-10-11 10:54:33,182 INFO [train.py:1031] (0/4) Epoch 11, batch 10000, loss[loss=0.24, simple_loss=0.3218, pruned_loss=0.0791, over 16697.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2916, pruned_loss=0.05724, over 32563803.91 frames. ], batch size: 202, lr: 3.21e-03, grad_scale: 32.0 2023-10-11 10:54:58,940 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=683993.3333333334, ans=0.0 2023-10-11 10:55:02,648 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=683993.3333333334, ans=0.0 2023-10-11 10:55:06,770 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=684040.0, ans=0.0 2023-10-11 10:55:28,986 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=684133.3333333334, ans=0.1 2023-10-11 10:55:34,777 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.53 vs. limit=15.0 2023-10-11 10:55:39,187 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.76 vs. limit=6.0 2023-10-11 10:55:42,127 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.44 vs. 
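
The "Epoch 11, batch 10000" summary above reports the total loss together with its two components, the simple (linear-joiner) loss and the pruned transducer loss. The three logged numbers are related by loss = 0.5 * simple_loss + pruned_loss, both for the single batch (0.5 x 0.3218 + 0.0791 = 0.24) and for the running tot_loss; the 0.5 weight below is inferred from those logged values, not read out of this excerpt:

    # Weighted pruned-transducer objective as it appears in this log:
    #   loss = simple_loss_scale * simple_loss + pruned_loss
    SIMPLE_LOSS_SCALE = 0.5  # inferred from the logged numbers

    def total_loss(simple_loss: float, pruned_loss: float) -> float:
        return SIMPLE_LOSS_SCALE * simple_loss + pruned_loss

    # Batch 10000 above: loss=0.24, simple_loss=0.3218, pruned_loss=0.0791
    assert abs(total_loss(0.3218, 0.0791) - 0.24) < 5e-4
    # Running average at the same point: tot_loss[loss=0.2031, ...]
    assert abs(total_loss(0.2916, 0.05724) - 0.2031) < 5e-4

The same identity holds for every summary line later in this section, so the weighting is constant over this stretch of training.
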
limit=6.0 2023-10-11 10:55:47,750 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=684180.0, ans=0.1 2023-10-11 10:55:50,359 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=684226.6666666666, ans=0.5 2023-10-11 10:55:53,898 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=684226.6666666666, ans=0.1 2023-10-11 10:55:54,055 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=684226.6666666666, ans=0.5 2023-10-11 10:56:12,359 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=684273.3333333334, ans=0.0 2023-10-11 10:56:14,497 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.382e+02 1.767e+02 1.968e+02 2.230e+02 3.025e+02, threshold=3.935e+02, percent-clipped=0.0 2023-10-11 10:56:36,050 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=684366.6666666666, ans=0.0 2023-10-11 10:57:04,180 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=4.93 vs. limit=15.0 2023-10-11 10:57:07,034 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.26 vs. limit=10.0 2023-10-11 10:57:55,536 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.03 vs. limit=15.0 2023-10-11 10:58:04,814 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.389e+02 1.810e+02 2.105e+02 2.494e+02 3.949e+02, threshold=4.211e+02, percent-clipped=1.0 2023-10-11 10:58:08,586 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=684786.6666666666, ans=0.0 2023-10-11 10:58:22,232 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.62 vs. 
limit=15.0 2023-10-11 10:58:24,974 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=684833.3333333334, ans=0.125 2023-10-11 10:58:33,669 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 10:58:48,005 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=684926.6666666666, ans=0.2 2023-10-11 10:59:05,030 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=684973.3333333334, ans=0.125 2023-10-11 10:59:06,881 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=684973.3333333334, ans=0.1 2023-10-11 10:59:14,575 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=685020.0, ans=0.0 2023-10-11 10:59:19,838 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=685066.6666666666, ans=0.125 2023-10-11 10:59:34,770 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.62 vs. limit=15.0 2023-10-11 10:59:40,018 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=685113.3333333334, ans=0.1 2023-10-11 10:59:47,959 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 11:00:07,325 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=685206.6666666666, ans=0.1 2023-10-11 11:00:07,426 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=685206.6666666666, ans=0.1 2023-10-11 11:00:07,987 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.320e+02 1.625e+02 1.795e+02 2.091e+02 2.970e+02, threshold=3.591e+02, percent-clipped=0.0 2023-10-11 11:00:10,351 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 11:00:19,925 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=685300.0, ans=0.125 2023-10-11 11:00:28,012 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.07 vs. 
limit=15.0 2023-10-11 11:00:32,876 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=685300.0, ans=0.125 2023-10-11 11:00:32,980 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=685300.0, ans=0.125 2023-10-11 11:00:41,940 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 11:00:48,339 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=685393.3333333334, ans=0.1 2023-10-11 11:00:50,715 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=685393.3333333334, ans=0.0 2023-10-11 11:00:54,223 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=685393.3333333334, ans=0.125 2023-10-11 11:00:56,871 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.55 vs. limit=6.0 2023-10-11 11:01:01,286 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=685440.0, ans=10.0 2023-10-11 11:01:03,063 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=685440.0, ans=10.0 2023-10-11 11:01:07,286 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=685486.6666666666, ans=0.125 2023-10-11 11:01:27,917 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=685580.0, ans=0.0 2023-10-11 11:01:33,772 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=685580.0, ans=0.0 2023-10-11 11:01:33,800 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=685580.0, ans=0.2 2023-10-11 11:01:49,416 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=685673.3333333334, ans=0.0 2023-10-11 11:01:50,287 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=685673.3333333334, ans=0.125 2023-10-11 11:01:58,905 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.386e+02 1.695e+02 1.860e+02 2.112e+02 3.160e+02, threshold=3.721e+02, percent-clipped=0.0 2023-10-11 11:01:59,254 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=685720.0, ans=0.0 2023-10-11 11:02:25,079 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=685813.3333333334, ans=0.0 2023-10-11 11:02:26,135 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=685813.3333333334, ans=0.125 2023-10-11 11:02:43,278 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.73 vs. 
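
The "Whitening:" entries above compare a measured whiteness statistic of a layer's activations against the limit beyond which a corrective penalty is applied (e.g. metric=4.55 vs. limit=6.0 for a set of attention keys just above). A natural statistic with the logged behaviour, equal to 1 for perfectly decorrelated, equal-variance channels and growing with the eigenvalue spread of the channel covariance, is D * trace(C^2) / trace(C)^2. The sketch below uses that form as an assumption about what "metric" measures; the exact formula in scaling.py may be normalized differently:

    import numpy as np

    def whitening_metric(x: np.ndarray, num_groups: int = 1) -> float:
        """Whiteness statistic of activations x with shape (frames, channels).

        Assumed form: for each channel group with covariance C,
            metric = D * trace(C @ C) / trace(C) ** 2
        which is 1.0 when C is a multiple of the identity (fully "white")
        and grows as the eigenvalues of C spread apart.
        """
        frames, channels = x.shape
        d = channels // num_groups
        metrics = []
        for g in range(num_groups):
            xg = x[:, g * d:(g + 1) * d]
            c = xg.T @ xg / frames                  # (d, d) covariance
            metrics.append(d * np.trace(c @ c) / np.trace(c) ** 2)
        return float(np.mean(metrics))

    rng = np.random.default_rng(0)
    white = rng.standard_normal((10000, 192))       # ~identity covariance
    skewed = white * np.linspace(0.1, 3.0, 192)     # very unequal variances
    print(whitening_metric(white))    # close to 1.0
    print(whitening_metric(skewed))   # substantially larger

Under this reading, small logged metrics (well under the limit) mean the layer's features are already close to white, while large excursions like the metric=17.89 vs. limit=22.5 entries flag layers whose activations have become strongly correlated or dominated by a few directions.
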
limit=15.0 2023-10-11 11:02:50,733 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.43 vs. limit=15.0 2023-10-11 11:03:11,566 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=685953.3333333334, ans=0.1 2023-10-11 11:03:32,841 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=686046.6666666666, ans=0.125 2023-10-11 11:03:36,834 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=686093.3333333334, ans=0.0 2023-10-11 11:03:54,562 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 11:03:56,127 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.260e+02 1.652e+02 1.811e+02 2.036e+02 3.092e+02, threshold=3.622e+02, percent-clipped=0.0 2023-10-11 11:03:57,285 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=686186.6666666666, ans=0.05 2023-10-11 11:04:07,745 INFO [train.py:1031] (0/4) Epoch 11, batch 10500, loss[loss=0.2572, simple_loss=0.3184, pruned_loss=0.09805, over 15615.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2922, pruned_loss=0.05734, over 32625993.47 frames. ], batch size: 350, lr: 3.20e-03, grad_scale: 32.0 2023-10-11 11:04:20,592 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff3.min_abs, batch_count=686280.0, ans=0.2 2023-10-11 11:04:24,661 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=10.94 vs. limit=12.0 2023-10-11 11:04:26,276 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=686280.0, ans=0.2 2023-10-11 11:04:44,892 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=686373.3333333334, ans=0.2 2023-10-11 11:04:45,960 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=686373.3333333334, ans=0.1 2023-10-11 11:05:01,658 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=686466.6666666666, ans=0.1 2023-10-11 11:05:01,771 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=686466.6666666666, ans=0.125 2023-10-11 11:05:13,777 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=686513.3333333334, ans=0.0 2023-10-11 11:05:51,562 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.715e+02 1.897e+02 2.272e+02 3.484e+02, threshold=3.793e+02, percent-clipped=0.0 2023-10-11 11:06:04,159 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.85 vs. 
limit=6.0 2023-10-11 11:06:23,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=686746.6666666666, ans=0.0 2023-10-11 11:06:25,665 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=686746.6666666666, ans=0.0 2023-10-11 11:06:41,233 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=686793.3333333334, ans=0.125 2023-10-11 11:07:11,503 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=686933.3333333334, ans=0.125 2023-10-11 11:07:19,708 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=686933.3333333334, ans=0.0 2023-10-11 11:07:22,100 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=686980.0, ans=0.2 2023-10-11 11:07:42,591 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=687026.6666666666, ans=0.125 2023-10-11 11:07:48,296 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=687073.3333333334, ans=0.125 2023-10-11 11:07:56,613 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.358e+02 1.681e+02 1.798e+02 2.037e+02 2.693e+02, threshold=3.596e+02, percent-clipped=0.0 2023-10-11 11:08:08,441 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=687166.6666666666, ans=0.0 2023-10-11 11:08:15,634 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=687166.6666666666, ans=0.1 2023-10-11 11:08:20,811 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=687213.3333333334, ans=15.0 2023-10-11 11:08:21,558 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=687213.3333333334, ans=0.2 2023-10-11 11:08:25,680 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=687213.3333333334, ans=0.125 2023-10-11 11:08:25,702 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=687213.3333333334, ans=0.025 2023-10-11 11:08:29,333 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.00 vs. limit=15.0 2023-10-11 11:08:32,223 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.74 vs. 
limit=22.5 2023-10-11 11:08:49,928 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=687306.6666666666, ans=0.125 2023-10-11 11:09:03,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=687353.3333333334, ans=0.2 2023-10-11 11:09:07,164 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=687400.0, ans=0.2 2023-10-11 11:09:08,062 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=687400.0, ans=0.125 2023-10-11 11:09:31,968 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 11:09:34,970 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=687493.3333333334, ans=0.125 2023-10-11 11:09:37,813 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=687493.3333333334, ans=0.1 2023-10-11 11:09:47,311 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=687540.0, ans=0.125 2023-10-11 11:09:48,153 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=687540.0, ans=0.125 2023-10-11 11:09:50,388 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.329e+02 1.729e+02 1.941e+02 2.214e+02 2.848e+02, threshold=3.881e+02, percent-clipped=0.0 2023-10-11 11:10:08,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=687633.3333333334, ans=0.125 2023-10-11 11:10:16,463 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=687680.0, ans=0.1 2023-10-11 11:10:31,347 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=687726.6666666666, ans=0.05 2023-10-11 11:10:37,647 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 11:10:46,494 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.07 vs. limit=15.0 2023-10-11 11:11:26,641 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.47 vs. 
limit=10.0 2023-10-11 11:11:27,382 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=687960.0, ans=0.125 2023-10-11 11:11:38,146 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=688006.6666666666, ans=0.125 2023-10-11 11:11:38,890 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.399e+02 1.662e+02 1.873e+02 2.031e+02 2.688e+02, threshold=3.747e+02, percent-clipped=0.0 2023-10-11 11:11:49,907 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=688053.3333333334, ans=0.125 2023-10-11 11:12:07,008 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=688146.6666666666, ans=0.1 2023-10-11 11:12:16,968 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=688193.3333333334, ans=0.07 2023-10-11 11:13:06,698 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=688426.6666666666, ans=0.0 2023-10-11 11:13:15,844 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=688473.3333333334, ans=0.1 2023-10-11 11:13:20,187 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=688473.3333333334, ans=0.125 2023-10-11 11:13:27,341 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.415e+02 1.652e+02 1.849e+02 2.074e+02 2.989e+02, threshold=3.698e+02, percent-clipped=0.0 2023-10-11 11:13:36,287 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=688520.0, ans=0.2 2023-10-11 11:13:40,362 INFO [train.py:1031] (0/4) Epoch 11, batch 11000, loss[loss=0.2074, simple_loss=0.3084, pruned_loss=0.05315, over 16568.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2922, pruned_loss=0.05753, over 32625332.46 frames. ], batch size: 219, lr: 3.20e-03, grad_scale: 32.0 2023-10-11 11:13:41,514 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 11:13:47,890 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=688566.6666666666, ans=0.125 2023-10-11 11:13:53,557 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.17 vs. 
limit=15.0 2023-10-11 11:13:57,575 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=688613.3333333334, ans=0.0 2023-10-11 11:14:08,971 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=688660.0, ans=0.125 2023-10-11 11:14:24,679 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=688753.3333333334, ans=0.2 2023-10-11 11:14:41,358 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=688800.0, ans=0.05 2023-10-11 11:15:10,492 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=688893.3333333334, ans=0.125 2023-10-11 11:15:13,180 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 11:15:19,810 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=688940.0, ans=0.0 2023-10-11 11:15:23,128 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.346e+02 1.777e+02 1.963e+02 2.195e+02 3.096e+02, threshold=3.925e+02, percent-clipped=0.0 2023-10-11 11:15:32,586 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=688986.6666666666, ans=0.0 2023-10-11 11:16:06,182 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=689126.6666666666, ans=0.125 2023-10-11 11:16:33,197 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.78 vs. limit=22.5 2023-10-11 11:16:35,238 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=689220.0, ans=0.125 2023-10-11 11:17:06,365 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=689360.0, ans=0.0 2023-10-11 11:17:13,602 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=689406.6666666666, ans=0.125 2023-10-11 11:17:19,495 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=689406.6666666666, ans=0.0 2023-10-11 11:17:22,263 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=689406.6666666666, ans=0.0 2023-10-11 11:17:23,644 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.28 vs. limit=15.0 2023-10-11 11:17:24,452 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.311e+02 1.604e+02 1.777e+02 1.987e+02 3.348e+02, threshold=3.553e+02, percent-clipped=0.0 2023-10-11 11:17:29,013 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.07 vs. limit=6.0 2023-10-11 11:17:38,837 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.75 vs. 
limit=15.0 2023-10-11 11:17:39,540 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=689500.0, ans=0.04949747468305833 2023-10-11 11:18:02,618 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=689593.3333333334, ans=0.125 2023-10-11 11:18:06,535 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=689593.3333333334, ans=0.125 2023-10-11 11:18:13,792 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=689640.0, ans=0.125 2023-10-11 11:18:20,818 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=689686.6666666666, ans=0.125 2023-10-11 11:18:26,030 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=689686.6666666666, ans=0.035 2023-10-11 11:18:33,710 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=689733.3333333334, ans=0.1 2023-10-11 11:19:03,632 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.27 vs. limit=15.0 2023-10-11 11:19:12,015 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=689873.3333333334, ans=0.1 2023-10-11 11:19:19,577 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.377e+02 1.634e+02 1.766e+02 1.938e+02 2.673e+02, threshold=3.533e+02, percent-clipped=0.0 2023-10-11 11:19:45,411 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=690013.3333333334, ans=0.0 2023-10-11 11:19:51,221 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=690013.3333333334, ans=0.0 2023-10-11 11:20:04,725 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.15 vs. limit=15.0 2023-10-11 11:20:07,476 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.67 vs. limit=22.5 2023-10-11 11:20:10,170 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=690106.6666666666, ans=0.125 2023-10-11 11:20:12,638 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.41 vs. 
limit=22.5 2023-10-11 11:20:13,316 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=690106.6666666666, ans=0.015 2023-10-11 11:20:27,563 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=690153.3333333334, ans=0.2 2023-10-11 11:20:50,382 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=690246.6666666666, ans=0.125 2023-10-11 11:20:56,878 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=690293.3333333334, ans=0.125 2023-10-11 11:21:02,893 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=690340.0, ans=0.2 2023-10-11 11:21:07,526 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=690340.0, ans=0.09899494936611666 2023-10-11 11:21:14,219 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.349e+02 1.664e+02 1.841e+02 2.018e+02 2.858e+02, threshold=3.681e+02, percent-clipped=0.0 2023-10-11 11:21:20,378 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=690386.6666666666, ans=0.2 2023-10-11 11:21:21,212 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=690386.6666666666, ans=0.2 2023-10-11 11:22:21,263 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=690666.6666666666, ans=0.0 2023-10-11 11:22:44,316 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.85 vs. limit=15.0 2023-10-11 11:22:45,193 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=690760.0, ans=0.1 2023-10-11 11:22:48,492 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=690760.0, ans=0.2 2023-10-11 11:23:04,493 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.753e+02 1.926e+02 2.155e+02 2.473e+02, threshold=3.852e+02, percent-clipped=0.0 2023-10-11 11:23:04,865 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=690853.3333333334, ans=0.125 2023-10-11 11:23:14,810 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=690900.0, ans=0.2 2023-10-11 11:23:15,356 INFO [train.py:1031] (0/4) Epoch 11, batch 11500, loss[loss=0.1925, simple_loss=0.2899, pruned_loss=0.0475, over 16958.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.2918, pruned_loss=0.05731, over 32648474.58 frames. ], batch size: 72, lr: 3.19e-03, grad_scale: 32.0 2023-10-11 11:23:24,151 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.61 vs. 
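
The learning rate logged with each summary line decays only slightly across this stretch (3.21e-03 at batch 10000 above, 3.19e-03 at batch 11500, 3.18e-03 by batch 12500 below), i.e. the schedule is in a slow power-law tail rather than in warm-up. A hedged sketch of a schedule with that behaviour; the functional form, constants, and argument names are assumptions for illustration, not read out of the training code:

    def power_decay_lr(base_lr: float, batch: int, epoch: float,
                       lr_batches: float, lr_epochs: float) -> float:
        """Smooth inverse-power decay in both batch and epoch counts.

        Early on (batch << lr_batches) the batch factor is ~1; late in
        training the product decays slowly, so nearby batches get nearly
        the same lr, matching the small drift in the log above.
        """
        batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
        epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
        return base_lr * batch_factor * epoch_factor

    # All constants hypothetical; the point is the tiny relative drift:
    lr1 = power_decay_lr(0.01, 690000, 11.0, 5000.0, 3.5)
    lr2 = power_decay_lr(0.01, 692500, 11.0, 5000.0, 3.5)
    print(lr1, lr2, 1 - lr2 / lr1)   # drift well under 1% over 2500 batches
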
limit=15.0 2023-10-11 11:23:31,371 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=690946.6666666666, ans=0.1 2023-10-11 11:23:32,888 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=690946.6666666666, ans=0.125 2023-10-11 11:23:32,932 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=690946.6666666666, ans=0.1 2023-10-11 11:23:35,882 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=690993.3333333334, ans=0.0 2023-10-11 11:23:55,405 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=691040.0, ans=0.1 2023-10-11 11:23:57,312 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.34 vs. limit=15.0 2023-10-11 11:25:00,167 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=691320.0, ans=0.125 2023-10-11 11:25:00,855 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.295e+02 1.654e+02 1.866e+02 2.027e+02 2.889e+02, threshold=3.733e+02, percent-clipped=0.0 2023-10-11 11:25:21,311 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=691366.6666666666, ans=0.0 2023-10-11 11:25:28,567 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=691413.3333333334, ans=0.035 2023-10-11 11:26:13,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=691600.0, ans=0.0 2023-10-11 11:26:26,702 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=691646.6666666666, ans=0.125 2023-10-11 11:26:40,993 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.63 vs. limit=10.0 2023-10-11 11:26:54,794 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.452e+02 1.728e+02 1.884e+02 2.043e+02 2.720e+02, threshold=3.767e+02, percent-clipped=0.0 2023-10-11 11:27:11,026 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=691833.3333333334, ans=0.1 2023-10-11 11:27:27,874 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=691926.6666666666, ans=0.07 2023-10-11 11:27:48,578 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=692020.0, ans=0.0 2023-10-11 11:27:58,367 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=692020.0, ans=0.125 2023-10-11 11:28:15,696 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=692113.3333333334, ans=0.125 2023-10-11 11:28:26,941 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.65 vs. 
limit=15.0 2023-10-11 11:28:33,132 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=692160.0, ans=15.0 2023-10-11 11:28:57,556 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.357e+02 1.693e+02 1.881e+02 2.177e+02 3.153e+02, threshold=3.761e+02, percent-clipped=0.0 2023-10-11 11:28:58,974 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=692253.3333333334, ans=0.125 2023-10-11 11:29:03,868 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=692253.3333333334, ans=0.2 2023-10-11 11:29:30,578 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.24 vs. limit=15.0 2023-10-11 11:29:56,033 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.96 vs. limit=15.0 2023-10-11 11:30:02,356 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=692486.6666666666, ans=0.125 2023-10-11 11:30:05,192 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=692486.6666666666, ans=0.0 2023-10-11 11:30:21,140 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=692580.0, ans=0.125 2023-10-11 11:30:23,919 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.08 vs. limit=15.0 2023-10-11 11:30:34,688 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.89 vs. limit=22.5 2023-10-11 11:30:55,119 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.414e+02 1.618e+02 1.758e+02 1.943e+02 2.614e+02, threshold=3.515e+02, percent-clipped=0.0 2023-10-11 11:30:55,768 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=692720.0, ans=0.125 2023-10-11 11:31:04,655 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=692720.0, ans=0.05 2023-10-11 11:31:08,293 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=692766.6666666666, ans=0.5 2023-10-11 11:31:10,238 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=692766.6666666666, ans=0.125 2023-10-11 11:31:10,284 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=692766.6666666666, ans=10.0 2023-10-11 11:31:17,357 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=692813.3333333334, ans=0.2 2023-10-11 11:31:19,929 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=10.03 vs. 
limit=15.0 2023-10-11 11:31:44,763 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=692906.6666666666, ans=0.1 2023-10-11 11:32:09,100 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=693000.0, ans=0.0 2023-10-11 11:32:20,528 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=693046.6666666666, ans=0.125 2023-10-11 11:32:32,125 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=693093.3333333334, ans=0.125 2023-10-11 11:32:46,679 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.365e+02 1.673e+02 1.812e+02 2.109e+02 2.831e+02, threshold=3.623e+02, percent-clipped=0.0 2023-10-11 11:32:56,949 INFO [train.py:1031] (0/4) Epoch 11, batch 12000, loss[loss=0.1708, simple_loss=0.2507, pruned_loss=0.04546, over 15374.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.2918, pruned_loss=0.05694, over 32709738.08 frames. ], batch size: 35, lr: 3.19e-03, grad_scale: 32.0 2023-10-11 11:33:02,478 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.38 vs. limit=12.0 2023-10-11 11:33:20,602 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=693326.6666666666, ans=0.0 2023-10-11 11:33:25,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=693326.6666666666, ans=0.125 2023-10-11 11:34:08,006 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.11 vs. limit=15.0 2023-10-11 11:34:19,992 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=693560.0, ans=0.125 2023-10-11 11:34:22,901 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=693560.0, ans=0.2 2023-10-11 11:34:35,808 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=693606.6666666666, ans=0.0 2023-10-11 11:34:38,751 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=693606.6666666666, ans=0.2 2023-10-11 11:34:41,061 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.342e+02 1.693e+02 1.835e+02 2.193e+02 3.345e+02, threshold=3.671e+02, percent-clipped=0.0 2023-10-11 11:34:46,932 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.98 vs. limit=15.0 2023-10-11 11:34:48,187 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=693653.3333333334, ans=0.1 2023-10-11 11:34:51,401 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.25 vs. limit=15.0 2023-10-11 11:34:57,070 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=14.11 vs. 
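
The per-batch "batch size" in the summary lines swings widely (202 at batch 10000, 350 at 10500, 219 at 11000, 72 at 11500, and 35 in the batch 12000 entry above) while the per-batch frame counts stay in a similar range: batches are packed by total audio duration rather than by a fixed number of utterances, so many short cuts or a few long ones fill one batch. A toy sketch of duration-budgeted batching; the Cut representation is made up, and the 700-second budget is chosen so the arithmetic lands on the 350 and 35 seen above:

    from dataclasses import dataclass
    from typing import Iterable, Iterator, List

    @dataclass
    class Cut:
        id: str
        duration: float  # seconds

    def duration_batches(cuts: Iterable[Cut],
                         max_duration: float) -> Iterator[List[Cut]]:
        """Group cuts so each batch's total duration stays under max_duration.

        Toy version of duration-bucketed sampling: short cuts give large
        batch sizes, long cuts give small ones, as in the log above.
        """
        batch: List[Cut] = []
        total = 0.0
        for cut in cuts:
            if batch and total + cut.duration > max_duration:
                yield batch
                batch, total = [], 0.0
            batch.append(cut)
            total += cut.duration
        if batch:
            yield batch

    short = [Cut(f"s{i}", 2.0) for i in range(400)]
    long = [Cut(f"l{i}", 20.0) for i in range(40)]
    print([len(b) for b in duration_batches(short, 700.0)][0])  # 350 cuts/batch
    print([len(b) for b in duration_batches(long, 700.0)][0])   # 35 cuts/batch

Packing by duration keeps memory use roughly constant per step even though the utterance count varies by an order of magnitude, which is why the logged frame totals per batch stay comparable across these summary lines.
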
limit=15.0 2023-10-11 11:35:01,288 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=693746.6666666666, ans=0.125 2023-10-11 11:35:13,480 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 11:35:13,502 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=693793.3333333334, ans=0.0 2023-10-11 11:35:14,617 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.71 vs. limit=12.0 2023-10-11 11:35:19,300 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-10-11 11:35:29,555 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=693840.0, ans=0.125 2023-10-11 11:35:36,539 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.77 vs. limit=15.0 2023-10-11 11:35:47,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=693933.3333333334, ans=0.04949747468305833 2023-10-11 11:35:51,469 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=693980.0, ans=0.125 2023-10-11 11:35:51,623 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=693980.0, ans=0.05 2023-10-11 11:36:07,984 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=694026.6666666666, ans=0.125 2023-10-11 11:36:15,380 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 11:36:25,976 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.647e+02 1.757e+02 1.953e+02 2.933e+02, threshold=3.514e+02, percent-clipped=0.0 2023-10-11 11:36:43,135 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=694166.6666666666, ans=0.125 2023-10-11 11:36:47,382 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=694213.3333333334, ans=0.125 2023-10-11 11:36:50,993 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=1.526e-02 2023-10-11 11:36:51,972 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=694213.3333333334, ans=0.2 2023-10-11 11:37:13,019 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=694306.6666666666, ans=0.125 2023-10-11 11:37:18,081 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=694353.3333333334, ans=0.0 2023-10-11 11:37:23,031 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.67 vs. 
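
The WithLoss entries report a running sum of an auxiliary penalty attached to the named attention-weight tensors; the sum is 0.000e+00 while the activations stay inside their allowed region and, plausibly, goes positive when they stray outside it (loss-sum=1.526e-02 just above). One common way to attach such a penalty without changing the forward computation is an identity autograd function that injects the penalty's gradient during backward. The sketch below shows that pattern under the assumption that the penalty is a hinge on activation magnitude; the actual penalty in scaling.py may differ:

    import torch

    class IdentityWithPenalty(torch.autograd.Function):
        """Return x unchanged, but add the gradient of
        mean(relu(|x| - limit)) to x's gradient in backward.

        Illustrates the "attach a loss to a tensor" pattern: the forward
        output and the main loss are untouched, only gradients change.
        """

        @staticmethod
        def forward(ctx, x: torch.Tensor, limit: float) -> torch.Tensor:
            ctx.save_for_backward(x)
            ctx.limit = limit
            return x

        @staticmethod
        def backward(ctx, grad_output: torch.Tensor):
            (x,) = ctx.saved_tensors
            with torch.enable_grad():
                xd = x.detach().requires_grad_(True)
                penalty = (xd.abs() - ctx.limit).clamp(min=0.0).mean()
                (penalty_grad,) = torch.autograd.grad(penalty, xd)
            return grad_output + penalty_grad, None

    x = (50.0 * torch.randn(4, 8)).requires_grad_(True)
    y = IdentityWithPenalty.apply(x, 30.0)
    y.sum().backward()
    # Entries with |x| > 30 pick up an extra sign(x)/32 gradient term from
    # the penalty, which gradient descent uses to pull them back inside.

Under this reading, a logged loss-sum of exactly zero simply means the watched tensor never left its allowed region during the logging interval.
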
limit=12.0 2023-10-11 11:37:46,770 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.71 vs. limit=12.0 2023-10-11 11:37:47,368 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=694493.3333333334, ans=0.2 2023-10-11 11:37:47,527 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=694493.3333333334, ans=0.125 2023-10-11 11:37:49,315 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=694493.3333333334, ans=0.125 2023-10-11 11:38:00,156 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=694540.0, ans=0.2 2023-10-11 11:38:12,450 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.04 vs. limit=22.5 2023-10-11 11:38:12,812 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.474e+02 1.687e+02 1.908e+02 2.116e+02 2.786e+02, threshold=3.816e+02, percent-clipped=0.0 2023-10-11 11:38:15,189 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=694586.6666666666, ans=0.125 2023-10-11 11:38:34,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=694680.0, ans=0.2 2023-10-11 11:38:39,022 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.37 vs. limit=6.0 2023-10-11 11:38:49,285 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 11:39:01,684 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=694773.3333333334, ans=0.07 2023-10-11 11:39:17,788 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=694866.6666666666, ans=0.125 2023-10-11 11:39:17,892 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=694866.6666666666, ans=0.125 2023-10-11 11:39:18,179 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.22 vs. 
limit=22.5 2023-10-11 11:39:35,544 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=694913.3333333334, ans=0.0 2023-10-11 11:39:54,691 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten.whitening_limit, batch_count=694960.0, ans=15.0 2023-10-11 11:40:03,777 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=695006.6666666666, ans=0.0 2023-10-11 11:40:04,624 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=695006.6666666666, ans=0.2 2023-10-11 11:40:08,153 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.342e+02 1.753e+02 1.944e+02 2.242e+02 3.448e+02, threshold=3.888e+02, percent-clipped=0.0 2023-10-11 11:40:17,122 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=695053.3333333334, ans=0.125 2023-10-11 11:40:44,768 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=695193.3333333334, ans=0.2 2023-10-11 11:41:05,160 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=695286.6666666666, ans=0.125 2023-10-11 11:41:15,560 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=695333.3333333334, ans=0.0 2023-10-11 11:41:20,919 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=695333.3333333334, ans=0.5 2023-10-11 11:41:22,044 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=695333.3333333334, ans=0.0 2023-10-11 11:41:32,409 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.57 vs. limit=10.0 2023-10-11 11:41:43,120 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.40 vs. limit=12.0 2023-10-11 11:41:48,550 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=695473.3333333334, ans=0.0 2023-10-11 11:41:48,659 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=695473.3333333334, ans=0.125 2023-10-11 11:42:02,732 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.707e+02 1.866e+02 2.136e+02 2.912e+02, threshold=3.731e+02, percent-clipped=0.0 2023-10-11 11:42:05,002 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=695520.0, ans=0.125 2023-10-11 11:42:11,868 INFO [train.py:1031] (0/4) Epoch 11, batch 12500, loss[loss=0.2092, simple_loss=0.3012, pruned_loss=0.05864, over 16011.00 frames. ], tot_loss[loss=0.2027, simple_loss=0.2916, pruned_loss=0.05692, over 32732758.92 frames. ], batch size: 43, lr: 3.18e-03, grad_scale: 16.0 2023-10-11 11:42:26,833 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.39 vs. 
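
The grad_scale field in the summary lines tracks fp16 dynamic loss scaling: gradients are computed on scale * loss, the scale is halved (and the step skipped) when an overflow is detected, and it is doubled again after a run of clean steps. That pattern is visible here: grad_scale is 32.0 through batch 12000, 16.0 in the batch 12500 entry just above, and back at 32.0 by batch 13000 further below. A minimal sketch of such a scaler; the constants are illustrative and real AMP scalers such as torch.cuda.amp.GradScaler carry more state:

    class DynamicLossScaler:
        """Halve the scale on overflow, double it after growth_interval
        consecutive clean steps: the classic fp16 loss-scaling scheme."""

        def __init__(self, init_scale: float = 32.0,
                     growth_interval: int = 2000):
            self.scale = init_scale
            self.growth_interval = growth_interval
            self._clean_steps = 0

        def update(self, found_overflow: bool) -> None:
            if found_overflow:
                self.scale /= 2.0       # back off; the step is skipped
                self._clean_steps = 0
            else:
                self._clean_steps += 1
                if self._clean_steps >= self.growth_interval:
                    self.scale *= 2.0   # try a larger scale again
                    self._clean_steps = 0

    scaler = DynamicLossScaler(init_scale=32.0)
    scaler.update(found_overflow=True)     # -> 16.0, as at batch 12500
    for _ in range(2000):
        scaler.update(found_overflow=False)
    print(scaler.scale)                    # back to 32.0, as at batch 13000
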
limit=15.0 2023-10-11 11:42:28,110 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=695613.3333333334, ans=0.125 2023-10-11 11:42:43,732 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=695706.6666666666, ans=0.1 2023-10-11 11:43:01,048 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=695753.3333333334, ans=0.0 2023-10-11 11:43:24,570 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=695893.3333333334, ans=0.125 2023-10-11 11:43:24,603 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=695893.3333333334, ans=0.125 2023-10-11 11:43:43,457 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=695940.0, ans=0.1 2023-10-11 11:43:48,899 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.705e+02 1.875e+02 2.066e+02 3.692e+02, threshold=3.749e+02, percent-clipped=0.0 2023-10-11 11:43:52,601 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=695986.6666666666, ans=0.0 2023-10-11 11:44:11,858 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_ff3.min_abs, batch_count=696080.0, ans=0.2 2023-10-11 11:44:12,134 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.88 vs. limit=6.0 2023-10-11 11:44:35,693 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=696173.3333333334, ans=0.125 2023-10-11 11:44:42,548 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=696173.3333333334, ans=0.125 2023-10-11 11:44:46,018 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.09 vs. limit=15.0 2023-10-11 11:45:10,171 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=696313.3333333334, ans=0.0 2023-10-11 11:45:14,622 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.71 vs. 
limit=6.0 2023-10-11 11:45:16,077 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=696313.3333333334, ans=0.1 2023-10-11 11:45:31,239 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=696406.6666666666, ans=0.125 2023-10-11 11:45:48,895 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.671e+02 1.811e+02 1.977e+02 2.636e+02, threshold=3.621e+02, percent-clipped=0.0 2023-10-11 11:45:58,759 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=696500.0, ans=0.0 2023-10-11 11:46:10,337 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=696546.6666666666, ans=0.025 2023-10-11 11:46:32,208 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.07 vs. limit=22.5 2023-10-11 11:46:44,884 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=696686.6666666666, ans=0.125 2023-10-11 11:47:03,233 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.93 vs. limit=15.0 2023-10-11 11:47:06,753 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=696780.0, ans=0.0 2023-10-11 11:47:09,125 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=696780.0, ans=0.0 2023-10-11 11:47:12,942 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=696780.0, ans=0.0 2023-10-11 11:47:33,344 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=696873.3333333334, ans=0.05 2023-10-11 11:47:37,322 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.44 vs. limit=15.0 2023-10-11 11:47:38,661 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=696920.0, ans=0.125 2023-10-11 11:47:39,267 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.753e+02 1.969e+02 2.182e+02 3.072e+02, threshold=3.939e+02, percent-clipped=0.0 2023-10-11 11:47:50,376 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=696966.6666666666, ans=0.0 2023-10-11 11:47:58,715 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.53 vs. limit=6.0 2023-10-11 11:48:06,650 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=697013.3333333334, ans=0.0 2023-10-11 11:48:11,578 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.26 vs. 
limit=15.0 2023-10-11 11:48:13,921 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=697060.0, ans=0.125 2023-10-11 11:48:24,897 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=697106.6666666666, ans=0.0 2023-10-11 11:48:27,122 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.36 vs. limit=15.0 2023-10-11 11:48:27,602 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=697106.6666666666, ans=0.125 2023-10-11 11:48:48,638 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.73 vs. limit=22.5 2023-10-11 11:48:52,331 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=697200.0, ans=0.125 2023-10-11 11:48:57,559 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=697246.6666666666, ans=0.125 2023-10-11 11:49:03,315 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.49 vs. limit=6.0 2023-10-11 11:49:04,366 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.07 vs. limit=15.0 2023-10-11 11:49:21,325 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=697340.0, ans=0.125 2023-10-11 11:49:31,469 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.428e+02 1.670e+02 1.834e+02 2.087e+02 3.451e+02, threshold=3.668e+02, percent-clipped=0.0 2023-10-11 11:49:32,137 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.31 vs. limit=12.0 2023-10-11 11:49:35,451 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=697386.6666666666, ans=0.125 2023-10-11 11:49:39,099 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=697433.3333333334, ans=0.1 2023-10-11 11:49:46,081 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.77 vs. limit=15.0 2023-10-11 11:49:57,379 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=697480.0, ans=0.125 2023-10-11 11:49:57,518 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.28 vs. limit=10.0 2023-10-11 11:50:00,505 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.47 vs. 
limit=5.0 2023-10-11 11:50:07,867 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=697526.6666666666, ans=0.0 2023-10-11 11:50:14,917 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=697573.3333333334, ans=0.0 2023-10-11 11:50:46,316 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=697713.3333333334, ans=0.2 2023-10-11 11:50:47,166 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=697713.3333333334, ans=0.1 2023-10-11 11:50:55,942 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=697760.0, ans=0.125 2023-10-11 11:51:02,134 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=697760.0, ans=0.0 2023-10-11 11:51:17,072 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=697853.3333333334, ans=0.0 2023-10-11 11:51:18,493 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.741e+02 1.907e+02 2.098e+02 3.619e+02, threshold=3.814e+02, percent-clipped=0.0 2023-10-11 11:51:20,555 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=697853.3333333334, ans=0.0 2023-10-11 11:51:26,411 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.29 vs. limit=15.0 2023-10-11 11:51:26,934 INFO [train.py:1031] (0/4) Epoch 11, batch 13000, loss[loss=0.202, simple_loss=0.2842, pruned_loss=0.0599, over 16624.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2922, pruned_loss=0.05702, over 32761618.22 frames. ], batch size: 56, lr: 3.18e-03, grad_scale: 32.0 2023-10-11 11:51:27,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=697900.0, ans=0.0 2023-10-11 11:51:35,112 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.05 vs. 
limit=15.0 2023-10-11 11:52:05,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=698040.0, ans=0.125 2023-10-11 11:52:17,515 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=698086.6666666666, ans=0.0 2023-10-11 11:52:41,866 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=698133.3333333334, ans=0.1 2023-10-11 11:52:57,934 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=698226.6666666666, ans=0.0 2023-10-11 11:53:07,131 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=698273.3333333334, ans=0.0 2023-10-11 11:53:07,194 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=698273.3333333334, ans=0.0 2023-10-11 11:53:09,032 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=698273.3333333334, ans=0.2 2023-10-11 11:53:11,323 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.91 vs. limit=15.0 2023-10-11 11:53:21,066 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.60 vs. limit=10.0 2023-10-11 11:53:22,205 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.324e+02 1.683e+02 1.884e+02 2.147e+02 2.915e+02, threshold=3.768e+02, percent-clipped=0.0 2023-10-11 11:53:22,591 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=698320.0, ans=0.125 2023-10-11 11:53:27,338 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=698320.0, ans=0.0 2023-10-11 11:53:28,336 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=698320.0, ans=0.125 2023-10-11 11:53:36,886 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=698366.6666666666, ans=0.0 2023-10-11 11:54:34,119 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 11:54:35,072 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=698600.0, ans=0.1 2023-10-11 11:54:36,769 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=698600.0, ans=0.125 2023-10-11 11:54:44,080 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.61 vs. limit=6.0 2023-10-11 11:54:45,637 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=698646.6666666666, ans=0.0 2023-10-11 11:55:14,621 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.08 vs. 
limit=15.0 2023-10-11 11:55:18,793 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.310e+02 1.598e+02 1.748e+02 1.943e+02 2.768e+02, threshold=3.496e+02, percent-clipped=0.0 2023-10-11 11:55:21,468 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.37 vs. limit=15.0 2023-10-11 11:55:25,335 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-10-11 11:55:33,290 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=698833.3333333334, ans=0.125 2023-10-11 11:55:37,000 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=698833.3333333334, ans=0.2 2023-10-11 11:55:41,192 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=698880.0, ans=0.125 2023-10-11 11:55:52,033 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=698926.6666666666, ans=0.2 2023-10-11 11:56:01,792 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.95 vs. limit=15.0 2023-10-11 11:56:10,113 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=699020.0, ans=10.0 2023-10-11 11:57:10,139 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.59 vs. limit=15.0 2023-10-11 11:57:10,603 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.322e+02 1.665e+02 1.851e+02 2.207e+02 3.448e+02, threshold=3.701e+02, percent-clipped=0.0 2023-10-11 11:57:31,738 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=699346.6666666666, ans=0.125 2023-10-11 11:57:32,647 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=699346.6666666666, ans=0.125 2023-10-11 11:57:32,718 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=699346.6666666666, ans=0.0 2023-10-11 11:57:40,222 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=699346.6666666666, ans=0.125 2023-10-11 11:57:40,288 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=699346.6666666666, ans=0.125 2023-10-11 11:57:52,065 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=699393.3333333334, ans=0.0 2023-10-11 11:57:52,098 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=699393.3333333334, ans=0.0 2023-10-11 11:57:54,581 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=699440.0, ans=0.1 2023-10-11 11:58:38,691 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=699626.6666666666, ans=0.0 2023-10-11 11:58:40,098 INFO [scaling.py:979] (0/4) 
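A pattern worth noting in the optim.py lines: the five grad-norm quartiles read as min / 25% / median / 75% / max of recently observed gradient norms, and the logged threshold is Clipping_scale times the median. In the 11:55:18 entry above, 2.0 x 1.748e+02 = 3.496e+02, exactly the printed threshold, and the same relation holds for every quartile/threshold pair in this section. A sketch of that rule, assuming the quantiles are taken over a sliding window of per-batch gradient norms (how the window is maintained is a guess):

    import torch

    def clip_threshold(recent_grad_norms: torch.Tensor,
                       clipping_scale: float = 2.0) -> float:
        # Quartiles of recent gradient norms, in the order the log prints
        # them: min / 25% / 50% / 75% / max.
        q = torch.quantile(recent_grad_norms,
                           torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        # Assumed rule: threshold = clipping_scale x median; gradients with
        # norm above this would be scaled down before the optimizer step.
        return (clipping_scale * q[2]).item()

    norms = torch.tensor([131.0, 159.8, 174.8, 194.3, 276.8])
    print(clip_threshold(norms))  # 349.6, cf. threshold=3.496e+02 above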
Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.66 vs. limit=6.0 2023-10-11 11:58:48,658 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.12 vs. limit=15.0 2023-10-11 11:58:58,149 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.39 vs. limit=22.5 2023-10-11 11:58:59,847 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=699720.0, ans=0.125 2023-10-11 11:59:00,566 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.696e+02 1.847e+02 2.051e+02 2.903e+02, threshold=3.694e+02, percent-clipped=0.0 2023-10-11 11:59:13,084 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=699766.6666666666, ans=0.125 2023-10-11 11:59:22,711 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=699813.3333333334, ans=0.125 2023-10-11 11:59:40,053 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=699860.0, ans=0.125 2023-10-11 12:00:10,108 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=700000.0, ans=0.0 2023-10-11 12:00:28,235 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.23 vs. limit=6.0 2023-10-11 12:00:41,033 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=700140.0, ans=0.1 2023-10-11 12:00:42,827 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=700140.0, ans=0.125 2023-10-11 12:00:52,285 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.359e+02 1.691e+02 1.846e+02 2.124e+02 3.601e+02, threshold=3.693e+02, percent-clipped=0.0 2023-10-11 12:00:52,585 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=700186.6666666666, ans=0.0 2023-10-11 12:00:58,268 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.53 vs. limit=6.0 2023-10-11 12:01:00,755 INFO [train.py:1031] (0/4) Epoch 11, batch 13500, loss[loss=0.1762, simple_loss=0.2738, pruned_loss=0.03934, over 16902.00 frames. ], tot_loss[loss=0.2026, simple_loss=0.2916, pruned_loss=0.05679, over 32792868.60 frames. 
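Each train.py:1031 entry prints two losses: loss[...] for the single batch just processed and tot_loss[...] for a running aggregate. The tot_loss frame count hovers around 32.7M, roughly one average batch of ~16k frames times a window of about 2000 batches, which points to an exponentially decayed sum rather than a plain epoch average. A sketch under that assumption; the decay constant and the exact bookkeeping are guesses, not icefall's actual MetricsTracker:

    def update_tot_loss(tot: dict, cur: dict, decay_batches: int = 2000) -> dict:
        # Every tracked quantity, including 'frames', decays before the
        # current batch is added, so tot_loss reflects an effective window
        # of roughly decay_batches recent batches (assumed form).
        decay = 1.0 - 1.0 / decay_batches
        return {k: tot[k] * decay + cur[k] for k in tot}

    # Fold the batch-13500 stats above into the running aggregate:
    tot = {"loss_x_frames": 0.2026 * 32_792_868.60, "frames": 32_792_868.60}
    cur = {"loss_x_frames": 0.1762 * 16_902.0, "frames": 16_902.0}
    tot = update_tot_loss(tot, cur)
    print(tot["loss_x_frames"] / tot["frames"])  # displayed per-frame tot_loss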
], batch size: 104, lr: 3.17e-03, grad_scale: 32.0 2023-10-11 12:01:09,363 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=700233.3333333334, ans=0.0 2023-10-11 12:01:12,070 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=700280.0, ans=0.125 2023-10-11 12:01:24,543 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=700326.6666666666, ans=0.125 2023-10-11 12:01:33,555 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=700326.6666666666, ans=0.2 2023-10-11 12:01:46,557 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=700373.3333333334, ans=0.0 2023-10-11 12:01:56,955 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=700420.0, ans=0.125 2023-10-11 12:02:02,578 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=700466.6666666666, ans=0.1 2023-10-11 12:02:06,001 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.max_abs, batch_count=700466.6666666666, ans=10.0 2023-10-11 12:02:15,120 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=700513.3333333334, ans=0.1 2023-10-11 12:02:17,411 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.74 vs. limit=15.0 2023-10-11 12:02:21,651 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=700560.0, ans=0.02 2023-10-11 12:02:23,670 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=700560.0, ans=0.0 2023-10-11 12:02:23,752 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=700560.0, ans=0.0 2023-10-11 12:02:47,584 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.294e+02 1.767e+02 1.991e+02 2.377e+02 3.541e+02, threshold=3.981e+02, percent-clipped=0.0 2023-10-11 12:02:54,078 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.99 vs. limit=15.0 2023-10-11 12:03:06,833 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=700746.6666666666, ans=0.125 2023-10-11 12:03:10,053 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=700746.6666666666, ans=0.125 2023-10-11 12:03:12,899 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.13 vs. 
limit=15.0 2023-10-11 12:03:31,064 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=700840.0, ans=0.1 2023-10-11 12:03:34,419 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 12:03:34,497 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff2.min_abs, batch_count=700886.6666666666, ans=0.1 2023-10-11 12:03:48,315 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/epoch-11.pt 2023-10-11 12:04:16,835 INFO [train.py:1031] (0/4) Epoch 12, batch 0, loss[loss=0.1901, simple_loss=0.2737, pruned_loss=0.05324, over 16894.00 frames. ], tot_loss[loss=0.1901, simple_loss=0.2737, pruned_loss=0.05324, over 16894.00 frames. ], batch size: 123, lr: 3.02e-03, grad_scale: 32.0 2023-10-11 12:04:16,836 INFO [train.py:1054] (0/4) Computing validation loss 2023-10-11 12:04:24,775 INFO [train.py:1063] (0/4) Epoch 12, validation: loss=0.2194, simple_loss=0.3063, pruned_loss=0.06626, over 1020973.00 frames. 2023-10-11 12:04:24,776 INFO [train.py:1064] (0/4) Maximum memory allocated so far is 17165MB 2023-10-11 12:04:28,629 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.70 vs. limit=15.0 2023-10-11 12:04:33,554 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.20 vs. limit=15.0 2023-10-11 12:04:34,315 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.12 vs. limit=15.0 2023-10-11 12:04:40,847 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=701003.3333333334, ans=0.1 2023-10-11 12:04:52,063 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=701050.0, ans=0.09899494936611666 2023-10-11 12:05:05,285 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 12:05:09,743 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.817e+02 2.063e+02 2.337e+02 3.932e+02, threshold=4.127e+02, percent-clipped=0.0 2023-10-11 12:05:11,782 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=701096.6666666666, ans=0.125 2023-10-11 12:05:27,525 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=701190.0, ans=0.125 2023-10-11 12:05:36,373 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=701236.6666666666, ans=0.125 2023-10-11 12:05:44,851 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=701236.6666666666, ans=0.0 2023-10-11 12:06:03,090 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=701330.0, ans=0.0 2023-10-11 12:06:18,847 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=701376.6666666666, ans=0.07 2023-10-11 12:06:27,998 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=701423.3333333334, ans=0.5 2023-10-11 12:06:28,323 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.32 vs. limit=6.0 2023-10-11 12:06:42,944 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=701516.6666666666, ans=0.2 2023-10-11 12:07:02,655 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.358e+02 1.661e+02 1.853e+02 2.086e+02 3.214e+02, threshold=3.707e+02, percent-clipped=0.0 2023-10-11 12:07:08,176 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-10-11 12:07:09,684 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=701610.0, ans=0.2 2023-10-11 12:07:51,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=701796.6666666666, ans=0.125 2023-10-11 12:07:54,489 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=701796.6666666666, ans=0.0 2023-10-11 12:07:57,236 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=701796.6666666666, ans=0.1 2023-10-11 12:07:59,312 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=701843.3333333334, ans=0.125 2023-10-11 12:08:04,397 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.23 vs. limit=6.0 2023-10-11 12:08:20,731 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=701936.6666666666, ans=0.0 2023-10-11 12:08:25,299 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=701936.6666666666, ans=0.0 2023-10-11 12:08:48,704 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.725e+02 1.904e+02 2.193e+02 3.110e+02, threshold=3.808e+02, percent-clipped=0.0 2023-10-11 12:09:16,000 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=702123.3333333334, ans=0.0 2023-10-11 12:09:39,730 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.08 vs. 
limit=15.0 2023-10-11 12:09:45,908 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=702263.3333333334, ans=0.0 2023-10-11 12:09:45,985 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=702263.3333333334, ans=0.0 2023-10-11 12:09:53,521 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=702310.0, ans=0.1 2023-10-11 12:09:54,623 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=702310.0, ans=0.5 2023-10-11 12:10:06,489 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=702356.6666666666, ans=0.1 2023-10-11 12:10:09,245 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.56 vs. limit=15.0 2023-10-11 12:10:10,744 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=702356.6666666666, ans=0.125 2023-10-11 12:10:18,371 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.05 vs. limit=22.5 2023-10-11 12:10:19,786 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=702403.3333333334, ans=0.125 2023-10-11 12:10:22,518 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 12:10:40,355 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.06 vs. limit=15.0 2023-10-11 12:10:43,041 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.353e+02 1.710e+02 1.854e+02 2.151e+02 3.128e+02, threshold=3.708e+02, percent-clipped=0.0 2023-10-11 12:10:46,905 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=702543.3333333334, ans=0.0 2023-10-11 12:10:55,203 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=702590.0, ans=0.0 2023-10-11 12:11:44,183 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=702776.6666666666, ans=0.0 2023-10-11 12:11:57,954 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=702823.3333333334, ans=0.09899494936611666 2023-10-11 12:12:30,638 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.445e+02 1.689e+02 1.827e+02 2.030e+02 2.887e+02, threshold=3.654e+02, percent-clipped=0.0 2023-10-11 12:12:33,258 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.23 vs. limit=12.0 2023-10-11 12:13:13,529 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.19 vs. 
limit=15.0 2023-10-11 12:13:36,052 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 12:13:44,075 INFO [train.py:1031] (0/4) Epoch 12, batch 500, loss[loss=0.207, simple_loss=0.2938, pruned_loss=0.06015, over 15728.00 frames. ], tot_loss[loss=0.2026, simple_loss=0.2918, pruned_loss=0.05666, over 7289924.02 frames. ], batch size: 35, lr: 3.02e-03, grad_scale: 32.0 2023-10-11 12:13:58,533 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.70 vs. limit=6.0 2023-10-11 12:14:09,103 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=703383.3333333334, ans=0.125 2023-10-11 12:14:09,154 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=703383.3333333334, ans=0.125 2023-10-11 12:14:14,782 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=703383.3333333334, ans=0.1 2023-10-11 12:14:26,982 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.707e+02 1.954e+02 2.239e+02 3.077e+02, threshold=3.908e+02, percent-clipped=0.0 2023-10-11 12:14:59,760 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=703570.0, ans=0.0 2023-10-11 12:15:08,319 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=703616.6666666666, ans=0.0 2023-10-11 12:15:16,951 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=703663.3333333334, ans=0.0 2023-10-11 12:15:23,039 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=703663.3333333334, ans=0.125 2023-10-11 12:15:34,771 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=703710.0, ans=0.125 2023-10-11 12:16:02,550 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=703850.0, ans=0.125 2023-10-11 12:16:05,669 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=703850.0, ans=0.125 2023-10-11 12:16:08,697 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=703850.0, ans=0.125 2023-10-11 12:16:14,558 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.24 vs. limit=10.0 2023-10-11 12:16:20,928 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.86 vs. limit=15.0 2023-10-11 12:16:21,823 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.379e+02 1.727e+02 1.857e+02 2.055e+02 2.712e+02, threshold=3.713e+02, percent-clipped=0.0 2023-10-11 12:16:31,977 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=703943.3333333334, ans=0.1 2023-10-11 12:16:34,182 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.76 vs. 
limit=22.5 2023-10-11 12:16:48,121 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=704036.6666666666, ans=0.1 2023-10-11 12:16:53,905 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=704083.3333333334, ans=0.1 2023-10-11 12:16:58,986 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=704083.3333333334, ans=0.1 2023-10-11 12:17:05,876 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=704130.0, ans=0.125 2023-10-11 12:17:06,084 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 12:17:08,585 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=704130.0, ans=0.125 2023-10-11 12:17:12,762 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=704130.0, ans=0.0 2023-10-11 12:17:38,201 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=704270.0, ans=0.125 2023-10-11 12:17:54,430 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=704316.6666666666, ans=0.09899494936611666 2023-10-11 12:17:56,084 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=704316.6666666666, ans=0.125 2023-10-11 12:18:07,837 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.90 vs. limit=12.0 2023-10-11 12:18:09,614 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.446e+02 1.775e+02 1.993e+02 2.220e+02 2.926e+02, threshold=3.986e+02, percent-clipped=0.0 2023-10-11 12:18:11,502 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=704410.0, ans=0.2 2023-10-11 12:18:14,283 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=704410.0, ans=0.1 2023-10-11 12:18:24,638 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=4.155e-02 2023-10-11 12:18:32,526 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=704456.6666666666, ans=0.125 2023-10-11 12:18:48,588 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=704550.0, ans=0.125 2023-10-11 12:18:52,162 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=704550.0, ans=0.1 2023-10-11 12:18:58,023 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=704596.6666666666, ans=0.125 2023-10-11 12:19:00,411 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.88 vs. 
limit=22.5 2023-10-11 12:19:08,285 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=704643.3333333334, ans=0.125 2023-10-11 12:19:11,235 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=704643.3333333334, ans=0.1 2023-10-11 12:19:19,893 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=704690.0, ans=0.0 2023-10-11 12:19:25,447 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=704690.0, ans=0.1 2023-10-11 12:19:27,740 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=704690.0, ans=0.1 2023-10-11 12:19:33,628 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=704736.6666666666, ans=0.1 2023-10-11 12:19:35,341 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=704736.6666666666, ans=0.025 2023-10-11 12:20:10,103 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.263e+02 1.682e+02 1.809e+02 2.020e+02 2.684e+02, threshold=3.618e+02, percent-clipped=0.0 2023-10-11 12:20:14,953 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=704876.6666666666, ans=0.125 2023-10-11 12:20:42,245 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.91 vs. limit=15.0 2023-10-11 12:21:27,360 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=705203.3333333334, ans=0.125 2023-10-11 12:21:30,204 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.32 vs. limit=15.0 2023-10-11 12:21:34,778 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=705203.3333333334, ans=0.1 2023-10-11 12:21:44,444 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-10-11 12:21:49,672 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.43 vs. 
limit=15.0 2023-10-11 12:21:50,281 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=705296.6666666666, ans=0.015 2023-10-11 12:21:50,443 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=705296.6666666666, ans=0.1 2023-10-11 12:22:01,796 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.666e+02 1.897e+02 2.120e+02 3.296e+02, threshold=3.793e+02, percent-clipped=0.0 2023-10-11 12:22:04,786 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=705343.3333333334, ans=0.0 2023-10-11 12:22:37,504 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=705483.3333333334, ans=0.125 2023-10-11 12:22:54,980 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.19 vs. limit=15.0 2023-10-11 12:23:01,138 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=705576.6666666666, ans=0.05 2023-10-11 12:23:07,852 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=705576.6666666666, ans=0.04949747468305833 2023-10-11 12:23:09,815 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=705576.6666666666, ans=0.1 2023-10-11 12:23:11,989 INFO [train.py:1031] (0/4) Epoch 12, batch 1000, loss[loss=0.2124, simple_loss=0.2931, pruned_loss=0.06586, over 16002.00 frames. ], tot_loss[loss=0.2033, simple_loss=0.2922, pruned_loss=0.05725, over 12900318.27 frames. ], batch size: 296, lr: 3.01e-03, grad_scale: 32.0 2023-10-11 12:23:25,111 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=705670.0, ans=0.0 2023-10-11 12:23:34,483 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=705716.6666666666, ans=0.125 2023-10-11 12:23:41,834 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=705716.6666666666, ans=0.035 2023-10-11 12:23:51,509 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.32 vs. 
limit=15.0 2023-10-11 12:23:52,736 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.410e+02 1.682e+02 1.814e+02 2.051e+02 2.981e+02, threshold=3.629e+02, percent-clipped=0.0 2023-10-11 12:23:53,029 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=705810.0, ans=0.125 2023-10-11 12:24:06,179 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=705856.6666666666, ans=0.125 2023-10-11 12:24:15,617 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=705903.3333333334, ans=0.0 2023-10-11 12:24:16,621 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=705903.3333333334, ans=0.125 2023-10-11 12:24:19,233 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=705903.3333333334, ans=0.0 2023-10-11 12:24:50,408 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=706043.3333333334, ans=0.2 2023-10-11 12:24:59,177 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=706090.0, ans=0.125 2023-10-11 12:25:10,414 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=7.92 vs. limit=22.5 2023-10-11 12:25:15,169 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.78 vs. limit=15.0 2023-10-11 12:25:19,965 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.78 vs. limit=10.0 2023-10-11 12:25:24,074 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=706183.3333333334, ans=0.125 2023-10-11 12:25:25,720 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=706183.3333333334, ans=0.125 2023-10-11 12:25:31,606 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.38 vs. 
limit=15.0 2023-10-11 12:25:40,265 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=706230.0, ans=0.0 2023-10-11 12:25:46,328 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.766e+02 1.940e+02 2.109e+02 2.953e+02, threshold=3.880e+02, percent-clipped=0.0 2023-10-11 12:26:02,024 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=706323.3333333334, ans=0.125 2023-10-11 12:26:33,285 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=706416.6666666666, ans=0.125 2023-10-11 12:26:41,358 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=706463.3333333334, ans=0.125 2023-10-11 12:26:41,408 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=706463.3333333334, ans=0.0 2023-10-11 12:26:51,520 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=706510.0, ans=0.125 2023-10-11 12:26:52,516 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=706510.0, ans=0.125 2023-10-11 12:27:05,527 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=706556.6666666666, ans=10.0 2023-10-11 12:27:09,215 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=706556.6666666666, ans=0.125 2023-10-11 12:27:14,859 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=706603.3333333334, ans=0.125 2023-10-11 12:27:25,743 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=706650.0, ans=0.07 2023-10-11 12:27:25,847 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=706650.0, ans=0.125 2023-10-11 12:27:30,516 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=706650.0, ans=0.1 2023-10-11 12:27:47,682 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.347e+02 1.603e+02 1.795e+02 2.057e+02 2.830e+02, threshold=3.591e+02, percent-clipped=0.0 2023-10-11 12:28:10,418 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=706836.6666666666, ans=0.0 2023-10-11 12:28:14,590 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.75 vs. 
limit=15.0 2023-10-11 12:28:30,111 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=706930.0, ans=0.125 2023-10-11 12:28:35,309 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=706930.0, ans=0.0 2023-10-11 12:28:39,680 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=706930.0, ans=0.0 2023-10-11 12:28:50,507 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=707023.3333333334, ans=0.1 2023-10-11 12:29:05,088 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=707070.0, ans=0.125 2023-10-11 12:29:34,013 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.259e+02 1.602e+02 1.772e+02 1.976e+02 2.706e+02, threshold=3.544e+02, percent-clipped=0.0 2023-10-11 12:29:36,587 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=7.52 vs. limit=22.5 2023-10-11 12:29:38,203 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=707210.0, ans=0.0 2023-10-11 12:29:46,106 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.49 vs. limit=15.0 2023-10-11 12:30:05,953 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=707303.3333333334, ans=0.1 2023-10-11 12:30:46,863 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=707490.0, ans=0.125 2023-10-11 12:31:16,439 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=707630.0, ans=0.125 2023-10-11 12:31:18,622 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.min_positive, batch_count=707630.0, ans=0.025 2023-10-11 12:31:23,140 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=707630.0, ans=0.0 2023-10-11 12:31:25,852 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.382e+02 1.665e+02 1.867e+02 2.077e+02 3.015e+02, threshold=3.733e+02, percent-clipped=0.0 2023-10-11 12:31:47,214 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=707723.3333333334, ans=0.1 2023-10-11 12:31:50,636 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.69 vs. 
limit=15.0 2023-10-11 12:31:53,785 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=707770.0, ans=0.0 2023-10-11 12:31:55,888 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=707770.0, ans=0.125 2023-10-11 12:32:14,127 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=707816.6666666666, ans=0.125 2023-10-11 12:32:30,098 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=707910.0, ans=0.0 2023-10-11 12:32:41,911 INFO [train.py:1031] (0/4) Epoch 12, batch 1500, loss[loss=0.1768, simple_loss=0.2686, pruned_loss=0.04244, over 16646.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2907, pruned_loss=0.05662, over 17306537.91 frames. ], batch size: 56, lr: 3.01e-03, grad_scale: 32.0 2023-10-11 12:32:44,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=707956.6666666666, ans=0.2 2023-10-11 12:32:51,061 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.25 vs. limit=15.0 2023-10-11 12:33:25,983 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.308e+02 1.649e+02 1.784e+02 2.034e+02 2.580e+02, threshold=3.568e+02, percent-clipped=0.0 2023-10-11 12:33:35,519 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.85 vs. limit=10.0 2023-10-11 12:33:37,258 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=708190.0, ans=0.0 2023-10-11 12:33:43,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=708190.0, ans=0.09899494936611666 2023-10-11 12:33:50,356 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=708236.6666666666, ans=0.0 2023-10-11 12:34:09,042 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=708283.3333333334, ans=0.0 2023-10-11 12:34:15,003 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=708330.0, ans=0.0 2023-10-11 12:34:16,148 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.08 vs. limit=15.0 2023-10-11 12:34:25,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=708376.6666666666, ans=0.015 2023-10-11 12:34:32,940 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=708376.6666666666, ans=0.0 2023-10-11 12:34:36,617 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=708423.3333333334, ans=0.2 2023-10-11 12:34:42,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=708423.3333333334, ans=0.1 2023-10-11 12:34:48,193 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.91 vs. 
limit=15.0 2023-10-11 12:35:09,782 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.54 vs. limit=12.0 2023-10-11 12:35:24,208 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.654e+02 1.810e+02 2.078e+02 2.616e+02, threshold=3.619e+02, percent-clipped=0.0 2023-10-11 12:35:30,766 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=708610.0, ans=0.0 2023-10-11 12:35:31,658 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=708610.0, ans=0.1 2023-10-11 12:35:32,502 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=708610.0, ans=0.04949747468305833 2023-10-11 12:35:57,103 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.79 vs. limit=12.0 2023-10-11 12:36:00,478 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=708750.0, ans=0.025 2023-10-11 12:36:04,608 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 12:36:09,880 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=708796.6666666666, ans=0.1 2023-10-11 12:36:25,554 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=708843.3333333334, ans=0.125 2023-10-11 12:36:37,006 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=708890.0, ans=0.125 2023-10-11 12:36:41,602 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=708936.6666666666, ans=0.125 2023-10-11 12:36:55,469 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=708983.3333333334, ans=0.0 2023-10-11 12:37:04,118 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.67 vs. 
limit=15.0 2023-10-11 12:37:12,036 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=709030.0, ans=0.125 2023-10-11 12:37:12,569 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.319e+02 1.679e+02 1.871e+02 2.086e+02 2.821e+02, threshold=3.742e+02, percent-clipped=0.0 2023-10-11 12:37:12,917 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=709076.6666666666, ans=0.2 2023-10-11 12:37:29,518 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=709123.3333333334, ans=0.125 2023-10-11 12:37:29,540 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=709123.3333333334, ans=0.125 2023-10-11 12:37:37,029 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=709170.0, ans=0.125 2023-10-11 12:37:37,855 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=709170.0, ans=0.125 2023-10-11 12:37:50,108 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=709216.6666666666, ans=0.1 2023-10-11 12:37:51,368 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=709216.6666666666, ans=0.0 2023-10-11 12:38:12,451 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=709310.0, ans=0.125 2023-10-11 12:38:16,869 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-152000.pt 2023-10-11 12:38:41,196 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=709403.3333333334, ans=0.125 2023-10-11 12:38:48,880 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=709450.0, ans=0.2 2023-10-11 12:38:59,287 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.98 vs. 
limit=22.5 2023-10-11 12:39:04,865 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=709496.6666666666, ans=0.5 2023-10-11 12:39:09,698 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.368e+02 1.636e+02 1.801e+02 2.020e+02 3.521e+02, threshold=3.602e+02, percent-clipped=0.0 2023-10-11 12:39:17,277 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=709543.3333333334, ans=0.0 2023-10-11 12:39:37,670 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=709636.6666666666, ans=0.0 2023-10-11 12:39:42,643 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=709636.6666666666, ans=6.0 2023-10-11 12:39:53,548 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=709683.3333333334, ans=0.1 2023-10-11 12:39:55,125 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 12:40:10,710 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=709776.6666666666, ans=0.0 2023-10-11 12:40:32,051 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=709870.0, ans=0.07 2023-10-11 12:40:33,004 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 12:40:45,724 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=709916.6666666666, ans=0.125 2023-10-11 12:41:03,226 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.336e+02 1.751e+02 1.939e+02 2.153e+02 2.870e+02, threshold=3.878e+02, percent-clipped=0.0 2023-10-11 12:41:04,014 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 12:41:19,907 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=710056.6666666666, ans=0.07 2023-10-11 12:41:29,061 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=710056.6666666666, ans=0.125 2023-10-11 12:41:38,131 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=710103.3333333334, ans=0.1 2023-10-11 12:42:00,519 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=710196.6666666666, ans=0.0 2023-10-11 12:42:11,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=710243.3333333334, ans=0.125 2023-10-11 12:42:25,865 INFO [train.py:1031] (0/4) Epoch 12, batch 2000, loss[loss=0.2027, simple_loss=0.2979, pruned_loss=0.0537, over 16859.00 frames. ], tot_loss[loss=0.2024, simple_loss=0.2914, pruned_loss=0.05667, over 20748475.72 frames. 
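Two kinds of checkpoints appear in this stretch: the per-epoch save (epoch-11.pt at 12:03:48) and a batch-indexed save (zipformer/exp_XL_bpe/checkpoint-152000.pt at 12:38:16). 152000 is a multiple of 8000, so a save-every-N-batches cadence fits what the log shows. A sketch of that pattern; the cadence and the saved payload are assumptions, not icefall's checkpoint.py:

    from pathlib import Path
    import torch

    def maybe_save_checkpoint(model: torch.nn.Module, exp_dir: Path,
                              batch_idx_train: int,
                              save_every_n: int = 8000) -> None:
        # Batch-indexed checkpoints like checkpoint-152000.pt (assumed cadence).
        if batch_idx_train > 0 and batch_idx_train % save_every_n == 0:
            exp_dir.mkdir(parents=True, exist_ok=True)
            path = exp_dir / f"checkpoint-{batch_idx_train}.pt"
            torch.save({"model": model.state_dict(),
                        "batch_idx_train": batch_idx_train}, path)

    maybe_save_checkpoint(torch.nn.Linear(4, 4), Path("zipformer/exp_XL_bpe"),
                          batch_idx_train=152_000)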
], batch size: 72, lr: 3.00e-03, grad_scale: 32.0 2023-10-11 12:42:32,927 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=710290.0, ans=0.125 2023-10-11 12:43:17,934 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=710430.0, ans=0.125 2023-10-11 12:43:18,008 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=710430.0, ans=0.2 2023-10-11 12:43:20,340 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.389e+02 1.700e+02 1.863e+02 2.117e+02 2.937e+02, threshold=3.725e+02, percent-clipped=0.0 2023-10-11 12:43:27,521 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=710476.6666666666, ans=0.05 2023-10-11 12:43:27,522 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=710476.6666666666, ans=0.1 2023-10-11 12:43:30,186 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=710476.6666666666, ans=0.0 2023-10-11 12:43:48,200 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.33 vs. limit=15.0 2023-10-11 12:43:53,625 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.05 vs. limit=15.0 2023-10-11 12:44:03,833 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.39 vs. 
limit=22.5 2023-10-11 12:44:50,297 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 12:45:35,529 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.356e+02 1.681e+02 1.916e+02 2.186e+02 2.783e+02, threshold=3.832e+02, percent-clipped=0.0 2023-10-11 12:45:47,758 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=710943.3333333334, ans=0.0 2023-10-11 12:46:21,131 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=711083.3333333334, ans=0.0 2023-10-11 12:46:22,028 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=711083.3333333334, ans=0.0 2023-10-11 12:46:25,068 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=711083.3333333334, ans=0.125 2023-10-11 12:47:31,432 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.384e+02 1.825e+02 1.995e+02 2.262e+02 3.131e+02, threshold=3.989e+02, percent-clipped=0.0 2023-10-11 12:47:33,770 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=711410.0, ans=0.1 2023-10-11 12:47:34,626 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=711410.0, ans=0.025 2023-10-11 12:47:46,944 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=711456.6666666666, ans=0.125 2023-10-11 12:47:46,965 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=711456.6666666666, ans=0.0 2023-10-11 12:48:04,100 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.06 vs. limit=22.5 2023-10-11 12:48:04,104 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.35 vs. limit=10.0 2023-10-11 12:48:16,282 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=711596.6666666666, ans=0.1 2023-10-11 12:48:20,068 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=711596.6666666666, ans=0.125 2023-10-11 12:48:34,526 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.33 vs. 
limit=15.0 2023-10-11 12:48:57,752 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=711783.3333333334, ans=15.0 2023-10-11 12:49:01,389 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=711783.3333333334, ans=0.0 2023-10-11 12:49:03,931 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=711783.3333333334, ans=0.1 2023-10-11 12:49:14,246 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=711830.0, ans=0.125 2023-10-11 12:49:18,430 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.414e+02 1.760e+02 1.937e+02 2.187e+02 3.216e+02, threshold=3.873e+02, percent-clipped=0.0 2023-10-11 12:49:49,961 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=711970.0, ans=0.2 2023-10-11 12:49:58,870 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=712016.6666666666, ans=0.2 2023-10-11 12:50:02,520 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=712016.6666666666, ans=0.1 2023-10-11 12:50:17,291 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=712110.0, ans=0.2 2023-10-11 12:50:28,213 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 12:50:33,479 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=712156.6666666666, ans=0.0 2023-10-11 12:50:48,817 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.13 vs. limit=12.0 2023-10-11 12:50:51,462 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=712250.0, ans=0.125 2023-10-11 12:50:53,270 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=712250.0, ans=0.0 2023-10-11 12:50:53,419 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=712250.0, ans=0.125 2023-10-11 12:51:07,205 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.697e+02 1.854e+02 2.019e+02 2.574e+02, threshold=3.707e+02, percent-clipped=0.0 2023-10-11 12:51:10,514 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.76 vs. limit=15.0 2023-10-11 12:51:11,969 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=712343.3333333334, ans=0.125 2023-10-11 12:51:16,782 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=712343.3333333334, ans=0.125 2023-10-11 12:51:42,171 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.11 vs. 
limit=15.0 2023-10-11 12:52:01,217 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=712576.6666666666, ans=0.0 2023-10-11 12:52:13,096 INFO [train.py:1031] (0/4) Epoch 12, batch 2500, loss[loss=0.2107, simple_loss=0.3017, pruned_loss=0.05988, over 16963.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2915, pruned_loss=0.05677, over 23424956.02 frames. ], batch size: 123, lr: 3.00e-03, grad_scale: 32.0 2023-10-11 12:52:15,151 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=712623.3333333334, ans=0.015 2023-10-11 12:52:53,986 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 1.764e+02 1.957e+02 2.259e+02 3.026e+02, threshold=3.914e+02, percent-clipped=0.0 2023-10-11 12:53:14,237 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=712856.6666666666, ans=0.2 2023-10-11 12:53:16,285 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.83 vs. limit=6.0 2023-10-11 12:53:19,906 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.33 vs. limit=15.0 2023-10-11 12:53:27,713 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=712950.0, ans=0.0 2023-10-11 12:53:32,860 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=712950.0, ans=0.125 2023-10-11 12:53:37,188 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=712950.0, ans=0.1 2023-10-11 12:54:02,895 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.06 vs. limit=15.0 2023-10-11 12:54:13,003 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=713136.6666666666, ans=0.125 2023-10-11 12:54:17,091 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=713136.6666666666, ans=0.0 2023-10-11 12:54:19,342 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.60 vs. limit=22.5 2023-10-11 12:54:28,120 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=713183.3333333334, ans=0.125 2023-10-11 12:54:31,840 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=713183.3333333334, ans=0.1 2023-10-11 12:54:34,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=713183.3333333334, ans=0.1 2023-10-11 12:54:41,461 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.80 vs. 
limit=22.5 2023-10-11 12:54:46,557 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.345e+02 1.664e+02 1.833e+02 2.136e+02 2.919e+02, threshold=3.666e+02, percent-clipped=0.0 2023-10-11 12:55:49,910 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=713510.0, ans=0.0 2023-10-11 12:56:14,894 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=713603.3333333334, ans=0.125 2023-10-11 12:56:18,220 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 12:56:38,213 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.92 vs. limit=15.0 2023-10-11 12:56:42,508 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.397e+02 1.692e+02 1.877e+02 2.135e+02 2.987e+02, threshold=3.754e+02, percent-clipped=0.0 2023-10-11 12:56:53,767 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=713743.3333333334, ans=0.125 2023-10-11 12:57:22,713 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=713883.3333333334, ans=0.1 2023-10-11 12:57:40,973 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=713930.0, ans=0.125 2023-10-11 12:57:48,688 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=713976.6666666666, ans=0.125 2023-10-11 12:57:56,811 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.94 vs. limit=22.5 2023-10-11 12:57:58,012 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=714023.3333333334, ans=0.2 2023-10-11 12:58:04,424 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.11 vs. 
limit=15.0 2023-10-11 12:58:08,034 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=714023.3333333334, ans=0.2 2023-10-11 12:58:15,274 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=714070.0, ans=0.1 2023-10-11 12:58:15,306 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=714070.0, ans=0.2 2023-10-11 12:58:19,712 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=714116.6666666666, ans=0.0 2023-10-11 12:58:42,050 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=714163.3333333334, ans=22.5 2023-10-11 12:58:47,658 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.330e+02 1.684e+02 1.904e+02 2.108e+02 2.908e+02, threshold=3.807e+02, percent-clipped=0.0 2023-10-11 12:58:48,353 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=714210.0, ans=0.125 2023-10-11 12:58:51,460 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=714210.0, ans=0.125 2023-10-11 12:58:52,776 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=714210.0, ans=0.0 2023-10-11 12:58:52,985 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.07 vs. limit=15.0 2023-10-11 12:59:17,384 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=714303.3333333334, ans=0.0 2023-10-11 13:00:02,198 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=714443.3333333334, ans=0.125 2023-10-11 13:00:02,456 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.40 vs. limit=22.5 2023-10-11 13:00:06,243 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.20 vs. limit=15.0 2023-10-11 13:00:06,928 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=714490.0, ans=0.0 2023-10-11 13:00:10,076 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=714490.0, ans=0.125 2023-10-11 13:00:18,303 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=714536.6666666666, ans=0.0 2023-10-11 13:00:49,040 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=714630.0, ans=0.125 2023-10-11 13:00:50,334 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.50 vs. 
limit=15.0 2023-10-11 13:00:53,019 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.676e+02 1.838e+02 2.055e+02 2.871e+02, threshold=3.676e+02, percent-clipped=0.0 2023-10-11 13:00:55,647 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.57 vs. limit=15.0 2023-10-11 13:00:59,394 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.89 vs. limit=22.5 2023-10-11 13:01:09,383 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.47 vs. limit=15.0 2023-10-11 13:01:28,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=714816.6666666666, ans=0.125 2023-10-11 13:01:37,206 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=714863.3333333334, ans=0.125 2023-10-11 13:01:49,119 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=714910.0, ans=0.1 2023-10-11 13:01:58,182 INFO [train.py:1031] (0/4) Epoch 12, batch 3000, loss[loss=0.1935, simple_loss=0.2822, pruned_loss=0.05243, over 16945.00 frames. ], tot_loss[loss=0.2022, simple_loss=0.2908, pruned_loss=0.05678, over 25501956.94 frames. ], batch size: 123, lr: 2.99e-03, grad_scale: 32.0 2023-10-11 13:02:18,277 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=715050.0, ans=0.0 2023-10-11 13:02:41,806 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.380e+02 1.650e+02 1.873e+02 2.068e+02 3.296e+02, threshold=3.745e+02, percent-clipped=0.0 2023-10-11 13:02:59,260 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=715190.0, ans=0.125 2023-10-11 13:03:27,948 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=715330.0, ans=0.125 2023-10-11 13:03:38,186 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.96 vs. limit=22.5 2023-10-11 13:03:49,217 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.78 vs. 
limit=22.5 2023-10-11 13:04:09,880 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=715470.0, ans=0.05 2023-10-11 13:04:30,121 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=715563.3333333334, ans=0.125 2023-10-11 13:04:39,762 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.410e+02 1.853e+02 2.050e+02 2.371e+02 3.733e+02, threshold=4.100e+02, percent-clipped=0.0 2023-10-11 13:04:56,478 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=715656.6666666666, ans=0.125 2023-10-11 13:05:02,868 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=715703.3333333334, ans=0.0 2023-10-11 13:05:34,448 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=715843.3333333334, ans=0.1 2023-10-11 13:05:36,363 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=715843.3333333334, ans=0.1 2023-10-11 13:05:37,544 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=715843.3333333334, ans=0.0 2023-10-11 13:05:50,783 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.49 vs. limit=22.5 2023-10-11 13:06:16,131 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=715983.3333333334, ans=0.1 2023-10-11 13:06:25,210 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.31 vs. 
limit=15.0 2023-10-11 13:06:27,264 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=716030.0, ans=0.2 2023-10-11 13:06:30,126 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=716030.0, ans=0.125 2023-10-11 13:06:40,044 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.260e+02 1.684e+02 1.873e+02 2.026e+02 3.332e+02, threshold=3.746e+02, percent-clipped=0.0 2023-10-11 13:06:49,704 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=716076.6666666666, ans=0.125 2023-10-11 13:07:04,696 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=716123.3333333334, ans=0.2 2023-10-11 13:07:19,019 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=716216.6666666666, ans=0.125 2023-10-11 13:07:36,874 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff3.min_abs, batch_count=716263.3333333334, ans=0.2 2023-10-11 13:07:50,384 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=716310.0, ans=0.125 2023-10-11 13:07:59,505 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=716356.6666666666, ans=0.1 2023-10-11 13:08:11,066 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=716403.3333333334, ans=0.125 2023-10-11 13:08:19,663 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 13:08:32,501 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=716496.6666666666, ans=0.125 2023-10-11 13:08:36,706 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.324e+02 1.700e+02 1.946e+02 2.132e+02 2.838e+02, threshold=3.892e+02, percent-clipped=0.0 2023-10-11 13:08:51,884 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.92 vs. limit=12.0 2023-10-11 13:09:01,831 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=716636.6666666666, ans=0.0 2023-10-11 13:09:01,853 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=716636.6666666666, ans=0.1 2023-10-11 13:09:03,739 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=716636.6666666666, ans=0.0 2023-10-11 13:09:31,669 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=716730.0, ans=0.0 2023-10-11 13:09:32,576 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=716730.0, ans=0.0 2023-10-11 13:10:05,580 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. 
limit=6.0 2023-10-11 13:10:06,338 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=716916.6666666666, ans=0.0 2023-10-11 13:10:19,042 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=716963.3333333334, ans=0.125 2023-10-11 13:10:20,109 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=716963.3333333334, ans=0.0 2023-10-11 13:10:24,048 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=716963.3333333334, ans=0.2 2023-10-11 13:10:28,880 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.343e+02 1.724e+02 1.891e+02 2.086e+02 2.802e+02, threshold=3.782e+02, percent-clipped=0.0 2023-10-11 13:10:47,248 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=717056.6666666666, ans=0.125 2023-10-11 13:11:25,583 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=717243.3333333334, ans=0.2 2023-10-11 13:11:34,828 INFO [train.py:1031] (0/4) Epoch 12, batch 3500, loss[loss=0.2119, simple_loss=0.2999, pruned_loss=0.06197, over 16722.00 frames. ], tot_loss[loss=0.2024, simple_loss=0.2907, pruned_loss=0.05698, over 27113986.45 frames. ], batch size: 202, lr: 2.99e-03, grad_scale: 32.0 2023-10-11 13:11:35,177 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=717290.0, ans=0.125 2023-10-11 13:11:51,815 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=717336.6666666666, ans=0.1 2023-10-11 13:11:57,832 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.02 vs. limit=15.0 2023-10-11 13:12:06,538 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=717430.0, ans=0.1 2023-10-11 13:12:09,253 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.21 vs. 
limit=22.5 2023-10-11 13:12:11,808 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=717430.0, ans=0.125 2023-10-11 13:12:18,985 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.685e+02 1.808e+02 1.987e+02 2.523e+02, threshold=3.616e+02, percent-clipped=0.0 2023-10-11 13:12:21,889 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=717476.6666666666, ans=0.0 2023-10-11 13:12:41,176 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=717570.0, ans=10.0 2023-10-11 13:12:41,215 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=717570.0, ans=0.2 2023-10-11 13:13:11,422 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=717663.3333333334, ans=0.1 2023-10-11 13:13:16,200 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.51 vs. limit=15.0 2023-10-11 13:13:35,912 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=717756.6666666666, ans=0.125 2023-10-11 13:13:39,017 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.60 vs. limit=22.5 2023-10-11 13:13:53,242 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=717850.0, ans=0.0 2023-10-11 13:14:08,423 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=717896.6666666666, ans=0.1 2023-10-11 13:14:12,763 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.60 vs. limit=15.0 2023-10-11 13:14:17,969 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.398e+02 1.685e+02 1.910e+02 2.125e+02 3.000e+02, threshold=3.820e+02, percent-clipped=0.0 2023-10-11 13:14:30,839 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=717990.0, ans=0.125 2023-10-11 13:14:39,151 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=718036.6666666666, ans=0.2 2023-10-11 13:14:42,109 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=718036.6666666666, ans=0.2 2023-10-11 13:14:51,371 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=718083.3333333334, ans=0.95 2023-10-11 13:15:05,847 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=718130.0, ans=0.125 2023-10-11 13:15:12,171 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.85 vs. limit=15.0 2023-10-11 13:15:38,562 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.35 vs. 
limit=15.0 2023-10-11 13:15:38,643 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.01 vs. limit=15.0 2023-10-11 13:15:47,913 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=718316.6666666666, ans=0.0 2023-10-11 13:16:16,018 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.657e+02 1.878e+02 2.181e+02 2.770e+02, threshold=3.756e+02, percent-clipped=0.0 2023-10-11 13:16:19,212 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=718410.0, ans=0.125 2023-10-11 13:16:54,561 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 13:17:10,587 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=718596.6666666666, ans=0.05 2023-10-11 13:17:15,129 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=718643.3333333334, ans=0.125 2023-10-11 13:17:32,354 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=718690.0, ans=0.2 2023-10-11 13:17:53,591 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.65 vs. limit=15.0 2023-10-11 13:17:59,805 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=718783.3333333334, ans=0.1 2023-10-11 13:18:04,066 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten.whitening_limit, batch_count=718830.0, ans=22.5 2023-10-11 13:18:10,364 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=718830.0, ans=0.125 2023-10-11 13:18:10,866 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.74 vs. 
limit=6.0 2023-10-11 13:18:11,531 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=718876.6666666666, ans=0.125 2023-10-11 13:18:13,028 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.361e+02 1.726e+02 1.839e+02 2.012e+02 2.587e+02, threshold=3.677e+02, percent-clipped=0.0 2023-10-11 13:18:30,117 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=718923.3333333334, ans=0.0 2023-10-11 13:18:48,256 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=719016.6666666666, ans=0.1 2023-10-11 13:19:04,879 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=719110.0, ans=0.2 2023-10-11 13:19:04,936 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_positive, batch_count=719110.0, ans=0.05 2023-10-11 13:19:12,917 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=719110.0, ans=0.07 2023-10-11 13:19:26,272 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=719156.6666666666, ans=0.0 2023-10-11 13:19:38,363 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 13:19:49,428 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=719250.0, ans=0.0 2023-10-11 13:19:54,465 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=719296.6666666666, ans=0.1 2023-10-11 13:19:59,833 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=719296.6666666666, ans=0.2 2023-10-11 13:20:02,779 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.674e+02 1.932e+02 2.186e+02 2.864e+02, threshold=3.863e+02, percent-clipped=0.0 2023-10-11 13:20:03,854 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=719343.3333333334, ans=0.0 2023-10-11 13:20:04,916 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=719343.3333333334, ans=0.125 2023-10-11 13:20:15,186 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=719390.0, ans=0.035 2023-10-11 13:21:01,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=719576.6666666666, ans=0.125 2023-10-11 13:21:01,309 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=719576.6666666666, ans=0.0 2023-10-11 13:21:09,402 INFO [train.py:1031] (0/4) Epoch 12, batch 4000, loss[loss=0.2027, simple_loss=0.3026, pruned_loss=0.05143, over 16816.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.2902, pruned_loss=0.0568, over 28372168.23 frames. 
], batch size: 155, lr: 2.98e-03, grad_scale: 32.0 2023-10-11 13:21:18,015 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=719623.3333333334, ans=0.125 2023-10-11 13:21:37,041 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=719716.6666666666, ans=0.0 2023-10-11 13:21:52,097 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.62 vs. limit=15.0 2023-10-11 13:22:00,932 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.714e+02 1.907e+02 2.159e+02 2.863e+02, threshold=3.814e+02, percent-clipped=0.0 2023-10-11 13:22:01,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=719810.0, ans=0.125 2023-10-11 13:22:03,260 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=719810.0, ans=0.2 2023-10-11 13:22:11,986 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=719856.6666666666, ans=0.125 2023-10-11 13:22:17,881 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.62 vs. limit=6.0 2023-10-11 13:22:20,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=719856.6666666666, ans=0.0 2023-10-11 13:22:20,617 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=719856.6666666666, ans=0.2 2023-10-11 13:22:36,282 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.23 vs. limit=15.0 2023-10-11 13:22:38,798 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=719950.0, ans=0.1 2023-10-11 13:22:41,608 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=719950.0, ans=0.2 2023-10-11 13:22:42,843 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.81 vs. limit=22.5 2023-10-11 13:22:55,272 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=719996.6666666666, ans=0.1 2023-10-11 13:22:57,721 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=720043.3333333334, ans=0.2 2023-10-11 13:23:05,641 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=720043.3333333334, ans=0.125 2023-10-11 13:23:34,051 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=720183.3333333334, ans=0.1 2023-10-11 13:23:35,737 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.18 vs. 
limit=6.0 2023-10-11 13:23:46,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=720230.0, ans=0.1 2023-10-11 13:23:49,306 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=720230.0, ans=0.1 2023-10-11 13:23:50,142 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=720230.0, ans=0.1 2023-10-11 13:23:50,346 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.77 vs. limit=22.5 2023-10-11 13:23:57,362 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.478e+02 1.739e+02 1.919e+02 2.150e+02 2.720e+02, threshold=3.838e+02, percent-clipped=0.0 2023-10-11 13:24:25,606 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=720370.0, ans=0.2 2023-10-11 13:24:25,802 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.05 vs. limit=15.0 2023-10-11 13:24:43,413 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=720416.6666666666, ans=0.0 2023-10-11 13:24:45,046 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=720416.6666666666, ans=0.0 2023-10-11 13:24:56,825 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=720463.3333333334, ans=0.0 2023-10-11 13:25:00,183 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=720463.3333333334, ans=0.04949747468305833 2023-10-11 13:25:14,861 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.53 vs. limit=22.5 2023-10-11 13:25:16,170 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=720556.6666666666, ans=0.0 2023-10-11 13:25:16,204 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=720556.6666666666, ans=0.125 2023-10-11 13:25:33,198 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 13:25:33,239 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=720603.3333333334, ans=0.0 2023-10-11 13:26:06,070 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=720743.3333333334, ans=0.035 2023-10-11 13:26:09,166 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.654e+02 1.805e+02 2.012e+02 2.655e+02, threshold=3.609e+02, percent-clipped=0.0 2023-10-11 13:26:11,715 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.00 vs. 
limit=15.0 2023-10-11 13:26:31,540 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=720836.6666666666, ans=0.125 2023-10-11 13:26:49,664 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.47 vs. limit=22.5 2023-10-11 13:26:50,876 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=720930.0, ans=0.125 2023-10-11 13:26:51,755 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=720930.0, ans=0.125 2023-10-11 13:26:51,812 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=720930.0, ans=0.0 2023-10-11 13:27:07,974 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=720976.6666666666, ans=0.1 2023-10-11 13:27:22,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=721023.3333333334, ans=0.0 2023-10-11 13:27:35,030 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.06 vs. limit=15.0 2023-10-11 13:27:54,447 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=721163.3333333334, ans=0.125 2023-10-11 13:27:55,530 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.41 vs. limit=15.0 2023-10-11 13:27:58,266 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=721210.0, ans=0.1 2023-10-11 13:27:58,744 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.359e+02 1.724e+02 1.835e+02 2.048e+02 3.224e+02, threshold=3.670e+02, percent-clipped=0.0 2023-10-11 13:28:14,910 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=721256.6666666666, ans=0.2 2023-10-11 13:28:29,869 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.25 vs. limit=22.5 2023-10-11 13:28:40,437 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=721350.0, ans=0.2 2023-10-11 13:28:50,968 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.83 vs. 
limit=22.5 2023-10-11 13:29:08,131 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 13:29:09,119 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=721490.0, ans=0.125 2023-10-11 13:29:23,033 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=721536.6666666666, ans=0.2 2023-10-11 13:29:25,504 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=721536.6666666666, ans=0.125 2023-10-11 13:29:32,458 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.64 vs. limit=22.5 2023-10-11 13:29:46,174 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=721630.0, ans=0.0 2023-10-11 13:29:51,611 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=721630.0, ans=0.0 2023-10-11 13:29:55,329 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=721676.6666666666, ans=0.125 2023-10-11 13:29:58,535 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.415e+02 1.740e+02 1.928e+02 2.135e+02 3.122e+02, threshold=3.856e+02, percent-clipped=0.0 2023-10-11 13:30:04,652 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=721676.6666666666, ans=0.0 2023-10-11 13:30:18,183 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.02 vs. limit=15.0 2023-10-11 13:30:18,375 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.78 vs. limit=22.5 2023-10-11 13:30:27,347 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=721770.0, ans=0.125 2023-10-11 13:30:59,276 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 13:31:07,478 INFO [train.py:1031] (0/4) Epoch 12, batch 4500, loss[loss=0.172, simple_loss=0.2705, pruned_loss=0.03677, over 16827.00 frames. ], tot_loss[loss=0.2021, simple_loss=0.2906, pruned_loss=0.05678, over 29332627.56 frames. ], batch size: 98, lr: 2.98e-03, grad_scale: 32.0 2023-10-11 13:31:23,299 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=722003.3333333334, ans=0.0 2023-10-11 13:31:25,277 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.29 vs. limit=15.0 2023-10-11 13:31:26,067 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.97 vs. 
limit=15.0 2023-10-11 13:31:33,027 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 13:31:52,222 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.376e+02 1.657e+02 1.796e+02 2.027e+02 2.731e+02, threshold=3.592e+02, percent-clipped=0.0 2023-10-11 13:31:57,233 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=722143.3333333334, ans=0.125 2023-10-11 13:32:29,217 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=722283.3333333334, ans=0.0 2023-10-11 13:32:31,130 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=722330.0, ans=0.125 2023-10-11 13:32:31,136 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=722330.0, ans=0.125 2023-10-11 13:33:17,580 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=722516.6666666666, ans=0.0 2023-10-11 13:33:25,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=722563.3333333334, ans=0.05 2023-10-11 13:33:36,902 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.422e+02 1.712e+02 1.900e+02 2.088e+02 2.791e+02, threshold=3.801e+02, percent-clipped=0.0 2023-10-11 13:33:42,793 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=722610.0, ans=0.0 2023-10-11 13:33:51,860 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=722656.6666666666, ans=0.2 2023-10-11 13:34:06,305 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.41 vs. limit=6.0 2023-10-11 13:34:11,445 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=722750.0, ans=0.2 2023-10-11 13:34:46,148 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=722890.0, ans=0.2 2023-10-11 13:35:03,429 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=722983.3333333334, ans=0.0 2023-10-11 13:35:13,508 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.82 vs. 
limit=15.0 2023-10-11 13:35:14,934 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=723030.0, ans=0.125 2023-10-11 13:35:26,069 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 1.721e+02 1.856e+02 1.989e+02 3.133e+02, threshold=3.712e+02, percent-clipped=0.0 2023-10-11 13:35:27,311 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=723076.6666666666, ans=0.1 2023-10-11 13:35:34,271 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=723123.3333333334, ans=0.2 2023-10-11 13:35:48,092 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.90 vs. limit=22.5 2023-10-11 13:36:24,566 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.99 vs. limit=15.0 2023-10-11 13:36:31,181 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=723356.6666666666, ans=0.125 2023-10-11 13:36:55,873 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=723450.0, ans=15.0 2023-10-11 13:36:56,538 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=723450.0, ans=0.09899494936611666 2023-10-11 13:37:03,966 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=723450.0, ans=0.125 2023-10-11 13:37:06,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=723496.6666666666, ans=0.0 2023-10-11 13:37:21,294 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=723543.3333333334, ans=0.125 2023-10-11 13:37:22,099 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.403e+02 1.674e+02 1.834e+02 2.184e+02 3.536e+02, threshold=3.667e+02, percent-clipped=0.0 2023-10-11 13:37:30,831 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.04 vs. 
limit=15.0 2023-10-11 13:37:46,480 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=723636.6666666666, ans=0.125 2023-10-11 13:37:55,332 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=723683.3333333334, ans=0.2 2023-10-11 13:37:57,368 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=723683.3333333334, ans=0.2 2023-10-11 13:37:59,510 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=723683.3333333334, ans=0.0 2023-10-11 13:38:38,589 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=723870.0, ans=0.125 2023-10-11 13:38:43,872 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=723870.0, ans=22.5 2023-10-11 13:39:19,092 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=724010.0, ans=0.2 2023-10-11 13:39:20,907 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.310e+02 1.668e+02 1.843e+02 2.122e+02 2.826e+02, threshold=3.687e+02, percent-clipped=0.0 2023-10-11 13:39:29,562 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.61 vs. limit=15.0 2023-10-11 13:39:32,057 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.70 vs. limit=15.0 2023-10-11 13:40:01,651 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=724196.6666666666, ans=0.0 2023-10-11 13:40:03,529 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=724196.6666666666, ans=0.125 2023-10-11 13:40:14,606 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=724243.3333333334, ans=0.2 2023-10-11 13:40:23,659 INFO [train.py:1031] (0/4) Epoch 12, batch 5000, loss[loss=0.1886, simple_loss=0.2878, pruned_loss=0.04465, over 16903.00 frames. ], tot_loss[loss=0.2024, simple_loss=0.2907, pruned_loss=0.05707, over 30101561.72 frames. ], batch size: 93, lr: 2.97e-03, grad_scale: 32.0 2023-10-11 13:40:42,215 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=724336.6666666666, ans=0.5 2023-10-11 13:40:53,821 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 13:41:10,862 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=724476.6666666666, ans=0.125 2023-10-11 13:41:11,420 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.368e+02 1.712e+02 1.907e+02 2.232e+02 3.440e+02, threshold=3.814e+02, percent-clipped=0.0 2023-10-11 13:41:27,006 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.25 vs. 
limit=10.0 2023-10-11 13:41:28,814 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=724523.3333333334, ans=0.0 2023-10-11 13:41:32,932 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=724570.0, ans=0.125 2023-10-11 13:41:53,295 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=724663.3333333334, ans=0.0 2023-10-11 13:42:29,148 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=724756.6666666666, ans=0.0 2023-10-11 13:42:32,056 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=724803.3333333334, ans=0.2 2023-10-11 13:42:32,998 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=724803.3333333334, ans=0.1 2023-10-11 13:42:39,123 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.55 vs. limit=15.0 2023-10-11 13:42:41,061 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.31 vs. limit=15.0 2023-10-11 13:42:48,307 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=724850.0, ans=0.0 2023-10-11 13:42:57,665 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=724896.6666666666, ans=0.125 2023-10-11 13:43:05,058 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.333e+02 1.712e+02 1.870e+02 2.128e+02 2.757e+02, threshold=3.739e+02, percent-clipped=0.0 2023-10-11 13:43:19,704 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.62 vs. 
limit=15.0 2023-10-11 13:43:32,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=725036.6666666666, ans=0.125 2023-10-11 13:43:32,289 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=725036.6666666666, ans=0.125 2023-10-11 13:44:02,957 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=725176.6666666666, ans=0.05 2023-10-11 13:44:10,479 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=725223.3333333334, ans=0.125 2023-10-11 13:44:11,640 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=725223.3333333334, ans=0.125 2023-10-11 13:44:35,058 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=725316.6666666666, ans=0.0 2023-10-11 13:44:42,454 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=725363.3333333334, ans=0.0 2023-10-11 13:44:44,205 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=725363.3333333334, ans=0.125 2023-10-11 13:44:50,669 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.368e+02 1.675e+02 1.851e+02 2.061e+02 2.575e+02, threshold=3.702e+02, percent-clipped=0.0 2023-10-11 13:44:58,083 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=725410.0, ans=0.1 2023-10-11 13:44:58,854 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=725456.6666666666, ans=0.125 2023-10-11 13:45:00,806 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=725456.6666666666, ans=0.125 2023-10-11 13:45:06,042 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=725456.6666666666, ans=0.2 2023-10-11 13:45:39,760 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=725596.6666666666, ans=0.125 2023-10-11 13:45:59,248 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=725643.3333333334, ans=0.0 2023-10-11 13:46:08,775 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 13:46:17,896 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=725736.6666666666, ans=0.1 2023-10-11 13:46:33,352 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 13:46:52,491 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.368e+02 1.709e+02 1.890e+02 2.124e+02 3.310e+02, threshold=3.780e+02, percent-clipped=0.0 2023-10-11 13:47:05,803 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=725923.3333333334, ans=0.125 2023-10-11 13:47:16,639 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=725970.0, ans=0.2 2023-10-11 13:47:29,015 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.00 vs. limit=15.0 2023-10-11 13:47:33,886 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=726016.6666666666, ans=0.1 2023-10-11 13:47:58,838 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=726156.6666666666, ans=0.125 2023-10-11 13:48:12,929 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.69 vs. limit=6.0 2023-10-11 13:48:22,977 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.68 vs. limit=10.0 2023-10-11 13:48:47,749 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=726343.3333333334, ans=0.1 2023-10-11 13:48:49,478 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.716e+02 1.864e+02 2.153e+02 2.879e+02, threshold=3.729e+02, percent-clipped=0.0 2023-10-11 13:49:14,506 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.72 vs. limit=15.0 2023-10-11 13:49:19,243 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=726483.3333333334, ans=0.125 2023-10-11 13:49:42,623 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.74 vs. limit=15.0 2023-10-11 13:49:53,170 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=726623.3333333334, ans=0.125 2023-10-11 13:49:53,836 INFO [train.py:1031] (0/4) Epoch 12, batch 5500, loss[loss=0.2021, simple_loss=0.2864, pruned_loss=0.05884, over 16859.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.2902, pruned_loss=0.05663, over 30704004.64 frames. ], batch size: 116, lr: 2.97e-03, grad_scale: 32.0 2023-10-11 13:49:55,243 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.20 vs. limit=10.0 2023-10-11 13:49:56,337 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.94 vs. 
limit=12.0 2023-10-11 13:50:07,070 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=726670.0, ans=0.125 2023-10-11 13:50:10,054 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=726670.0, ans=0.125 2023-10-11 13:50:11,432 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=726670.0, ans=0.1 2023-10-11 13:50:14,250 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=726670.0, ans=0.2 2023-10-11 13:50:28,963 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 13:50:31,901 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=726763.3333333334, ans=0.125 2023-10-11 13:50:37,906 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=726810.0, ans=0.0 2023-10-11 13:50:39,354 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.303e+02 1.627e+02 1.796e+02 2.033e+02 2.468e+02, threshold=3.593e+02, percent-clipped=0.0 2023-10-11 13:50:46,768 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=726856.6666666666, ans=0.1 2023-10-11 13:50:51,643 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=726856.6666666666, ans=0.1 2023-10-11 13:51:21,258 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=726996.6666666666, ans=0.125 2023-10-11 13:51:22,147 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 13:51:34,214 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=727043.3333333334, ans=0.1 2023-10-11 13:51:41,132 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=727043.3333333334, ans=0.125 2023-10-11 13:51:43,998 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=727090.0, ans=0.1 2023-10-11 13:51:58,041 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=727136.6666666666, ans=0.04949747468305833 2023-10-11 13:52:04,045 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=727136.6666666666, ans=0.2 2023-10-11 13:52:19,791 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=727230.0, ans=0.0 2023-10-11 13:52:32,202 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.736e+02 1.937e+02 2.198e+02 4.352e+02, threshold=3.873e+02, percent-clipped=2.0 2023-10-11 13:52:40,625 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=727276.6666666666, ans=0.2 2023-10-11 13:53:11,791 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=727416.6666666666, ans=0.1 2023-10-11 13:53:34,499 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=727510.0, ans=0.0 2023-10-11 13:54:03,869 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=727650.0, ans=0.125 2023-10-11 13:54:04,079 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.35 vs. limit=10.0 2023-10-11 13:54:14,433 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.92 vs. limit=15.0 2023-10-11 13:54:21,185 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. limit=6.0 2023-10-11 13:54:27,670 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.697e+02 1.846e+02 2.049e+02 3.130e+02, threshold=3.692e+02, percent-clipped=0.0 2023-10-11 13:54:36,552 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=727790.0, ans=0.1 2023-10-11 13:55:04,130 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.81 vs. limit=5.0 2023-10-11 13:55:08,992 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=727883.3333333334, ans=0.0 2023-10-11 13:55:41,355 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=728023.3333333334, ans=0.0 2023-10-11 13:55:41,617 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.22 vs. 
limit=15.0 2023-10-11 13:55:54,886 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=728070.0, ans=0.0 2023-10-11 13:55:59,174 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=728116.6666666666, ans=0.125 2023-10-11 13:56:01,394 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=728116.6666666666, ans=0.125 2023-10-11 13:56:15,611 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=728163.3333333334, ans=0.125 2023-10-11 13:56:23,868 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.477e+02 1.782e+02 1.966e+02 2.200e+02 3.503e+02, threshold=3.933e+02, percent-clipped=0.0 2023-10-11 13:56:44,302 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=728303.3333333334, ans=0.0 2023-10-11 13:56:45,218 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-11 13:56:45,973 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=728303.3333333334, ans=0.0 2023-10-11 13:56:51,450 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.11 vs. limit=10.0 2023-10-11 13:57:04,846 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.44 vs. limit=22.5 2023-10-11 13:57:07,164 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.08 vs. limit=15.0 2023-10-11 13:57:24,154 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=728443.3333333334, ans=0.125 2023-10-11 13:57:30,747 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=728490.0, ans=0.125 2023-10-11 13:57:30,766 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=728490.0, ans=0.0 2023-10-11 13:57:48,958 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.66 vs. limit=10.0 2023-10-11 13:58:15,556 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=728630.0, ans=0.125 2023-10-11 13:58:20,209 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.303e+02 1.595e+02 1.778e+02 1.991e+02 2.737e+02, threshold=3.556e+02, percent-clipped=0.0 2023-10-11 13:58:22,259 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.53 vs. 
limit=15.0 2023-10-11 13:58:22,674 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=728676.6666666666, ans=0.125 2023-10-11 13:58:52,501 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=728816.6666666666, ans=0.1 2023-10-11 13:58:55,994 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.49 vs. limit=15.0 2023-10-11 13:59:11,927 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=728910.0, ans=0.0 2023-10-11 13:59:15,450 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=728910.0, ans=0.125 2023-10-11 13:59:22,974 INFO [train.py:1031] (0/4) Epoch 12, batch 6000, loss[loss=0.2231, simple_loss=0.3047, pruned_loss=0.07074, over 16753.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.2903, pruned_loss=0.05661, over 31163324.18 frames. ], batch size: 188, lr: 2.97e-03, grad_scale: 32.0 2023-10-11 13:59:34,519 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.82 vs. limit=22.5 2023-10-11 13:59:44,829 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=729003.3333333334, ans=0.125 2023-10-11 13:59:44,996 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.01 vs. limit=22.5 2023-10-11 13:59:51,641 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=729050.0, ans=0.125 2023-10-11 13:59:51,680 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=729050.0, ans=0.125 2023-10-11 14:00:03,413 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.89 vs. limit=15.0 2023-10-11 14:00:12,156 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 1.865e+02 2.197e+02 2.522e+02 3.719e+02, threshold=4.393e+02, percent-clipped=1.0 2023-10-11 14:00:12,928 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_na.min_abs, batch_count=729143.3333333334, ans=0.02 2023-10-11 14:00:48,442 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.90 vs. limit=22.5 2023-10-11 14:00:54,903 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.96 vs. limit=15.0 2023-10-11 14:01:02,962 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 14:01:03,819 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=729376.6666666666, ans=0.2 2023-10-11 14:01:06,404 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.79 vs. 
limit=15.0 2023-10-11 14:01:09,094 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=729376.6666666666, ans=0.1 2023-10-11 14:01:14,885 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.48 vs. limit=12.0 2023-10-11 14:01:38,699 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.33 vs. limit=22.5 2023-10-11 14:01:43,320 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=729516.6666666666, ans=0.0 2023-10-11 14:01:52,360 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.32 vs. limit=10.0 2023-10-11 14:02:04,153 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.315e+02 1.709e+02 1.866e+02 2.170e+02 3.352e+02, threshold=3.732e+02, percent-clipped=0.0 2023-10-11 14:02:18,484 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=729656.6666666666, ans=0.125 2023-10-11 14:02:33,421 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=729750.0, ans=0.125 2023-10-11 14:02:58,755 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=729843.3333333334, ans=0.125 2023-10-11 14:03:16,039 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.24 vs. limit=22.5 2023-10-11 14:03:21,119 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=729936.6666666666, ans=0.125 2023-10-11 14:03:27,118 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=729936.6666666666, ans=0.07 2023-10-11 14:03:35,776 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.32 vs. limit=22.5 2023-10-11 14:03:57,560 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.754e+02 2.000e+02 2.274e+02 3.192e+02, threshold=4.001e+02, percent-clipped=0.0 2023-10-11 14:04:12,251 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=730123.3333333334, ans=0.125 2023-10-11 14:04:12,454 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.48 vs. limit=10.0 2023-10-11 14:04:13,401 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.39 vs. limit=15.0 2023-10-11 14:04:14,252 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=730170.0, ans=0.0 2023-10-11 14:04:51,467 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.60 vs. 
limit=15.0 2023-10-11 14:05:03,312 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=730356.6666666666, ans=0.025 2023-10-11 14:05:06,981 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=730356.6666666666, ans=0.0 2023-10-11 14:05:13,965 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=730403.3333333334, ans=0.125 2023-10-11 14:05:22,787 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=730450.0, ans=0.2 2023-10-11 14:05:33,560 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=730450.0, ans=0.125 2023-10-11 14:05:52,573 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.833e+02 1.980e+02 2.227e+02 3.223e+02, threshold=3.959e+02, percent-clipped=0.0 2023-10-11 14:05:52,775 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=730543.3333333334, ans=0.125 2023-10-11 14:05:55,344 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=730543.3333333334, ans=0.2 2023-10-11 14:06:21,847 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 14:06:32,510 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.42 vs. limit=15.0 2023-10-11 14:06:50,085 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff3.min_abs, batch_count=730730.0, ans=0.2 2023-10-11 14:06:59,791 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=730776.6666666666, ans=0.125 2023-10-11 14:07:04,456 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=730823.3333333334, ans=0.1 2023-10-11 14:07:13,816 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=730823.3333333334, ans=0.125 2023-10-11 14:07:15,765 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=730870.0, ans=0.125 2023-10-11 14:07:31,246 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=730916.6666666666, ans=0.05 2023-10-11 14:07:37,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=730963.3333333334, ans=0.125 2023-10-11 14:07:40,566 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=730963.3333333334, ans=0.1 2023-10-11 14:07:52,807 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.359e+02 1.607e+02 1.747e+02 1.965e+02 2.568e+02, threshold=3.494e+02, percent-clipped=0.0 2023-10-11 14:08:01,938 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=731056.6666666666, ans=0.125 2023-10-11 14:08:02,915 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=731056.6666666666, ans=0.0 2023-10-11 14:08:35,245 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=731196.6666666666, ans=0.125 2023-10-11 14:08:56,477 INFO [train.py:1031] (0/4) Epoch 12, batch 6500, loss[loss=0.1924, simple_loss=0.2848, pruned_loss=0.05, over 16868.00 frames. ], tot_loss[loss=0.2022, simple_loss=0.2907, pruned_loss=0.0568, over 31506683.42 frames. ], batch size: 72, lr: 2.96e-03, grad_scale: 32.0 2023-10-11 14:09:02,979 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=731290.0, ans=0.0 2023-10-11 14:09:03,012 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=731290.0, ans=0.2 2023-10-11 14:09:25,861 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.88 vs. limit=15.0 2023-10-11 14:09:55,786 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.390e+02 1.783e+02 1.995e+02 2.190e+02 3.130e+02, threshold=3.991e+02, percent-clipped=0.0 2023-10-11 14:10:04,664 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=731523.3333333334, ans=0.125 2023-10-11 14:10:31,661 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=731616.6666666666, ans=0.125 2023-10-11 14:10:33,198 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=731616.6666666666, ans=0.2 2023-10-11 14:10:48,954 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=731710.0, ans=0.09899494936611666 2023-10-11 14:10:50,894 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=731710.0, ans=0.125 2023-10-11 14:10:54,075 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=731710.0, ans=0.125 2023-10-11 14:10:56,883 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=731710.0, ans=0.125 2023-10-11 14:10:57,800 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=731710.0, ans=0.125 2023-10-11 14:11:01,651 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=731756.6666666666, ans=0.125 2023-10-11 14:11:10,630 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=731803.3333333334, ans=0.125 2023-10-11 14:11:25,603 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=731850.0, ans=0.125 2023-10-11 14:11:31,362 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=731850.0, ans=0.0 2023-10-11 14:11:35,818 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=731896.6666666666, ans=0.125 2023-10-11 14:11:46,875 INFO [scaling.py:979] (0/4) Whitening: 
name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.12 vs. limit=15.0 2023-10-11 14:11:47,369 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 1.706e+02 1.869e+02 2.128e+02 3.077e+02, threshold=3.738e+02, percent-clipped=0.0 2023-10-11 14:11:49,000 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=6.04 vs. limit=12.0 2023-10-11 14:11:50,782 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=731943.3333333334, ans=0.0 2023-10-11 14:12:11,352 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=732036.6666666666, ans=0.125 2023-10-11 14:12:21,675 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=732083.3333333334, ans=0.1 2023-10-11 14:12:27,322 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 14:12:32,338 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=732130.0, ans=0.1 2023-10-11 14:12:43,042 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=732176.6666666666, ans=0.125 2023-10-11 14:13:06,231 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.98 vs. limit=15.0 2023-10-11 14:13:09,924 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.86 vs. limit=22.5 2023-10-11 14:13:17,563 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=732316.6666666666, ans=0.125 2023-10-11 14:13:35,339 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.366e+02 1.703e+02 1.919e+02 2.239e+02 3.995e+02, threshold=3.839e+02, percent-clipped=1.0 2023-10-11 14:13:53,272 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=732456.6666666666, ans=0.125 2023-10-11 14:14:12,224 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=732550.0, ans=0.2 2023-10-11 14:14:17,072 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.39 vs. limit=15.0 2023-10-11 14:14:38,438 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=732643.3333333334, ans=0.125 2023-10-11 14:14:48,782 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=732690.0, ans=0.125 2023-10-11 14:15:09,997 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.93 vs. limit=15.0 2023-10-11 14:15:24,222 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.14 vs. 
limit=15.0 2023-10-11 14:15:24,893 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=732783.3333333334, ans=0.125 2023-10-11 14:15:40,110 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=732876.6666666666, ans=0.0 2023-10-11 14:15:42,531 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.261e+02 1.617e+02 1.881e+02 2.174e+02 3.473e+02, threshold=3.761e+02, percent-clipped=0.0 2023-10-11 14:15:58,768 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=732923.3333333334, ans=0.125 2023-10-11 14:16:27,833 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=733063.3333333334, ans=0.125 2023-10-11 14:16:30,780 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=733063.3333333334, ans=0.125 2023-10-11 14:16:37,421 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.24 vs. limit=15.0 2023-10-11 14:16:50,944 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.83 vs. limit=22.5 2023-10-11 14:16:58,477 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=733203.3333333334, ans=0.1 2023-10-11 14:17:02,602 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=733203.3333333334, ans=0.125 2023-10-11 14:17:09,667 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=733203.3333333334, ans=0.05 2023-10-11 14:17:37,186 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.679e+02 1.827e+02 2.286e+02 3.883e+02, threshold=3.654e+02, percent-clipped=1.0 2023-10-11 14:17:41,574 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=733343.3333333334, ans=0.125 2023-10-11 14:17:56,716 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=733436.6666666666, ans=0.2 2023-10-11 14:18:12,165 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=733483.3333333334, ans=0.125 2023-10-11 14:18:33,012 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=733576.6666666666, ans=0.2 2023-10-11 14:18:35,390 INFO [train.py:1031] (0/4) Epoch 12, batch 7000, loss[loss=0.2156, simple_loss=0.301, pruned_loss=0.06515, over 16579.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.2909, pruned_loss=0.05649, over 31797782.56 frames. ], batch size: 56, lr: 2.96e-03, grad_scale: 32.0 2023-10-11 14:18:44,079 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=733623.3333333334, ans=0.125 2023-10-11 14:18:44,472 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.42 vs. 
limit=22.5 2023-10-11 14:18:56,399 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-10-11 14:19:18,398 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.63 vs. limit=10.0 2023-10-11 14:19:26,629 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.756e+02 1.979e+02 2.194e+02 3.652e+02, threshold=3.958e+02, percent-clipped=1.0 2023-10-11 14:19:29,136 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.96 vs. limit=12.0 2023-10-11 14:19:44,261 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.53 vs. limit=22.5 2023-10-11 14:19:52,712 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=733950.0, ans=0.025 2023-10-11 14:19:53,684 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.59 vs. limit=12.0 2023-10-11 14:19:55,296 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=733950.0, ans=0.125 2023-10-11 14:20:20,817 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=734043.3333333334, ans=0.125 2023-10-11 14:20:25,521 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=734090.0, ans=0.125 2023-10-11 14:20:34,562 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=734136.6666666666, ans=0.125 2023-10-11 14:20:46,940 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=734183.3333333334, ans=0.125 2023-10-11 14:20:51,810 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=734183.3333333334, ans=0.125 2023-10-11 14:21:15,514 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.703e+02 1.906e+02 2.130e+02 2.959e+02, threshold=3.812e+02, percent-clipped=0.0 2023-10-11 14:21:18,699 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=734276.6666666666, ans=0.125 2023-10-11 14:21:19,427 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=734276.6666666666, ans=0.125 2023-10-11 14:21:20,413 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=734323.3333333334, ans=0.125 2023-10-11 14:21:38,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=734370.0, ans=0.125 2023-10-11 14:21:41,816 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=734416.6666666666, ans=0.0 2023-10-11 14:21:43,895 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=734416.6666666666, ans=0.125 
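A note on reading the recurring [optim.py:471] entries: the five numbers after "grad-norm quartiles" are presumably the 0/25/50/75/100th percentiles of recently observed gradient norms, and in the entries here the reported threshold always equals Clipping_scale times the median (for example, 2.0 x 1.979e+02 = 3.958e+02 in the 14:19:26 entry just above). The following is a minimal sketch of that bookkeeping, assuming a simple rolling window of per-step gradient norms; the class name GradNormTracker, the window size, and the cumulative percent-clipped accounting are assumptions for illustration, not the actual icefall optim.py implementation.

    import numpy as np

    class GradNormTracker:
        # Hypothetical tracker mirroring the [optim.py:471] log lines:
        # quartiles of recent grad norms, threshold = clipping_scale * median,
        # and the fraction of steps whose norm exceeded that threshold.

        def __init__(self, clipping_scale: float = 2.0, window: int = 128):
            self.clipping_scale = clipping_scale
            self.window = window
            self.norms: list[float] = []
            self.num_clipped = 0
            self.num_steps = 0

        def update(self, grad_norm: float) -> float:
            # Record one step's gradient norm and return the clipping threshold.
            self.norms.append(grad_norm)
            if len(self.norms) > self.window:
                self.norms.pop(0)
            threshold = self.clipping_scale * float(np.median(self.norms))
            self.num_steps += 1
            if grad_norm > threshold:
                self.num_clipped += 1
            return threshold

        def summary(self) -> str:
            # Format one line in the same shape as the [optim.py:471] entries.
            q = np.percentile(self.norms, [0, 25, 50, 75, 100])
            threshold = self.clipping_scale * float(q[2])
            pct = 100.0 * self.num_clipped / max(1, self.num_steps)
            quartiles = " ".join(f"{v:.3e}" for v in q)
            return (f"Clipping_scale={self.clipping_scale}, grad-norm quartiles "
                    f"{quartiles}, threshold={threshold:.3e}, percent-clipped={pct:.1f}")

Feeding such a tracker the per-step norms would reproduce lines like threshold=3.958e+02 whenever the running median sits at 1.979e+02, and percent-clipped rises only when a step's norm exceeds twice that median, as in the 13:52:32 entry above where the observed maximum 4.352e+02 crossed its 3.873e+02 threshold.

The [scaling.py:199] "ScheduledFloat: name=..., batch_count=..., ans=..." entries likewise record hyperparameters (dropout probabilities, skip rates, balancer probabilities) whose current value ans depends on the global batch count: by batch_count ~724000 the various *_skip_rate entries here read ans=0.0 while the out_proj.dropout_p entries read ans=0.1. A piecewise-linear schedule over batch count is one plausible mechanism, sketched below; the breakpoints are illustrative and do not come from the actual scaling.py schedules.

    class PiecewiseLinearFloat:
        # Hypothetical stand-in for a scheduled float: linearly interpolates
        # between (batch_count, value) breakpoints, clamping at both ends.

        def __init__(self, *points: tuple[float, float]):
            self.points = sorted(points)  # sorted by batch_count

        def value(self, batch_count: float) -> float:
            pts = self.points
            if batch_count <= pts[0][0]:
                return pts[0][1]
            if batch_count >= pts[-1][0]:
                return pts[-1][1]
            for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
                if x0 <= batch_count <= x1:
                    # linear interpolation between the two breakpoints
                    t = (batch_count - x0) / (x1 - x0)
                    return y0 + t * (y1 - y0)
            return pts[-1][1]  # unreachable given the clamps above

    # Illustrative schedule: a skip rate that decays to 0.0 early in training,
    # consistent with the ans=0.0 skip-rate entries at batch_count ~724000 here.
    attention_skip_rate = PiecewiseLinearFloat((0.0, 0.2), (4000.0, 0.05), (16000.0, 0.0))
    assert attention_skip_rate.value(0.0) == 0.2
    assert attention_skip_rate.value(724523.0) == 0.0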
2023-10-11 14:21:46,173 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.69 vs. limit=15.0 2023-10-11 14:22:06,780 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=734510.0, ans=0.0 2023-10-11 14:22:35,661 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=734603.3333333334, ans=0.125 2023-10-11 14:22:38,906 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=734603.3333333334, ans=0.1 2023-10-11 14:22:45,495 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=734603.3333333334, ans=0.125 2023-10-11 14:22:55,148 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.01 vs. limit=22.5 2023-10-11 14:22:56,831 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.17 vs. limit=15.0 2023-10-11 14:23:17,344 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.438e+02 1.732e+02 1.991e+02 2.166e+02 2.951e+02, threshold=3.983e+02, percent-clipped=0.0 2023-10-11 14:23:33,429 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=734790.0, ans=0.125 2023-10-11 14:23:34,483 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-11 14:23:40,292 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.41 vs. 
limit=12.0 2023-10-11 14:23:44,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=734836.6666666666, ans=0.125 2023-10-11 14:23:47,879 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=734883.3333333334, ans=0.125 2023-10-11 14:23:49,011 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=734883.3333333334, ans=0.0 2023-10-11 14:24:02,920 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=734930.0, ans=0.2 2023-10-11 14:24:10,684 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=734976.6666666666, ans=0.125 2023-10-11 14:25:04,603 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=735163.3333333334, ans=0.125 2023-10-11 14:25:06,077 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=735163.3333333334, ans=0.1 2023-10-11 14:25:15,107 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=735210.0, ans=0.125 2023-10-11 14:25:15,703 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.630e+02 1.776e+02 2.076e+02 3.236e+02, threshold=3.552e+02, percent-clipped=0.0 2023-10-11 14:25:26,392 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.33 vs. limit=15.0 2023-10-11 14:25:37,389 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=4.35 vs. limit=12.0 2023-10-11 14:25:38,542 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.23 vs. limit=10.0 2023-10-11 14:25:42,752 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=735303.3333333334, ans=0.1 2023-10-11 14:25:45,361 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=735350.0, ans=0.125 2023-10-11 14:25:52,502 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.66 vs. 
limit=6.0 2023-10-11 14:25:53,787 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=735350.0, ans=0.125 2023-10-11 14:25:59,109 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=735396.6666666666, ans=0.09899494936611666 2023-10-11 14:26:00,199 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=4.173e-02 2023-10-11 14:26:08,744 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=735443.3333333334, ans=0.125 2023-10-11 14:26:09,822 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=735443.3333333334, ans=0.2 2023-10-11 14:26:38,113 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=735536.6666666666, ans=0.0 2023-10-11 14:26:57,767 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=735630.0, ans=0.125 2023-10-11 14:26:58,090 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.25 vs. limit=15.0 2023-10-11 14:27:08,475 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.858e+02 2.131e+02 2.393e+02 3.396e+02, threshold=4.263e+02, percent-clipped=0.0 2023-10-11 14:27:09,668 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=735676.6666666666, ans=0.125 2023-10-11 14:27:32,799 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=735770.0, ans=0.125 2023-10-11 14:27:51,253 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=735863.3333333334, ans=0.0 2023-10-11 14:28:10,617 INFO [train.py:1031] (0/4) Epoch 12, batch 7500, loss[loss=0.2072, simple_loss=0.2941, pruned_loss=0.06016, over 16870.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.2907, pruned_loss=0.05658, over 31989008.96 frames. ], batch size: 116, lr: 2.95e-03, grad_scale: 32.0 2023-10-11 14:28:26,570 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=736003.3333333334, ans=0.125 2023-10-11 14:28:33,645 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=736050.0, ans=0.125 2023-10-11 14:28:37,626 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=736050.0, ans=0.0 2023-10-11 14:28:45,752 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=736096.6666666666, ans=0.125 2023-10-11 14:28:59,356 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.290e+02 1.764e+02 1.964e+02 2.248e+02 3.194e+02, threshold=3.927e+02, percent-clipped=0.0 2023-10-11 14:29:10,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=736190.0, ans=0.125 2023-10-11 14:29:10,691 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.64 vs. 
limit=22.5 2023-10-11 14:29:23,332 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=736236.6666666666, ans=0.125 2023-10-11 14:29:30,017 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.38 vs. limit=15.0 2023-10-11 14:29:39,718 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=736330.0, ans=0.1 2023-10-11 14:29:52,243 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=736376.6666666666, ans=0.125 2023-10-11 14:29:53,055 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=736376.6666666666, ans=0.125 2023-10-11 14:30:05,552 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=736423.3333333334, ans=0.0 2023-10-11 14:30:18,877 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.48 vs. limit=22.5 2023-10-11 14:31:02,995 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.350e+02 1.689e+02 1.866e+02 2.163e+02 2.917e+02, threshold=3.732e+02, percent-clipped=0.0 2023-10-11 14:31:04,111 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=736610.0, ans=0.125 2023-10-11 14:31:07,976 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=736610.0, ans=0.07 2023-10-11 14:31:36,656 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=736750.0, ans=0.1 2023-10-11 14:31:37,542 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=736750.0, ans=0.05 2023-10-11 14:31:49,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=736796.6666666666, ans=0.1 2023-10-11 14:32:15,154 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.51 vs. 
limit=12.0 2023-10-11 14:32:15,921 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=736936.6666666666, ans=0.125 2023-10-11 14:32:18,231 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=736936.6666666666, ans=0.125 2023-10-11 14:32:40,427 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=737030.0, ans=0.125 2023-10-11 14:32:54,674 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.666e+02 1.818e+02 2.033e+02 2.567e+02, threshold=3.636e+02, percent-clipped=0.0 2023-10-11 14:33:04,217 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=737123.3333333334, ans=0.125 2023-10-11 14:33:08,659 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=737123.3333333334, ans=0.1 2023-10-11 14:33:22,484 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=737216.6666666666, ans=0.0 2023-10-11 14:33:29,931 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=737216.6666666666, ans=0.0 2023-10-11 14:33:33,904 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=737263.3333333334, ans=0.0 2023-10-11 14:33:51,131 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.95 vs. limit=22.5 2023-10-11 14:33:56,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=737356.6666666666, ans=0.125 2023-10-11 14:34:08,909 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=737403.3333333334, ans=0.0 2023-10-11 14:34:12,288 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=737403.3333333334, ans=0.125 2023-10-11 14:34:20,418 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=737403.3333333334, ans=10.0 2023-10-11 14:34:53,571 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.740e+02 1.906e+02 2.061e+02 2.765e+02, threshold=3.813e+02, percent-clipped=0.0 2023-10-11 14:34:54,209 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.66 vs. 
limit=15.0 2023-10-11 14:35:09,654 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=737590.0, ans=0.1 2023-10-11 14:35:18,959 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=737636.6666666666, ans=0.2 2023-10-11 14:35:21,756 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=737683.3333333334, ans=0.1 2023-10-11 14:35:37,213 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=737730.0, ans=0.125 2023-10-11 14:35:59,938 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=737823.3333333334, ans=0.2 2023-10-11 14:36:01,053 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=737823.3333333334, ans=0.125 2023-10-11 14:36:34,582 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.50 vs. limit=22.5 2023-10-11 14:36:46,318 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.459e+02 1.635e+02 1.890e+02 2.195e+02 3.054e+02, threshold=3.779e+02, percent-clipped=0.0 2023-10-11 14:37:04,313 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=738103.3333333334, ans=0.0 2023-10-11 14:37:40,996 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=738243.3333333334, ans=0.1 2023-10-11 14:37:51,588 INFO [train.py:1031] (0/4) Epoch 12, batch 8000, loss[loss=0.1774, simple_loss=0.2785, pruned_loss=0.03813, over 16908.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.2903, pruned_loss=0.05613, over 32181042.27 frames. ], batch size: 77, lr: 2.95e-03, grad_scale: 32.0 2023-10-11 14:38:09,240 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.93 vs. limit=22.5 2023-10-11 14:38:16,336 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.37 vs. limit=15.0 2023-10-11 14:38:35,087 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.86 vs. limit=22.5 2023-10-11 14:38:39,756 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.350e+02 1.619e+02 1.745e+02 2.050e+02 3.654e+02, threshold=3.490e+02, percent-clipped=0.0 2023-10-11 14:38:44,752 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=738523.3333333334, ans=0.0 2023-10-11 14:38:47,496 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=738523.3333333334, ans=0.125 2023-10-11 14:38:58,654 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=738570.0, ans=0.05 2023-10-11 14:39:02,260 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.59 vs. 
limit=15.0 2023-10-11 14:39:15,186 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=738616.6666666666, ans=0.2 2023-10-11 14:39:15,589 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=738616.6666666666, ans=15.0 2023-10-11 14:39:32,821 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.31 vs. limit=15.0 2023-10-11 14:39:52,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=738803.3333333334, ans=0.2 2023-10-11 14:39:57,833 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=738803.3333333334, ans=0.125 2023-10-11 14:40:25,008 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.690e+02 1.907e+02 2.170e+02 3.150e+02, threshold=3.814e+02, percent-clipped=0.0 2023-10-11 14:40:39,559 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=738990.0, ans=0.0 2023-10-11 14:40:48,610 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=738990.0, ans=0.0 2023-10-11 14:41:05,053 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=739036.6666666666, ans=0.1 2023-10-11 14:41:07,651 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=739036.6666666666, ans=0.0 2023-10-11 14:41:07,703 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=739036.6666666666, ans=0.0 2023-10-11 14:42:19,768 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.39 vs. limit=15.0 2023-10-11 14:42:26,676 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.72 vs. limit=22.5 2023-10-11 14:42:30,926 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=739363.3333333334, ans=0.035 2023-10-11 14:42:33,931 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=739363.3333333334, ans=0.1 2023-10-11 14:42:38,165 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=739410.0, ans=0.0 2023-10-11 14:42:39,742 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.328e+02 1.688e+02 1.880e+02 2.176e+02 3.288e+02, threshold=3.761e+02, percent-clipped=0.0 2023-10-11 14:42:42,200 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=739410.0, ans=0.0 2023-10-11 14:42:54,688 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=739456.6666666666, ans=0.1 2023-10-11 14:43:10,761 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.85 vs. 
limit=22.5 2023-10-11 14:43:16,287 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=739550.0, ans=0.1 2023-10-11 14:43:20,415 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=739596.6666666666, ans=0.0 2023-10-11 14:43:37,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=739643.3333333334, ans=0.2 2023-10-11 14:43:41,340 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.64 vs. limit=15.0 2023-10-11 14:43:56,777 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=739736.6666666666, ans=0.0 2023-10-11 14:44:03,814 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 14:44:07,970 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=739783.3333333334, ans=0.2 2023-10-11 14:44:10,193 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=739783.3333333334, ans=0.1 2023-10-11 14:44:16,615 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.61 vs. limit=22.5 2023-10-11 14:44:32,993 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.440e+02 1.791e+02 2.007e+02 2.336e+02 3.042e+02, threshold=4.015e+02, percent-clipped=0.0 2023-10-11 14:44:38,544 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=739876.6666666666, ans=0.125 2023-10-11 14:44:41,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=739923.3333333334, ans=0.125 2023-10-11 14:44:49,535 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=739923.3333333334, ans=0.1 2023-10-11 14:44:54,197 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=739970.0, ans=0.0 2023-10-11 14:45:09,875 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=740016.6666666666, ans=0.07 2023-10-11 14:45:27,517 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=740110.0, ans=10.0 2023-10-11 14:46:08,676 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=740250.0, ans=0.1 2023-10-11 14:46:32,948 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.682e+02 1.826e+02 2.033e+02 3.552e+02, threshold=3.652e+02, percent-clipped=0.0 2023-10-11 14:46:33,674 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.44 vs. 
limit=5.0 2023-10-11 14:46:43,915 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=740390.0, ans=0.0 2023-10-11 14:46:47,859 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=740390.0, ans=0.125 2023-10-11 14:46:59,251 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=740436.6666666666, ans=0.0 2023-10-11 14:47:01,139 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=740483.3333333334, ans=0.05 2023-10-11 14:47:01,957 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=740483.3333333334, ans=0.125 2023-10-11 14:47:19,193 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=740530.0, ans=0.125 2023-10-11 14:47:33,306 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=740576.6666666666, ans=0.0 2023-10-11 14:47:40,907 INFO [train.py:1031] (0/4) Epoch 12, batch 8500, loss[loss=0.2091, simple_loss=0.2986, pruned_loss=0.05979, over 16869.00 frames. ], tot_loss[loss=0.2012, simple_loss=0.2905, pruned_loss=0.05599, over 32299896.06 frames. ], batch size: 72, lr: 2.94e-03, grad_scale: 32.0 2023-10-11 14:47:53,315 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.88 vs. limit=22.5 2023-10-11 14:48:00,309 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 14:48:07,909 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=740716.6666666666, ans=0.1 2023-10-11 14:48:24,314 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.66 vs. limit=15.0 2023-10-11 14:48:26,070 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.96 vs. 
limit=12.0 2023-10-11 14:48:34,289 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.452e+02 1.746e+02 1.916e+02 2.116e+02 2.668e+02, threshold=3.832e+02, percent-clipped=0.0 2023-10-11 14:48:55,809 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=740903.3333333334, ans=0.125 2023-10-11 14:49:03,929 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=740950.0, ans=0.125 2023-10-11 14:49:05,032 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=740950.0, ans=0.025 2023-10-11 14:49:12,281 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=740950.0, ans=0.125 2023-10-11 14:49:16,393 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=740996.6666666666, ans=0.0 2023-10-11 14:49:35,150 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=741043.3333333334, ans=0.0 2023-10-11 14:49:43,222 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=741090.0, ans=0.125 2023-10-11 14:50:04,697 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=741136.6666666666, ans=0.2 2023-10-11 14:50:04,937 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.19 vs. limit=15.0 2023-10-11 14:50:11,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=741183.3333333334, ans=0.1 2023-10-11 14:50:15,101 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=741183.3333333334, ans=15.0 2023-10-11 14:50:15,531 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 14:50:21,044 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=741230.0, ans=0.0 2023-10-11 14:50:26,238 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=741230.0, ans=0.125 2023-10-11 14:50:35,415 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.314e+02 1.665e+02 1.846e+02 2.084e+02 3.076e+02, threshold=3.692e+02, percent-clipped=0.0 2023-10-11 14:50:42,499 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=741323.3333333334, ans=0.1 2023-10-11 14:50:54,122 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=741370.0, ans=0.0 2023-10-11 14:50:54,427 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.99 vs. 
limit=6.0 2023-10-11 14:51:02,383 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=741370.0, ans=0.125 2023-10-11 14:51:16,862 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=741416.6666666666, ans=0.0 2023-10-11 14:51:49,527 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 14:51:50,649 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=741556.6666666666, ans=0.09899494936611666 2023-10-11 14:51:52,024 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.95 vs. limit=6.0 2023-10-11 14:52:01,658 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=741603.3333333334, ans=0.125 2023-10-11 14:52:14,422 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.82 vs. limit=15.0 2023-10-11 14:52:18,698 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=741650.0, ans=0.125 2023-10-11 14:52:21,970 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=6.37 vs. limit=15.0 2023-10-11 14:52:25,990 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=741696.6666666666, ans=0.2 2023-10-11 14:52:38,613 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.334e+02 1.617e+02 1.789e+02 2.064e+02 3.014e+02, threshold=3.579e+02, percent-clipped=0.0 2023-10-11 14:52:42,933 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=741790.0, ans=0.0 2023-10-11 14:52:43,742 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=741790.0, ans=0.125 2023-10-11 14:52:45,216 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=741790.0, ans=0.07 2023-10-11 14:52:54,016 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=741790.0, ans=0.1 2023-10-11 14:52:54,273 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.09 vs. limit=15.0 2023-10-11 14:53:22,854 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=741930.0, ans=0.1 2023-10-11 14:53:23,753 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=741930.0, ans=0.0 2023-10-11 14:54:21,219 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.62 vs. 
limit=10.0 2023-10-11 14:54:32,768 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.350e+02 1.728e+02 1.948e+02 2.312e+02 3.393e+02, threshold=3.895e+02, percent-clipped=0.0 2023-10-11 14:54:45,279 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=742256.6666666666, ans=0.125 2023-10-11 14:54:47,345 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=742303.3333333334, ans=0.125 2023-10-11 14:54:47,402 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=742303.3333333334, ans=0.2 2023-10-11 14:54:56,002 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=742303.3333333334, ans=0.125 2023-10-11 14:54:58,786 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=742350.0, ans=0.025 2023-10-11 14:55:39,009 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.95 vs. limit=15.0 2023-10-11 14:55:48,637 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=742536.6666666666, ans=0.0 2023-10-11 14:56:02,280 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.94 vs. limit=15.0 2023-10-11 14:56:07,812 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=742630.0, ans=0.125 2023-10-11 14:56:16,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=742676.6666666666, ans=0.125 2023-10-11 14:56:16,266 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=742676.6666666666, ans=0.0 2023-10-11 14:56:21,815 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.474e+02 1.680e+02 1.806e+02 2.017e+02 3.495e+02, threshold=3.612e+02, percent-clipped=0.0 2023-10-11 14:56:31,528 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.09 vs. limit=15.0 2023-10-11 14:56:33,289 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=742723.3333333334, ans=0.0 2023-10-11 14:56:33,397 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.31 vs. 
limit=15.0 2023-10-11 14:56:34,216 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=742723.3333333334, ans=0.95 2023-10-11 14:56:46,570 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=742770.0, ans=0.09899494936611666 2023-10-11 14:57:01,392 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=742863.3333333334, ans=0.1 2023-10-11 14:57:03,844 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=742863.3333333334, ans=0.125 2023-10-11 14:57:09,835 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=742863.3333333334, ans=0.125 2023-10-11 14:57:23,376 INFO [train.py:1031] (0/4) Epoch 12, batch 9000, loss[loss=0.208, simple_loss=0.2985, pruned_loss=0.0587, over 16943.00 frames. ], tot_loss[loss=0.2008, simple_loss=0.29, pruned_loss=0.05586, over 32393934.04 frames. ], batch size: 123, lr: 2.94e-03, grad_scale: 32.0 2023-10-11 14:57:29,653 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=742956.6666666666, ans=0.0 2023-10-11 14:57:34,953 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=743003.3333333334, ans=0.0 2023-10-11 14:57:48,491 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=743050.0, ans=0.0 2023-10-11 14:58:10,922 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=743143.3333333334, ans=0.05 2023-10-11 14:58:12,684 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.329e+02 1.730e+02 1.950e+02 2.117e+02 2.843e+02, threshold=3.900e+02, percent-clipped=0.0 2023-10-11 14:58:16,018 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=743143.3333333334, ans=0.2 2023-10-11 14:58:17,133 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=743143.3333333334, ans=0.125 2023-10-11 14:58:24,336 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=743190.0, ans=0.125 2023-10-11 14:58:30,827 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.32 vs. 
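[Annotation] Each [train.py:1031] summary pairs the current batch's loss with a tot_loss averaged over a running frame count. That frame count grows by only about 0.2M per 1000 batches (32.18M at batch 8000 above, 32.39M at batch 9000), far slower than a plain cumulative sum would, which suggests a decayed, frame-weighted running average. The decay rule below is an assumption chosen to illustrate the effect, not the bookkeeping actually used in train.py.

```python
# Hypothetical decayed, frame-weighted running average. The decay value
# is an assumption, chosen only to show why the tot_loss frame count
# saturates instead of growing linearly; train.py may differ.
class RunningLoss:
    def __init__(self, decay: float = 0.9999):
        self.decay = decay
        self.loss_sum = 0.0   # decayed sum of loss * frames
        self.frames = 0.0     # decayed sum of frames

    def update(self, batch_loss: float, batch_frames: float) -> float:
        self.loss_sum = self.decay * self.loss_sum + batch_loss * batch_frames
        self.frames = self.decay * self.frames + batch_frames
        return self.loss_sum / self.frames  # the reported tot_loss

tracker = RunningLoss()
print(tracker.update(0.208, 16943.0))  # shaped like the batch 9000 entry
```

With roughly f frames per batch and decay d, the tracked frame count converges toward f / (1 - d), so the denominator stops growing linearly; a count near 32M with batches of several thousand frames implies d very close to 1, consistent with the slow drift in the logged totals.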
limit=15.0 2023-10-11 14:58:36,020 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=743236.6666666666, ans=0.0 2023-10-11 14:58:39,134 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=743236.6666666666, ans=0.0 2023-10-11 14:58:56,180 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=743330.0, ans=0.125 2023-10-11 14:59:27,871 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=743470.0, ans=0.1 2023-10-11 14:59:40,407 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=743516.6666666666, ans=0.125 2023-10-11 14:59:46,573 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=743563.3333333334, ans=0.0 2023-10-11 15:00:02,849 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.279e+02 1.674e+02 1.926e+02 2.088e+02 3.271e+02, threshold=3.852e+02, percent-clipped=0.0 2023-10-11 15:00:12,671 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.78 vs. limit=12.0 2023-10-11 15:00:14,318 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=743656.6666666666, ans=0.1 2023-10-11 15:00:21,761 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.25 vs. limit=15.0 2023-10-11 15:00:35,430 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=743750.0, ans=0.0 2023-10-11 15:00:58,588 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.63 vs. 
limit=12.0 2023-10-11 15:01:11,020 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=743936.6666666666, ans=0.0 2023-10-11 15:01:13,386 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=743936.6666666666, ans=0.1 2023-10-11 15:01:15,200 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=743936.6666666666, ans=0.0 2023-10-11 15:01:17,871 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=743936.6666666666, ans=0.125 2023-10-11 15:01:18,874 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=743983.3333333334, ans=0.125 2023-10-11 15:01:32,472 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=744030.0, ans=0.125 2023-10-11 15:01:46,102 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.347e+02 1.709e+02 1.889e+02 2.081e+02 3.061e+02, threshold=3.779e+02, percent-clipped=0.0 2023-10-11 15:01:53,310 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=744123.3333333334, ans=0.125 2023-10-11 15:01:59,109 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=744123.3333333334, ans=0.1 2023-10-11 15:01:59,123 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=744123.3333333334, ans=0.1 2023-10-11 15:02:03,614 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=744170.0, ans=0.1 2023-10-11 15:02:27,671 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=744263.3333333334, ans=0.07 2023-10-11 15:02:38,648 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=744310.0, ans=0.125 2023-10-11 15:03:33,348 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=744496.6666666666, ans=0.125 2023-10-11 15:03:34,709 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=744543.3333333334, ans=0.125 2023-10-11 15:03:36,841 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=744543.3333333334, ans=0.0 2023-10-11 15:03:40,497 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=744543.3333333334, ans=0.125 2023-10-11 15:03:41,085 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.496e+02 1.816e+02 2.254e+02 2.473e+02 3.370e+02, threshold=4.509e+02, percent-clipped=0.0 2023-10-11 15:03:43,142 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=744543.3333333334, ans=0.125 2023-10-11 15:03:53,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=744590.0, ans=0.125 2023-10-11 15:04:19,393 INFO [scaling.py:199] 
(0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=744683.3333333334, ans=0.125 2023-10-11 15:04:25,410 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 15:04:29,672 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=744730.0, ans=0.0 2023-10-11 15:04:55,181 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=744823.3333333334, ans=0.125 2023-10-11 15:05:02,842 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.44 vs. limit=12.0 2023-10-11 15:05:26,600 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.60 vs. limit=10.0 2023-10-11 15:05:40,608 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.213e+02 1.697e+02 1.864e+02 2.071e+02 3.050e+02, threshold=3.728e+02, percent-clipped=0.0 2023-10-11 15:05:45,580 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=745056.6666666666, ans=0.125 2023-10-11 15:06:08,297 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.90 vs. limit=15.0 2023-10-11 15:06:37,669 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=745243.3333333334, ans=0.125 2023-10-11 15:06:44,007 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.73 vs. limit=22.5 2023-10-11 15:06:46,615 INFO [train.py:1031] (0/4) Epoch 12, batch 9500, loss[loss=0.1803, simple_loss=0.2738, pruned_loss=0.04342, over 16311.00 frames. ], tot_loss[loss=0.2015, simple_loss=0.2907, pruned_loss=0.05614, over 32467413.19 frames. ], batch size: 50, lr: 2.93e-03, grad_scale: 32.0 2023-10-11 15:06:50,693 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=745290.0, ans=0.1 2023-10-11 15:06:56,277 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=745290.0, ans=0.0 2023-10-11 15:07:05,298 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.61 vs. 
limit=6.0 2023-10-11 15:07:11,556 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=745383.3333333334, ans=0.1 2023-10-11 15:07:38,609 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.417e+02 1.667e+02 1.825e+02 2.093e+02 2.768e+02, threshold=3.650e+02, percent-clipped=0.0 2023-10-11 15:07:42,926 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=745523.3333333334, ans=0.125 2023-10-11 15:07:45,820 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=745523.3333333334, ans=0.125 2023-10-11 15:08:08,833 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=745616.6666666666, ans=0.1 2023-10-11 15:08:43,996 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.48 vs. limit=10.0 2023-10-11 15:08:54,683 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=745803.3333333334, ans=0.1 2023-10-11 15:09:09,395 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=745850.0, ans=0.0 2023-10-11 15:09:10,395 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=745850.0, ans=15.0 2023-10-11 15:09:28,086 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.376e+02 1.753e+02 1.917e+02 2.218e+02 3.224e+02, threshold=3.834e+02, percent-clipped=0.0 2023-10-11 15:09:43,148 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=745990.0, ans=0.09899494936611666 2023-10-11 15:10:10,947 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=746130.0, ans=0.0 2023-10-11 15:10:13,083 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=746130.0, ans=0.0 2023-10-11 15:10:21,733 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=746176.6666666666, ans=0.125 2023-10-11 15:10:22,522 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=746176.6666666666, ans=0.1 2023-10-11 15:10:22,690 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=746176.6666666666, ans=0.0 2023-10-11 15:11:05,272 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 15:11:10,984 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=746363.3333333334, ans=0.125 2023-10-11 15:11:23,059 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.709e+02 1.859e+02 2.130e+02 3.270e+02, threshold=3.718e+02, percent-clipped=0.0 2023-10-11 15:11:39,376 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.95 vs. 
limit=12.0 2023-10-11 15:11:44,759 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=746503.3333333334, ans=0.125 2023-10-11 15:11:45,707 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=746503.3333333334, ans=0.125 2023-10-11 15:11:46,540 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=746503.3333333334, ans=0.125 2023-10-11 15:11:48,188 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=746503.3333333334, ans=0.0 2023-10-11 15:11:51,225 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=746550.0, ans=0.0 2023-10-11 15:11:57,450 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=746550.0, ans=0.2 2023-10-11 15:12:10,923 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.67 vs. limit=15.0 2023-10-11 15:12:11,740 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.99 vs. limit=15.0 2023-10-11 15:12:13,542 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=746643.3333333334, ans=0.0 2023-10-11 15:12:19,108 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-160000.pt 2023-10-11 15:12:40,371 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=746736.6666666666, ans=0.1 2023-10-11 15:13:05,822 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=746830.0, ans=0.125 2023-10-11 15:13:19,262 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.282e+02 1.717e+02 1.882e+02 2.097e+02 2.818e+02, threshold=3.764e+02, percent-clipped=0.0 2023-10-11 15:13:35,492 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=746970.0, ans=0.125 2023-10-11 15:13:55,582 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=747063.3333333334, ans=0.125 2023-10-11 15:14:05,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=747063.3333333334, ans=0.0 2023-10-11 15:14:13,669 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=747110.0, ans=0.1 2023-10-11 15:14:51,975 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=747296.6666666666, ans=0.5 2023-10-11 15:14:53,841 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=747296.6666666666, ans=0.0 2023-10-11 15:15:09,822 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.348e+02 1.614e+02 1.799e+02 1.919e+02 2.970e+02, threshold=3.599e+02, percent-clipped=0.0 2023-10-11 15:15:11,071 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=747343.3333333334, ans=0.125 2023-10-11 15:15:26,020 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=747436.6666666666, ans=0.0 2023-10-11 15:15:32,838 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.29 vs. limit=6.0 2023-10-11 15:15:42,909 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=747483.3333333334, ans=0.0 2023-10-11 15:15:43,820 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=747483.3333333334, ans=0.95 2023-10-11 15:15:54,375 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=747530.0, ans=0.1 2023-10-11 15:16:03,743 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=747576.6666666666, ans=0.125 2023-10-11 15:16:09,613 INFO [train.py:1031] (0/4) Epoch 12, batch 10000, loss[loss=0.2049, simple_loss=0.2986, pruned_loss=0.05558, over 16951.00 frames. ], tot_loss[loss=0.2007, simple_loss=0.2898, pruned_loss=0.05581, over 32525061.94 frames. ], batch size: 165, lr: 2.93e-03, grad_scale: 32.0 2023-10-11 15:16:25,526 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 15:16:27,573 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=747670.0, ans=0.0 2023-10-11 15:16:28,608 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=747670.0, ans=0.0 2023-10-11 15:16:32,254 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=747716.6666666666, ans=0.0 2023-10-11 15:16:34,000 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=747716.6666666666, ans=0.1 2023-10-11 15:16:39,892 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=747763.3333333334, ans=0.125 2023-10-11 15:16:59,423 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.507e+02 1.720e+02 1.904e+02 2.162e+02 2.851e+02, threshold=3.807e+02, percent-clipped=0.0 2023-10-11 15:17:01,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=747810.0, ans=10.0 2023-10-11 15:17:05,209 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.26 vs. 
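[Annotation] The [checkpoint.py:75] entry just above shows a checkpoint written to zipformer/exp_XL_bpe/checkpoint-160000.pt, named after the cumulative training batch index. A hedged sketch of that pattern follows; the save interval, dict fields, and function name are assumptions, not taken from icefall's checkpoint.py.

```python
# Hedged sketch of the implied checkpointing pattern: save every fixed
# number of training batches, with the filename keyed on the cumulative
# batch index. Interval and field names are assumptions.
from pathlib import Path
import torch

def maybe_save_checkpoint(model, optimizer, batch_idx_train, exp_dir):
    every_n = 8000  # assumed interval; 160000 is a multiple of it
    if batch_idx_train == 0 or batch_idx_train % every_n != 0:
        return
    state = {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "batch_idx_train": batch_idx_train,
    }
    torch.save(state, Path(exp_dir) / f"checkpoint-{batch_idx_train}.pt")
```

Naming checkpoints by cumulative batch index rather than by epoch makes mid-epoch resumption unambiguous, which fits a run where each epoch spans tens of thousands of batches.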
limit=22.5 2023-10-11 15:17:17,035 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=747903.3333333334, ans=0.125 2023-10-11 15:17:26,348 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=747950.0, ans=0.125 2023-10-11 15:17:48,379 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=747996.6666666666, ans=0.0 2023-10-11 15:17:49,336 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=748043.3333333334, ans=0.0 2023-10-11 15:17:54,988 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=748043.3333333334, ans=0.1 2023-10-11 15:18:00,943 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.57 vs. limit=15.0 2023-10-11 15:18:01,134 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.37 vs. limit=12.0 2023-10-11 15:18:12,605 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.18 vs. limit=12.0 2023-10-11 15:18:29,812 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=748183.3333333334, ans=0.2 2023-10-11 15:18:35,495 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=748230.0, ans=0.0 2023-10-11 15:18:41,395 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=748230.0, ans=0.125 2023-10-11 15:18:43,479 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=748230.0, ans=0.125 2023-10-11 15:18:46,175 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.23 vs. 
limit=15.0 2023-10-11 15:18:49,361 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=748276.6666666666, ans=0.125 2023-10-11 15:18:53,619 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.741e+02 2.039e+02 2.223e+02 2.984e+02, threshold=4.078e+02, percent-clipped=0.0 2023-10-11 15:20:04,154 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=748556.6666666666, ans=0.2 2023-10-11 15:20:06,876 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=748603.3333333334, ans=0.0 2023-10-11 15:20:09,452 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=748603.3333333334, ans=0.125 2023-10-11 15:20:48,847 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.402e+02 1.704e+02 1.834e+02 2.011e+02 2.765e+02, threshold=3.668e+02, percent-clipped=0.0 2023-10-11 15:20:49,213 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=748743.3333333334, ans=0.125 2023-10-11 15:20:52,102 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=748790.0, ans=0.125 2023-10-11 15:20:53,607 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 15:20:54,954 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=748790.0, ans=0.125 2023-10-11 15:21:01,525 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=748790.0, ans=0.125 2023-10-11 15:21:01,702 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.55 vs. limit=12.0 2023-10-11 15:21:04,636 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=748836.6666666666, ans=0.125 2023-10-11 15:21:05,555 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=748836.6666666666, ans=0.125 2023-10-11 15:21:09,448 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=748836.6666666666, ans=0.1 2023-10-11 15:21:41,516 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.54 vs. limit=22.5 2023-10-11 15:21:42,273 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=748976.6666666666, ans=0.1 2023-10-11 15:21:50,409 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=748976.6666666666, ans=0.125 2023-10-11 15:22:07,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=749070.0, ans=0.95 2023-10-11 15:22:19,944 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.57 vs. 
limit=15.0 2023-10-11 15:22:26,737 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=749163.3333333334, ans=0.125 2023-10-11 15:22:28,801 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=749163.3333333334, ans=0.125 2023-10-11 15:22:33,894 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.84 vs. limit=10.0 2023-10-11 15:22:44,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=749210.0, ans=0.125 2023-10-11 15:22:45,542 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.368e+02 1.656e+02 1.817e+02 2.081e+02 2.546e+02, threshold=3.634e+02, percent-clipped=0.0 2023-10-11 15:22:45,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=749210.0, ans=0.2 2023-10-11 15:22:46,969 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=749210.0, ans=0.09899494936611666 2023-10-11 15:22:56,108 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=749256.6666666666, ans=0.125 2023-10-11 15:23:46,094 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=749443.3333333334, ans=0.125 2023-10-11 15:24:04,110 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.27 vs. limit=12.0 2023-10-11 15:24:32,725 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=749630.0, ans=0.125 2023-10-11 15:24:35,106 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=749630.0, ans=0.125 2023-10-11 15:24:45,415 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.343e+02 1.648e+02 1.834e+02 2.097e+02 3.063e+02, threshold=3.668e+02, percent-clipped=0.0 2023-10-11 15:24:52,145 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.77 vs. limit=12.0 2023-10-11 15:25:15,449 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=749816.6666666666, ans=0.0 2023-10-11 15:25:19,336 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=749816.6666666666, ans=0.125 2023-10-11 15:25:26,157 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.07 vs. limit=15.0 2023-10-11 15:25:27,596 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.60 vs. 
limit=22.5 2023-10-11 15:25:30,373 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.whiten.whitening_limit, batch_count=749863.3333333334, ans=12.0 2023-10-11 15:25:31,759 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=749863.3333333334, ans=0.125 2023-10-11 15:25:44,955 INFO [train.py:1031] (0/4) Epoch 12, batch 10500, loss[loss=0.2184, simple_loss=0.3054, pruned_loss=0.06568, over 16443.00 frames. ], tot_loss[loss=0.2011, simple_loss=0.2904, pruned_loss=0.05592, over 32583917.53 frames. ], batch size: 266, lr: 2.92e-03, grad_scale: 32.0 2023-10-11 15:26:07,504 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=750050.0, ans=0.2 2023-10-11 15:26:21,724 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=750096.6666666666, ans=0.1 2023-10-11 15:26:33,640 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.377e+02 1.672e+02 1.918e+02 2.122e+02 3.044e+02, threshold=3.835e+02, percent-clipped=0.0 2023-10-11 15:27:07,368 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=750283.3333333334, ans=0.125 2023-10-11 15:27:09,560 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.58 vs. limit=22.5 2023-10-11 15:27:20,039 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=750330.0, ans=0.125 2023-10-11 15:27:21,124 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=750330.0, ans=0.125 2023-10-11 15:27:23,354 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.23 vs. 
limit=10.0 2023-10-11 15:27:27,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=750330.0, ans=0.025 2023-10-11 15:27:30,786 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=750376.6666666666, ans=0.5 2023-10-11 15:27:33,137 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=750376.6666666666, ans=0.035 2023-10-11 15:27:37,898 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=750376.6666666666, ans=0.09899494936611666 2023-10-11 15:28:01,517 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=750470.0, ans=0.1 2023-10-11 15:28:35,205 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.496e+02 1.725e+02 1.833e+02 2.012e+02 2.637e+02, threshold=3.667e+02, percent-clipped=0.0 2023-10-11 15:28:37,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=750656.6666666666, ans=0.125 2023-10-11 15:28:59,162 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=750750.0, ans=0.125 2023-10-11 15:29:04,041 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=750750.0, ans=0.2 2023-10-11 15:29:04,285 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.56 vs. limit=15.0 2023-10-11 15:29:14,972 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=750796.6666666666, ans=0.125 2023-10-11 15:29:18,767 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=750796.6666666666, ans=0.1 2023-10-11 15:29:22,592 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.76 vs. 
limit=22.5 2023-10-11 15:29:26,539 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=750843.3333333334, ans=0.125 2023-10-11 15:29:45,365 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 15:29:46,335 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=750936.6666666666, ans=0.125 2023-10-11 15:29:47,739 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=750936.6666666666, ans=0.0 2023-10-11 15:30:07,400 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 15:30:27,820 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.377e+02 1.664e+02 1.787e+02 1.988e+02 2.570e+02, threshold=3.575e+02, percent-clipped=0.0 2023-10-11 15:31:00,971 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=751216.6666666666, ans=0.2 2023-10-11 15:31:08,323 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.17 vs. limit=15.0 2023-10-11 15:31:09,106 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.87 vs. limit=22.5 2023-10-11 15:31:17,401 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.75 vs. limit=12.0 2023-10-11 15:31:34,697 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=751356.6666666666, ans=0.0 2023-10-11 15:31:40,545 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=751403.3333333334, ans=0.0 2023-10-11 15:32:04,377 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=751496.6666666666, ans=22.5 2023-10-11 15:32:09,886 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.42 vs. 
limit=15.0 2023-10-11 15:32:11,322 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=751543.3333333334, ans=0.125 2023-10-11 15:32:17,479 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.748e+02 1.895e+02 2.139e+02 3.390e+02, threshold=3.790e+02, percent-clipped=0.0 2023-10-11 15:32:33,291 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=751636.6666666666, ans=0.125 2023-10-11 15:32:59,495 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=751730.0, ans=0.125 2023-10-11 15:33:12,965 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=751776.6666666666, ans=0.04949747468305833 2023-10-11 15:33:14,611 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=751776.6666666666, ans=0.09899494936611666 2023-10-11 15:33:14,631 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=751776.6666666666, ans=0.0 2023-10-11 15:33:37,065 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.83 vs. limit=22.5 2023-10-11 15:33:45,292 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=751916.6666666666, ans=0.1 2023-10-11 15:33:52,041 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=751963.3333333334, ans=0.0 2023-10-11 15:34:00,184 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.44 vs. limit=10.0 2023-10-11 15:34:10,131 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.295e+02 1.580e+02 1.725e+02 1.875e+02 2.764e+02, threshold=3.450e+02, percent-clipped=0.0 2023-10-11 15:34:21,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=752056.6666666666, ans=0.125 2023-10-11 15:34:25,401 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=752103.3333333334, ans=0.0 2023-10-11 15:34:27,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=752103.3333333334, ans=0.125 2023-10-11 15:34:35,952 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=752150.0, ans=0.125 2023-10-11 15:34:44,440 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=752196.6666666666, ans=0.05 2023-10-11 15:34:49,029 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=752196.6666666666, ans=0.0 2023-10-11 15:35:04,677 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.65 vs. 
limit=15.0 2023-10-11 15:35:06,172 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=752290.0, ans=0.0 2023-10-11 15:35:06,823 INFO [train.py:1031] (0/4) Epoch 12, batch 11000, loss[loss=0.1936, simple_loss=0.2945, pruned_loss=0.04637, over 16980.00 frames. ], tot_loss[loss=0.2012, simple_loss=0.2904, pruned_loss=0.05599, over 32627230.53 frames. ], batch size: 93, lr: 2.92e-03, grad_scale: 32.0 2023-10-11 15:35:23,384 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=752336.6666666666, ans=0.1 2023-10-11 15:35:25,929 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=752336.6666666666, ans=0.125 2023-10-11 15:35:25,945 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=752336.6666666666, ans=0.125 2023-10-11 15:35:28,585 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=752383.3333333334, ans=0.125 2023-10-11 15:35:28,768 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=752383.3333333334, ans=0.125 2023-10-11 15:35:43,579 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=752430.0, ans=0.125 2023-10-11 15:35:52,527 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.67 vs. limit=12.0 2023-10-11 15:35:57,933 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.398e+02 1.799e+02 2.068e+02 2.258e+02 3.113e+02, threshold=4.137e+02, percent-clipped=0.0 2023-10-11 15:36:03,162 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=752523.3333333334, ans=0.125 2023-10-11 15:36:06,591 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=752523.3333333334, ans=0.125 2023-10-11 15:36:23,306 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 15:36:32,138 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=752616.6666666666, ans=0.0 2023-10-11 15:36:39,771 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.59 vs. limit=15.0 2023-10-11 15:36:50,484 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=752710.0, ans=0.0 2023-10-11 15:36:58,118 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=752710.0, ans=0.125 2023-10-11 15:37:15,368 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=752803.3333333334, ans=0.0 2023-10-11 15:37:31,558 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.24 vs. 
limit=15.0 2023-10-11 15:37:40,434 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=752896.6666666666, ans=0.5 2023-10-11 15:37:44,232 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=752896.6666666666, ans=0.2 2023-10-11 15:37:45,861 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=752896.6666666666, ans=0.125 2023-10-11 15:37:55,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=752943.3333333334, ans=0.2 2023-10-11 15:37:59,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=752943.3333333334, ans=0.125 2023-10-11 15:38:01,227 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=752943.3333333334, ans=0.2 2023-10-11 15:38:01,717 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.341e+02 1.667e+02 1.871e+02 2.163e+02 3.458e+02, threshold=3.742e+02, percent-clipped=0.0 2023-10-11 15:38:04,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=752990.0, ans=0.125 2023-10-11 15:38:09,187 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=752990.0, ans=0.125 2023-10-11 15:38:15,080 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=753036.6666666666, ans=0.1 2023-10-11 15:38:21,643 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=753036.6666666666, ans=0.0 2023-10-11 15:38:30,032 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=753083.3333333334, ans=0.1 2023-10-11 15:38:33,296 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=753083.3333333334, ans=0.0 2023-10-11 15:38:47,806 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=753176.6666666666, ans=0.2 2023-10-11 15:38:55,897 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=753176.6666666666, ans=0.95 2023-10-11 15:39:09,112 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=753223.3333333334, ans=0.1 2023-10-11 15:39:10,291 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.48 vs. 
limit=6.0 2023-10-11 15:39:12,881 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=753270.0, ans=0.2 2023-10-11 15:39:35,863 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=753363.3333333334, ans=0.1 2023-10-11 15:39:38,552 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=753363.3333333334, ans=0.125 2023-10-11 15:39:39,851 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.17 vs. limit=15.0 2023-10-11 15:39:47,518 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=753410.0, ans=0.125 2023-10-11 15:39:50,945 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.347e+02 1.640e+02 1.780e+02 1.959e+02 3.212e+02, threshold=3.560e+02, percent-clipped=0.0 2023-10-11 15:39:57,031 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=753456.6666666666, ans=0.125 2023-10-11 15:40:03,082 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=753456.6666666666, ans=0.1 2023-10-11 15:40:04,958 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=753503.3333333334, ans=0.04949747468305833 2023-10-11 15:40:06,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=753503.3333333334, ans=0.125 2023-10-11 15:40:19,351 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=753550.0, ans=0.5 2023-10-11 15:40:32,032 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=753596.6666666666, ans=0.125 2023-10-11 15:40:44,071 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.38 vs. limit=15.0 2023-10-11 15:40:44,996 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=753643.3333333334, ans=0.125 2023-10-11 15:40:48,546 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=753643.3333333334, ans=0.125 2023-10-11 15:41:14,770 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=753736.6666666666, ans=0.0 2023-10-11 15:41:26,675 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=753783.3333333334, ans=0.09899494936611666 2023-10-11 15:41:35,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=753830.0, ans=0.125 2023-10-11 15:41:50,319 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.358e+02 1.705e+02 1.886e+02 2.119e+02 3.337e+02, threshold=3.772e+02, percent-clipped=0.0 2023-10-11 15:41:56,129 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.50 vs. 
limit=15.0 2023-10-11 15:42:10,975 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=13.59 vs. limit=15.0 2023-10-11 15:42:29,565 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.59 vs. limit=15.0 2023-10-11 15:42:39,047 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=754110.0, ans=0.2 2023-10-11 15:42:52,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=754156.6666666666, ans=0.1 2023-10-11 15:43:20,556 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-11 15:43:29,727 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=754296.6666666666, ans=0.0 2023-10-11 15:43:30,065 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.77 vs. limit=22.5 2023-10-11 15:43:40,517 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=754343.3333333334, ans=0.2 2023-10-11 15:43:47,312 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.808e+02 2.019e+02 2.429e+02 3.277e+02, threshold=4.039e+02, percent-clipped=0.0 2023-10-11 15:43:58,070 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=754390.0, ans=0.09899494936611666 2023-10-11 15:43:59,009 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=754390.0, ans=0.125 2023-10-11 15:44:13,069 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=754483.3333333334, ans=0.125 2023-10-11 15:44:22,466 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=754483.3333333334, ans=0.0 2023-10-11 15:44:35,645 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.97 vs. limit=15.0 2023-10-11 15:44:46,870 INFO [train.py:1031] (0/4) Epoch 12, batch 11500, loss[loss=0.1912, simple_loss=0.282, pruned_loss=0.05019, over 16847.00 frames. ], tot_loss[loss=0.2007, simple_loss=0.29, pruned_loss=0.05577, over 32659575.60 frames. 
], batch size: 67, lr: 2.91e-03, grad_scale: 32.0 2023-10-11 15:45:00,900 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=754670.0, ans=0.2 2023-10-11 15:45:07,945 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=754670.0, ans=0.125 2023-10-11 15:45:10,670 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=754716.6666666666, ans=0.125 2023-10-11 15:45:25,986 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=754763.3333333334, ans=0.0 2023-10-11 15:45:27,549 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=754763.3333333334, ans=0.125 2023-10-11 15:45:36,026 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=754810.0, ans=0.0 2023-10-11 15:45:40,370 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.414e+02 1.738e+02 1.978e+02 2.245e+02 3.164e+02, threshold=3.956e+02, percent-clipped=0.0 2023-10-11 15:45:58,479 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.03 vs. limit=22.5 2023-10-11 15:46:05,342 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.51 vs. limit=10.0 2023-10-11 15:46:10,041 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=754950.0, ans=0.125 2023-10-11 15:46:10,093 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=754950.0, ans=0.2 2023-10-11 15:46:42,170 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=755043.3333333334, ans=0.5 2023-10-11 15:46:48,137 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=755090.0, ans=0.125 2023-10-11 15:46:54,364 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=755090.0, ans=0.0 2023-10-11 15:47:10,395 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=755183.3333333334, ans=0.125 2023-10-11 15:47:40,007 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.379e+02 1.614e+02 1.757e+02 2.008e+02 2.651e+02, threshold=3.514e+02, percent-clipped=0.0 2023-10-11 15:47:57,559 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=755370.0, ans=0.125 2023-10-11 15:47:59,679 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=755370.0, ans=0.1 2023-10-11 15:48:03,222 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.02 vs. 
limit=12.0 2023-10-11 15:48:15,188 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=755463.3333333334, ans=0.125 2023-10-11 15:48:20,023 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=755463.3333333334, ans=0.0 2023-10-11 15:48:30,939 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=755510.0, ans=0.0 2023-10-11 15:49:06,190 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=755650.0, ans=0.125 2023-10-11 15:49:17,591 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.94 vs. limit=12.0 2023-10-11 15:49:33,128 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.686e+02 1.852e+02 2.113e+02 2.806e+02, threshold=3.704e+02, percent-clipped=0.0 2023-10-11 15:49:35,553 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=755790.0, ans=0.125 2023-10-11 15:49:51,543 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=755836.6666666666, ans=0.05 2023-10-11 15:50:05,272 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 15:50:05,429 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=755836.6666666666, ans=0.125 2023-10-11 15:50:29,957 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=755930.0, ans=0.0 2023-10-11 15:50:34,464 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.89 vs. limit=15.0 2023-10-11 15:50:44,377 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=12.0 2023-10-11 15:50:55,643 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.61 vs. limit=22.5 2023-10-11 15:51:06,754 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.44 vs. limit=22.5 2023-10-11 15:51:29,026 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-11 15:51:29,220 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.14 vs. 
limit=22.5 2023-10-11 15:51:40,982 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.348e+02 1.661e+02 1.854e+02 2.117e+02 3.376e+02, threshold=3.708e+02, percent-clipped=0.0 2023-10-11 15:51:58,242 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=756303.3333333334, ans=0.125 2023-10-11 15:52:01,322 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=756303.3333333334, ans=0.2 2023-10-11 15:52:29,535 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=756396.6666666666, ans=0.0 2023-10-11 15:53:01,562 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=756536.6666666666, ans=0.125 2023-10-11 15:53:07,863 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=756583.3333333334, ans=0.0 2023-10-11 15:53:17,222 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.12 vs. limit=15.0 2023-10-11 15:53:26,544 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=756676.6666666666, ans=0.0 2023-10-11 15:53:32,467 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.73 vs. limit=15.0 2023-10-11 15:53:38,810 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.336e+02 1.714e+02 1.865e+02 2.159e+02 3.232e+02, threshold=3.729e+02, percent-clipped=0.0 2023-10-11 15:53:59,200 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=756770.0, ans=0.2 2023-10-11 15:54:03,951 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.88 vs. limit=12.0 2023-10-11 15:54:14,374 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=756863.3333333334, ans=0.1 2023-10-11 15:54:17,738 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.90 vs. limit=15.0 2023-10-11 15:54:27,053 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=756910.0, ans=0.1 2023-10-11 15:54:34,515 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=756956.6666666666, ans=0.1 2023-10-11 15:54:36,256 INFO [train.py:1031] (0/4) Epoch 12, batch 12000, loss[loss=0.1832, simple_loss=0.2815, pruned_loss=0.0425, over 16866.00 frames. ], tot_loss[loss=0.2003, simple_loss=0.2899, pruned_loss=0.05537, over 32707747.25 frames. ], batch size: 87, lr: 2.91e-03, grad_scale: 16.0 2023-10-11 15:54:42,847 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=8.47 vs. 
limit=12.0 2023-10-11 15:54:44,052 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=756956.6666666666, ans=10.0 2023-10-11 15:54:55,364 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=757003.3333333334, ans=0.015 2023-10-11 15:55:13,042 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=757096.6666666666, ans=0.0 2023-10-11 15:55:21,516 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=757143.3333333334, ans=0.0 2023-10-11 15:55:29,319 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=757143.3333333334, ans=22.5 2023-10-11 15:55:32,138 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.358e+02 1.656e+02 1.798e+02 2.124e+02 3.725e+02, threshold=3.595e+02, percent-clipped=0.0 2023-10-11 15:55:47,331 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=757236.6666666666, ans=0.125 2023-10-11 15:55:56,325 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.78 vs. limit=15.0 2023-10-11 15:55:59,081 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=757283.3333333334, ans=0.125 2023-10-11 15:56:07,428 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=757283.3333333334, ans=0.125 2023-10-11 15:56:21,305 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.97 vs. limit=15.0 2023-10-11 15:56:23,726 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=757376.6666666666, ans=0.0 2023-10-11 15:56:30,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=757376.6666666666, ans=0.0 2023-10-11 15:56:34,310 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=757423.3333333334, ans=0.125 2023-10-11 15:56:41,613 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.82 vs. 
limit=15.0 2023-10-11 15:56:45,315 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=757470.0, ans=0.125 2023-10-11 15:56:51,345 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=757470.0, ans=0.2 2023-10-11 15:56:54,378 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=757516.6666666666, ans=0.125 2023-10-11 15:56:57,457 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=757516.6666666666, ans=0.1 2023-10-11 15:57:02,936 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=757516.6666666666, ans=6.0 2023-10-11 15:57:07,715 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=757563.3333333334, ans=0.125 2023-10-11 15:57:11,609 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.79 vs. limit=15.0 2023-10-11 15:57:24,471 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.637e+02 1.855e+02 2.003e+02 3.373e+02, threshold=3.711e+02, percent-clipped=0.0 2023-10-11 15:58:09,990 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=757843.3333333334, ans=15.0 2023-10-11 15:58:28,477 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=757936.6666666666, ans=0.1 2023-10-11 15:58:34,017 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=757936.6666666666, ans=0.2 2023-10-11 15:58:46,457 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=757983.3333333334, ans=0.125 2023-10-11 15:58:53,506 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=758030.0, ans=0.0 2023-10-11 15:58:56,437 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=758030.0, ans=0.125 2023-10-11 15:59:05,374 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=758076.6666666666, ans=0.1 2023-10-11 15:59:14,170 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.440e+02 1.707e+02 1.867e+02 2.036e+02 2.749e+02, threshold=3.734e+02, percent-clipped=0.0 2023-10-11 15:59:18,301 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=758123.3333333334, ans=0.0 2023-10-11 15:59:20,669 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=758123.3333333334, ans=0.0 2023-10-11 15:59:24,887 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=758170.0, ans=0.1 2023-10-11 15:59:25,274 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.15 vs. 
limit=15.0 2023-10-11 15:59:52,347 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=758263.3333333334, ans=0.125 2023-10-11 16:00:18,345 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=758356.6666666666, ans=0.1 2023-10-11 16:00:30,302 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=758403.3333333334, ans=0.0 2023-10-11 16:00:35,998 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=758450.0, ans=0.125 2023-10-11 16:00:49,874 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=758496.6666666666, ans=0.125 2023-10-11 16:01:06,471 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.441e+02 1.696e+02 1.878e+02 2.108e+02 3.119e+02, threshold=3.756e+02, percent-clipped=0.0 2023-10-11 16:01:25,612 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=758636.6666666666, ans=0.1 2023-10-11 16:01:29,287 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=758636.6666666666, ans=0.125 2023-10-11 16:02:07,897 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.52 vs. limit=15.0 2023-10-11 16:02:27,547 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=758870.0, ans=0.2 2023-10-11 16:02:59,144 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=759010.0, ans=0.2 2023-10-11 16:03:04,412 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.735e+02 1.877e+02 2.156e+02 3.156e+02, threshold=3.754e+02, percent-clipped=0.0 2023-10-11 16:03:11,600 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.95 vs. limit=15.0 2023-10-11 16:03:14,771 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=759103.3333333334, ans=0.0 2023-10-11 16:03:29,690 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=759150.0, ans=0.125 2023-10-11 16:03:29,793 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=759150.0, ans=0.125 2023-10-11 16:03:31,685 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=759150.0, ans=0.125 2023-10-11 16:03:35,924 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=759150.0, ans=0.125 2023-10-11 16:04:00,925 INFO [train.py:1031] (0/4) Epoch 12, batch 12500, loss[loss=0.1925, simple_loss=0.2876, pruned_loss=0.04865, over 16807.00 frames. ], tot_loss[loss=0.2001, simple_loss=0.2895, pruned_loss=0.05533, over 32730035.16 frames. 
], batch size: 175, lr: 2.91e-03, grad_scale: 16.0 2023-10-11 16:04:06,272 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=759290.0, ans=0.0 2023-10-11 16:04:35,314 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=759430.0, ans=0.2 2023-10-11 16:04:40,330 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=759430.0, ans=0.0 2023-10-11 16:04:49,089 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.13 vs. limit=22.5 2023-10-11 16:04:50,547 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=759476.6666666666, ans=0.0 2023-10-11 16:04:55,142 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.356e+02 1.670e+02 1.844e+02 2.059e+02 2.894e+02, threshold=3.688e+02, percent-clipped=0.0 2023-10-11 16:04:56,725 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=9.78 vs. limit=12.0 2023-10-11 16:05:19,799 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.64 vs. limit=10.0 2023-10-11 16:05:23,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=759616.6666666666, ans=0.0 2023-10-11 16:05:27,209 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.77 vs. limit=15.0 2023-10-11 16:05:29,054 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=759663.3333333334, ans=0.125 2023-10-11 16:05:29,412 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.19 vs. limit=15.0 2023-10-11 16:05:31,269 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.48 vs. 
limit=10.0 2023-10-11 16:05:45,128 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=759710.0, ans=0.1 2023-10-11 16:05:48,634 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=759710.0, ans=0.2 2023-10-11 16:05:50,529 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=759756.6666666666, ans=0.04949747468305833 2023-10-11 16:06:08,626 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=759803.3333333334, ans=0.125 2023-10-11 16:06:35,344 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=759943.3333333334, ans=0.025 2023-10-11 16:06:41,639 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=759943.3333333334, ans=0.2 2023-10-11 16:06:45,880 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=759990.0, ans=0.0 2023-10-11 16:06:46,496 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.458e+02 1.668e+02 1.891e+02 2.120e+02 2.902e+02, threshold=3.782e+02, percent-clipped=0.0 2023-10-11 16:07:01,964 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.99 vs. limit=22.5 2023-10-11 16:07:02,675 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=760036.6666666666, ans=0.2 2023-10-11 16:07:05,135 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=760036.6666666666, ans=0.0 2023-10-11 16:07:19,290 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=760083.3333333334, ans=0.125 2023-10-11 16:07:25,495 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.11 vs. limit=22.5 2023-10-11 16:07:36,420 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=760176.6666666666, ans=0.125 2023-10-11 16:07:39,762 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.07 vs. limit=6.0 2023-10-11 16:07:48,439 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=760223.3333333334, ans=0.0 2023-10-11 16:07:59,608 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.03 vs. 
limit=22.5 2023-10-11 16:08:11,474 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=760316.6666666666, ans=0.1 2023-10-11 16:08:15,291 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=760316.6666666666, ans=0.0 2023-10-11 16:08:17,237 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=760316.6666666666, ans=0.0 2023-10-11 16:08:18,396 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=760363.3333333334, ans=0.125 2023-10-11 16:08:25,969 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=760363.3333333334, ans=0.0 2023-10-11 16:08:40,927 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.712e+02 1.879e+02 2.129e+02 3.461e+02, threshold=3.759e+02, percent-clipped=0.0 2023-10-11 16:08:46,660 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.12 vs. limit=22.5 2023-10-11 16:08:54,672 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=760503.3333333334, ans=0.125 2023-10-11 16:09:09,484 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.54 vs. limit=15.0 2023-10-11 16:09:15,637 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 16:09:16,671 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.28 vs. limit=10.0 2023-10-11 16:09:31,608 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=760643.3333333334, ans=0.2 2023-10-11 16:10:01,976 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=760783.3333333334, ans=0.07 2023-10-11 16:10:24,287 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=760876.6666666666, ans=0.1 2023-10-11 16:10:32,366 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.458e+02 1.727e+02 1.956e+02 2.282e+02 3.159e+02, threshold=3.912e+02, percent-clipped=0.0 2023-10-11 16:10:34,501 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=760923.3333333334, ans=0.125 2023-10-11 16:11:09,653 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.89 vs. 
limit=12.0 2023-10-11 16:11:13,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff3.min_abs, batch_count=761063.3333333334, ans=0.2 2023-10-11 16:11:38,762 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=761156.6666666666, ans=0.1 2023-10-11 16:11:54,540 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=761250.0, ans=0.0 2023-10-11 16:12:09,835 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=761296.6666666666, ans=0.125 2023-10-11 16:12:24,468 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.83 vs. limit=15.0 2023-10-11 16:12:26,416 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.02 vs. limit=6.0 2023-10-11 16:12:27,243 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.345e+02 1.632e+02 1.753e+02 1.957e+02 2.587e+02, threshold=3.505e+02, percent-clipped=0.0 2023-10-11 16:12:33,864 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys.whitening_limit, batch_count=761390.0, ans=6.0 2023-10-11 16:12:43,120 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=761436.6666666666, ans=0.1 2023-10-11 16:12:57,704 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.86 vs. limit=15.0 2023-10-11 16:13:03,939 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.75 vs. limit=6.0 2023-10-11 16:13:19,364 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.45 vs. limit=22.5 2023-10-11 16:13:19,733 INFO [train.py:1031] (0/4) Epoch 12, batch 13000, loss[loss=0.2006, simple_loss=0.2867, pruned_loss=0.05729, over 16941.00 frames. ], tot_loss[loss=0.2006, simple_loss=0.2902, pruned_loss=0.0555, over 32748460.21 frames. ], batch size: 165, lr: 2.90e-03, grad_scale: 16.0 2023-10-11 16:13:56,453 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.85 vs. limit=10.0 2023-10-11 16:14:04,784 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=3.72 vs. limit=15.0 2023-10-11 16:14:14,369 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.12 vs. 
limit=15.0 2023-10-11 16:14:27,577 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.404e+02 1.731e+02 1.935e+02 2.290e+02 3.397e+02, threshold=3.871e+02, percent-clipped=0.0 2023-10-11 16:14:40,908 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=761903.3333333334, ans=0.5 2023-10-11 16:14:47,535 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=761903.3333333334, ans=0.2 2023-10-11 16:14:48,725 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.95 vs. limit=22.5 2023-10-11 16:15:05,842 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=761996.6666666666, ans=0.125 2023-10-11 16:15:25,853 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=762090.0, ans=0.95 2023-10-11 16:15:32,671 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=762090.0, ans=0.2 2023-10-11 16:15:34,631 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=762136.6666666666, ans=0.0 2023-10-11 16:15:47,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=762183.3333333334, ans=0.05 2023-10-11 16:15:50,551 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=762183.3333333334, ans=0.1 2023-10-11 16:15:55,114 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=762183.3333333334, ans=0.125 2023-10-11 16:16:05,458 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.00 vs. limit=15.0 2023-10-11 16:16:11,185 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=762276.6666666666, ans=0.125 2023-10-11 16:16:20,370 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.368e+02 1.652e+02 1.832e+02 2.054e+02 3.339e+02, threshold=3.663e+02, percent-clipped=0.0 2023-10-11 16:16:31,371 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=762370.0, ans=0.2 2023-10-11 16:16:31,764 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.11 vs. 
limit=12.0 2023-10-11 16:16:43,845 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=762416.6666666666, ans=0.04949747468305833 2023-10-11 16:17:07,791 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=762510.0, ans=0.125 2023-10-11 16:17:08,593 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=762510.0, ans=0.1 2023-10-11 16:17:18,735 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=762556.6666666666, ans=0.0 2023-10-11 16:17:29,244 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.32 vs. limit=15.0 2023-10-11 16:17:32,433 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-11 16:17:33,339 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=762603.3333333334, ans=0.2 2023-10-11 16:17:39,186 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=762603.3333333334, ans=0.125 2023-10-11 16:17:39,456 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.92 vs. limit=15.0 2023-10-11 16:18:08,283 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=762743.3333333334, ans=0.125 2023-10-11 16:18:17,073 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.428e+02 1.696e+02 1.832e+02 2.062e+02 3.018e+02, threshold=3.665e+02, percent-clipped=0.0 2023-10-11 16:18:17,927 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=12.97 vs. 
limit=15.0 2023-10-11 16:18:31,805 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=762836.6666666666, ans=0.0 2023-10-11 16:18:50,258 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=762930.0, ans=0.125 2023-10-11 16:19:01,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=762976.6666666666, ans=0.1 2023-10-11 16:19:19,932 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=763023.3333333334, ans=10.0 2023-10-11 16:19:32,237 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=763070.0, ans=0.125 2023-10-11 16:19:42,463 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=763116.6666666666, ans=0.1 2023-10-11 16:20:07,749 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.480e+02 1.771e+02 1.947e+02 2.146e+02 2.873e+02, threshold=3.893e+02, percent-clipped=0.0 2023-10-11 16:20:25,711 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=763303.3333333334, ans=0.2 2023-10-11 16:20:25,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=763303.3333333334, ans=0.0 2023-10-11 16:20:58,901 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=763443.3333333334, ans=0.125 2023-10-11 16:20:58,909 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=763443.3333333334, ans=0.2 2023-10-11 16:21:02,956 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=763490.0, ans=0.1 2023-10-11 16:21:11,211 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten.whitening_limit, batch_count=763490.0, ans=22.5 2023-10-11 16:21:32,235 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=763583.3333333334, ans=0.0 2023-10-11 16:21:42,866 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.97 vs. limit=12.0 2023-10-11 16:21:58,394 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=763676.6666666666, ans=0.0 2023-10-11 16:22:02,720 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.708e+02 1.909e+02 2.135e+02 2.881e+02, threshold=3.818e+02, percent-clipped=0.0 2023-10-11 16:22:37,185 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=763863.3333333334, ans=0.0 2023-10-11 16:22:46,807 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=763910.0, ans=0.125 2023-10-11 16:22:56,456 INFO [train.py:1031] (0/4) Epoch 12, batch 13500, loss[loss=0.1896, simple_loss=0.2584, pruned_loss=0.06044, over 12623.00 frames. 
], tot_loss[loss=0.2001, simple_loss=0.2895, pruned_loss=0.05534, over 32740230.82 frames. ], batch size: 440, lr: 2.90e-03, grad_scale: 32.0 2023-10-11 16:22:58,770 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=763956.6666666666, ans=0.0 2023-10-11 16:23:10,528 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=764003.3333333334, ans=0.2 2023-10-11 16:23:20,974 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=764050.0, ans=0.125 2023-10-11 16:23:40,835 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.34 vs. limit=15.0 2023-10-11 16:23:52,402 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 1.734e+02 1.927e+02 2.347e+02 3.260e+02, threshold=3.853e+02, percent-clipped=0.0 2023-10-11 16:24:07,604 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=764236.6666666666, ans=0.125 2023-10-11 16:24:10,673 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=764236.6666666666, ans=0.125 2023-10-11 16:24:11,122 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.38 vs. limit=6.0 2023-10-11 16:24:15,347 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=764283.3333333334, ans=0.0 2023-10-11 16:24:16,850 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=764283.3333333334, ans=0.1 2023-10-11 16:24:27,028 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=764330.0, ans=0.125 2023-10-11 16:24:28,237 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 16:24:29,733 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.54 vs. limit=15.0 2023-10-11 16:24:43,242 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=764376.6666666666, ans=0.125 2023-10-11 16:24:43,369 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.34 vs. 
limit=22.5 2023-10-11 16:24:59,464 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=764470.0, ans=0.0 2023-10-11 16:25:01,171 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=764470.0, ans=0.125 2023-10-11 16:25:10,050 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=764516.6666666666, ans=0.125 2023-10-11 16:25:32,284 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=764610.0, ans=0.2 2023-10-11 16:25:32,364 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=764610.0, ans=0.125 2023-10-11 16:25:37,556 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.732e+02 1.901e+02 2.142e+02 3.181e+02, threshold=3.802e+02, percent-clipped=0.0 2023-10-11 16:25:38,655 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=764656.6666666666, ans=0.125 2023-10-11 16:25:45,001 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/epoch-12.pt 2023-10-11 16:26:16,741 INFO [train.py:1031] (0/4) Epoch 13, batch 0, loss[loss=0.1578, simple_loss=0.2503, pruned_loss=0.03268, over 16868.00 frames. ], tot_loss[loss=0.1578, simple_loss=0.2503, pruned_loss=0.03268, over 16868.00 frames. ], batch size: 116, lr: 2.77e-03, grad_scale: 32.0 2023-10-11 16:26:16,742 INFO [train.py:1054] (0/4) Computing validation loss 2023-10-11 16:26:24,452 INFO [train.py:1063] (0/4) Epoch 13, validation: loss=0.2183, simple_loss=0.306, pruned_loss=0.06527, over 1020973.00 frames. 2023-10-11 16:26:24,453 INFO [train.py:1064] (0/4) Maximum memory allocated so far is 17165MB 2023-10-11 16:26:39,471 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=764731.3333333334, ans=0.0 2023-10-11 16:26:53,752 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=764778.0, ans=0.09899494936611666 2023-10-11 16:26:54,733 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=764778.0, ans=0.125 2023-10-11 16:26:56,100 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.94 vs. limit=15.0 2023-10-11 16:27:07,345 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=764824.6666666666, ans=0.1 2023-10-11 16:27:13,632 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=764871.3333333334, ans=0.0 2023-10-11 16:27:23,964 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=764918.0, ans=0.2 2023-10-11 16:27:45,287 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=764964.6666666666, ans=0.125 2023-10-11 16:27:45,701 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.15 vs. 
limit=22.5 2023-10-11 16:27:50,808 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=765011.3333333334, ans=0.1 2023-10-11 16:28:12,812 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.390e+02 1.708e+02 1.952e+02 2.399e+02 5.505e+02, threshold=3.905e+02, percent-clipped=4.0 2023-10-11 16:28:13,988 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=765104.6666666666, ans=0.1 2023-10-11 16:28:20,889 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=765151.3333333334, ans=0.95 2023-10-11 16:28:24,095 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=765151.3333333334, ans=0.125 2023-10-11 16:28:25,159 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=765151.3333333334, ans=0.0 2023-10-11 16:28:28,250 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=765198.0, ans=0.0 2023-10-11 16:28:30,208 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=765198.0, ans=0.0 2023-10-11 16:28:35,797 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=765198.0, ans=0.0 2023-10-11 16:28:39,809 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=765244.6666666666, ans=0.125 2023-10-11 16:28:41,115 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=765244.6666666666, ans=0.125 2023-10-11 16:28:58,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=765291.3333333334, ans=0.1 2023-10-11 16:29:07,472 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=765338.0, ans=0.125 2023-10-11 16:29:12,461 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=765338.0, ans=0.1 2023-10-11 16:29:12,573 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=765338.0, ans=0.1 2023-10-11 16:29:18,701 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=765384.6666666666, ans=0.125 2023-10-11 16:29:43,471 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=765478.0, ans=0.0 2023-10-11 16:29:52,900 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=765524.6666666666, ans=0.125 2023-10-11 16:30:02,072 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.258e+02 1.643e+02 1.767e+02 1.891e+02 2.393e+02, threshold=3.534e+02, percent-clipped=0.0 2023-10-11 16:30:09,839 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 16:30:18,095 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=765664.6666666666, ans=0.125 2023-10-11 
16:30:29,110 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=765711.3333333334, ans=0.0 2023-10-11 16:30:31,139 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=765711.3333333334, ans=0.125 2023-10-11 16:30:32,157 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=765711.3333333334, ans=0.1 2023-10-11 16:30:32,201 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 16:31:13,438 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=765851.3333333334, ans=0.0 2023-10-11 16:31:19,324 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=765898.0, ans=10.0 2023-10-11 16:31:21,087 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=765898.0, ans=0.125 2023-10-11 16:31:29,332 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=765944.6666666666, ans=0.125 2023-10-11 16:31:31,127 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=765944.6666666666, ans=0.0 2023-10-11 16:31:36,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=765944.6666666666, ans=0.1 2023-10-11 16:31:45,735 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=765991.3333333334, ans=0.0 2023-10-11 16:31:45,871 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=765991.3333333334, ans=0.125 2023-10-11 16:31:48,793 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=765991.3333333334, ans=0.025 2023-10-11 16:31:55,368 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=766038.0, ans=0.125 2023-10-11 16:31:55,939 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.372e+02 1.676e+02 1.842e+02 2.223e+02 3.061e+02, threshold=3.685e+02, percent-clipped=0.0 2023-10-11 16:32:05,424 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=766084.6666666666, ans=0.125 2023-10-11 16:32:26,815 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=766178.0, ans=0.125 2023-10-11 16:33:00,762 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=766318.0, ans=0.2 2023-10-11 16:33:18,316 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=766411.3333333334, ans=0.0 2023-10-11 16:33:23,676 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.83 vs. 
limit=15.0 2023-10-11 16:33:36,859 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=766504.6666666666, ans=0.2 2023-10-11 16:33:41,906 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=766504.6666666666, ans=0.05 2023-10-11 16:33:43,413 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.313e+02 1.732e+02 1.917e+02 2.155e+02 3.157e+02, threshold=3.834e+02, percent-clipped=0.0 2023-10-11 16:33:43,758 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=766504.6666666666, ans=0.0 2023-10-11 16:33:47,353 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.18 vs. limit=15.0 2023-10-11 16:33:49,305 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.86 vs. limit=15.0 2023-10-11 16:34:06,028 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=766598.0, ans=0.125 2023-10-11 16:34:09,460 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=766598.0, ans=0.125 2023-10-11 16:34:11,542 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=766644.6666666666, ans=0.125 2023-10-11 16:34:34,900 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=766738.0, ans=0.1 2023-10-11 16:34:35,914 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=766738.0, ans=0.1 2023-10-11 16:34:41,888 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=766738.0, ans=0.95 2023-10-11 16:34:52,500 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=766784.6666666666, ans=0.1 2023-10-11 16:34:59,920 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.74 vs. limit=12.0 2023-10-11 16:35:15,495 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=766878.0, ans=0.125 2023-10-11 16:35:18,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=766878.0, ans=0.125 2023-10-11 16:35:39,175 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.339e+02 1.777e+02 1.942e+02 2.286e+02 3.304e+02, threshold=3.884e+02, percent-clipped=0.0 2023-10-11 16:35:43,057 INFO [train.py:1031] (0/4) Epoch 13, batch 500, loss[loss=0.1836, simple_loss=0.2763, pruned_loss=0.04545, over 16827.00 frames. ], tot_loss[loss=0.1996, simple_loss=0.2884, pruned_loss=0.05539, over 7251801.39 frames. ], batch size: 155, lr: 2.77e-03, grad_scale: 32.0 2023-10-11 16:35:54,084 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=3.84 vs. 
limit=15.0 2023-10-11 16:36:03,457 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=767064.6666666666, ans=0.0 2023-10-11 16:36:03,540 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 16:36:06,073 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.56 vs. limit=15.0 2023-10-11 16:36:11,543 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=767111.3333333334, ans=0.1 2023-10-11 16:36:24,694 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.05 vs. limit=15.0 2023-10-11 16:37:14,668 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=767391.3333333334, ans=0.125 2023-10-11 16:37:26,459 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=767438.0, ans=0.0 2023-10-11 16:37:31,144 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.460e+02 1.779e+02 2.034e+02 2.282e+02 3.435e+02, threshold=4.068e+02, percent-clipped=0.0 2023-10-11 16:37:51,800 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=767531.3333333334, ans=0.125 2023-10-11 16:38:03,857 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=767578.0, ans=0.0 2023-10-11 16:38:07,016 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=767578.0, ans=0.1 2023-10-11 16:38:12,761 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=767624.6666666666, ans=0.02 2023-10-11 16:38:14,702 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.40 vs. 
limit=6.0 2023-10-11 16:38:21,932 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=767671.3333333334, ans=0.125 2023-10-11 16:38:24,081 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=767671.3333333334, ans=0.0 2023-10-11 16:38:27,966 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=767671.3333333334, ans=0.125 2023-10-11 16:38:30,003 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=767671.3333333334, ans=0.0 2023-10-11 16:38:32,936 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 16:38:35,765 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=767718.0, ans=0.125 2023-10-11 16:38:45,364 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=767764.6666666666, ans=0.2 2023-10-11 16:38:51,866 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=767764.6666666666, ans=0.0 2023-10-11 16:38:54,262 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.34 vs. limit=15.0 2023-10-11 16:39:00,787 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.37 vs. limit=15.0 2023-10-11 16:39:09,131 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 16:39:14,516 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=767904.6666666666, ans=0.125 2023-10-11 16:39:14,849 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.31 vs. 
limit=15.0 2023-10-11 16:39:15,346 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=767904.6666666666, ans=0.035 2023-10-11 16:39:21,075 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.349e+02 1.784e+02 1.927e+02 2.115e+02 2.737e+02, threshold=3.854e+02, percent-clipped=0.0 2023-10-11 16:39:35,496 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=767951.3333333334, ans=0.2 2023-10-11 16:39:36,409 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=767998.0, ans=0.1 2023-10-11 16:39:40,408 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=767998.0, ans=0.0 2023-10-11 16:40:02,234 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=768091.3333333334, ans=0.125 2023-10-11 16:40:29,322 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=768184.6666666666, ans=0.2 2023-10-11 16:40:30,332 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=768184.6666666666, ans=0.125 2023-10-11 16:40:37,269 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.89 vs. limit=15.0 2023-10-11 16:40:49,859 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=768278.0, ans=0.0 2023-10-11 16:40:55,720 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=768278.0, ans=0.04949747468305833 2023-10-11 16:40:58,677 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=768324.6666666666, ans=0.1 2023-10-11 16:41:04,885 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.72 vs. limit=6.0 2023-10-11 16:41:17,258 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.733e+02 1.811e+02 2.071e+02 2.964e+02, threshold=3.623e+02, percent-clipped=0.0 2023-10-11 16:41:21,664 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.42 vs. limit=15.0 2023-10-11 16:41:24,822 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=768418.0, ans=0.015 2023-10-11 16:42:08,893 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.09 vs. 
limit=15.0 2023-10-11 16:42:12,699 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=768604.6666666666, ans=0.0 2023-10-11 16:42:17,635 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=768604.6666666666, ans=0.125 2023-10-11 16:42:20,685 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=768604.6666666666, ans=0.2 2023-10-11 16:42:30,638 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=768651.3333333334, ans=0.125 2023-10-11 16:42:39,383 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=768651.3333333334, ans=0.0 2023-10-11 16:43:09,468 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=768791.3333333334, ans=0.1 2023-10-11 16:43:16,952 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=768838.0, ans=0.125 2023-10-11 16:43:23,505 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.719e+02 1.883e+02 2.125e+02 2.964e+02, threshold=3.767e+02, percent-clipped=0.0 2023-10-11 16:43:24,750 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=768838.0, ans=0.125 2023-10-11 16:43:27,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=768884.6666666666, ans=0.07 2023-10-11 16:43:56,331 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=768978.0, ans=0.125 2023-10-11 16:44:21,622 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=769071.3333333334, ans=0.2 2023-10-11 16:44:43,370 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.39 vs. limit=12.0 2023-10-11 16:44:52,515 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=769211.3333333334, ans=0.0 2023-10-11 16:44:56,708 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=769211.3333333334, ans=0.1 2023-10-11 16:45:00,949 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.04 vs. limit=15.0 2023-10-11 16:45:07,930 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.45 vs. limit=15.0 2023-10-11 16:45:11,774 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=769304.6666666666, ans=0.125 2023-10-11 16:45:14,030 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.94 vs. 
limit=22.5 2023-10-11 16:45:16,903 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=769304.6666666666, ans=0.07 2023-10-11 16:45:18,610 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.319e+02 1.666e+02 1.805e+02 1.947e+02 2.979e+02, threshold=3.611e+02, percent-clipped=0.0 2023-10-11 16:45:21,436 INFO [train.py:1031] (0/4) Epoch 13, batch 1000, loss[loss=0.1804, simple_loss=0.2784, pruned_loss=0.04119, over 16937.00 frames. ], tot_loss[loss=0.2004, simple_loss=0.2896, pruned_loss=0.05558, over 12904571.62 frames. ], batch size: 123, lr: 2.76e-03, grad_scale: 32.0 2023-10-11 16:45:35,410 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=769398.0, ans=0.2 2023-10-11 16:45:38,469 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.27 vs. limit=12.0 2023-10-11 16:45:42,996 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=769444.6666666666, ans=0.125 2023-10-11 16:45:51,173 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=769444.6666666666, ans=0.125 2023-10-11 16:45:53,879 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=769491.3333333334, ans=0.125 2023-10-11 16:45:55,791 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=769491.3333333334, ans=0.1 2023-10-11 16:45:58,176 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=769491.3333333334, ans=0.125 2023-10-11 16:46:37,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=769678.0, ans=0.0 2023-10-11 16:46:57,923 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.79 vs. limit=15.0 2023-10-11 16:47:06,561 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.335e+02 1.690e+02 1.909e+02 2.091e+02 2.792e+02, threshold=3.818e+02, percent-clipped=0.0 2023-10-11 16:47:11,356 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=769818.0, ans=0.0 2023-10-11 16:47:48,130 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.31 vs. 
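limit=15.0

In the [optim.py:471] Clipping_scale lines, the five values are quartiles (read as min, 25%, median, 75%, max) of recently observed gradient norms, and the logged threshold is consistent with clipping_scale times the median: in the 16:47:06 entry above, 2.0 * 1.909e+02 = 3.818e+02, exactly the logged threshold. A small sketch of that bookkeeping, as an illustration rather than icefall's actual optim.py code:

    import torch

    # norms: a hypothetical window of recent per-batch gradient norms.
    def clipping_stats(norms: torch.Tensor, clipping_scale: float = 2.0):
        qs = torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0], dtype=norms.dtype)
        q = torch.quantile(norms, qs)              # the five logged quartiles
        threshold = clipping_scale * q[2]          # 2.0 x median, as in the log
        percent_clipped = 100.0 * (norms > threshold).float().mean()
        return q, threshold, percent_clipped

A logged percent-clipped=0.0 then means no gradient norm in the window exceeded the threshold.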
2023-10-11 16:47:56,415 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=769958.0, ans=0.125
2023-10-11 16:48:11,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=770004.6666666666, ans=15.0
2023-10-11 16:48:34,936 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=770098.0, ans=0.0
2023-10-11 16:48:47,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=770144.6666666666, ans=0.125
2023-10-11 16:49:00,996 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.14 vs. limit=22.5
2023-10-11 16:49:11,343 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=770238.0, ans=0.1
2023-10-11 16:49:14,808 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.364e+02 1.646e+02 1.802e+02 2.045e+02 3.030e+02, threshold=3.603e+02, percent-clipped=0.0
2023-10-11 16:49:22,972 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=770284.6666666666, ans=0.2
2023-10-11 16:49:32,334 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=770331.3333333334, ans=0.125
2023-10-11 16:49:44,053 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=770378.0, ans=0.125
2023-10-11 16:49:50,675 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=770424.6666666666, ans=0.2
2023-10-11 16:50:09,131 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=770471.3333333334, ans=0.5
2023-10-11 16:50:37,449 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=770611.3333333334, ans=0.125
2023-10-11 16:50:38,437 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=770611.3333333334, ans=0.125
2023-10-11 16:50:46,609 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.93 vs. limit=15.0
2023-10-11 16:50:58,605 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=770704.6666666666, ans=0.125
2023-10-11 16:51:03,057 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=6.40 vs. limit=15.0
2023-10-11 16:51:04,421 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.660e+02 1.834e+02 2.035e+02 3.154e+02, threshold=3.667e+02, percent-clipped=0.0
2023-10-11 16:51:08,857 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.41 vs.
limit=12.0 2023-10-11 16:51:21,312 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=770798.0, ans=10.0 2023-10-11 16:51:29,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=770798.0, ans=0.1 2023-10-11 16:51:33,286 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.57 vs. limit=22.5 2023-10-11 16:52:07,840 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=770984.6666666666, ans=0.125 2023-10-11 16:52:14,140 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=771031.3333333334, ans=0.125 2023-10-11 16:52:34,744 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=771078.0, ans=0.125 2023-10-11 16:52:48,356 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_na.min_abs, batch_count=771171.3333333334, ans=0.02 2023-10-11 16:52:57,745 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.296e+02 1.752e+02 1.889e+02 2.193e+02 3.124e+02, threshold=3.779e+02, percent-clipped=0.0 2023-10-11 16:52:59,693 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=771218.0, ans=0.125 2023-10-11 16:53:24,008 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=771311.3333333334, ans=0.125 2023-10-11 16:53:28,725 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.68 vs. limit=5.0 2023-10-11 16:53:34,499 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=771358.0, ans=0.125 2023-10-11 16:53:56,753 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=771451.3333333334, ans=0.0 2023-10-11 16:54:15,171 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=771498.0, ans=0.125 2023-10-11 16:54:16,113 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=771498.0, ans=0.125 2023-10-11 16:54:47,718 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.71 vs. limit=15.0 2023-10-11 16:54:51,490 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.260e+02 1.701e+02 1.885e+02 2.028e+02 2.925e+02, threshold=3.771e+02, percent-clipped=0.0 2023-10-11 16:54:54,104 INFO [train.py:1031] (0/4) Epoch 13, batch 1500, loss[loss=0.2023, simple_loss=0.2948, pruned_loss=0.05488, over 16907.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.2884, pruned_loss=0.05465, over 17328925.72 frames. 
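], batch size: 165, lr: 2.76e-03, grad_scale: 32.0

The loss fields in these train.py lines combine as loss = 0.5 * simple_loss + pruned_loss, which the tot_loss just above satisfies; a quick check, assuming only that 0.5 weight on simple_loss:

    # 0.5 * 0.2884 + 0.05465 = 0.19885, i.e. the logged tot_loss of 0.1989
    simple_loss, pruned_loss = 0.2884, 0.05465
    assert abs(0.5 * simple_loss + pruned_loss - 0.1989) < 5e-4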
2023-10-11 16:54:59,144 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=771684.6666666666, ans=0.125
2023-10-11 16:55:01,062 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.92 vs. limit=22.5
2023-10-11 16:55:33,229 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.37 vs. limit=12.0
2023-10-11 16:55:34,259 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.33 vs. limit=15.0
2023-10-11 16:56:15,445 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=772011.3333333334, ans=0.1
2023-10-11 16:56:17,392 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=772011.3333333334, ans=0.125
2023-10-11 16:56:43,125 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.405e+02 1.716e+02 1.889e+02 2.098e+02 3.219e+02, threshold=3.779e+02, percent-clipped=0.0
2023-10-11 16:56:46,176 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=772151.3333333334, ans=0.125
2023-10-11 16:57:02,241 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=772198.0, ans=0.125
2023-10-11 16:57:13,282 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=772244.6666666666, ans=0.125
2023-10-11 16:57:35,547 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.16 vs. limit=15.0
2023-10-11 16:57:41,728 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.66 vs. limit=15.0
2023-10-11 16:57:56,298 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=772384.6666666666, ans=0.125
2023-10-11 16:58:06,002 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=772431.3333333334, ans=0.2
2023-10-11 16:58:17,204 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=772478.0, ans=0.0
2023-10-11 16:58:42,310 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.617e+02 1.832e+02 2.084e+02 3.024e+02, threshold=3.663e+02, percent-clipped=0.0
2023-10-11 16:58:46,364 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=772618.0, ans=0.125
2023-10-11 16:58:58,837 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=772664.6666666666, ans=0.1
2023-10-11 16:59:14,189 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.48 vs.
limit=15.0 2023-10-11 16:59:15,886 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=772758.0, ans=0.125 2023-10-11 16:59:44,280 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=772851.3333333334, ans=0.125 2023-10-11 17:00:03,024 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=772944.6666666666, ans=0.0 2023-10-11 17:00:10,628 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=772991.3333333334, ans=0.125 2023-10-11 17:00:11,825 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=772991.3333333334, ans=0.2 2023-10-11 17:00:22,927 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.max_abs, batch_count=772991.3333333334, ans=10.0 2023-10-11 17:00:32,801 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.417e+02 1.683e+02 1.885e+02 2.125e+02 3.193e+02, threshold=3.770e+02, percent-clipped=0.0 2023-10-11 17:00:36,194 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=773084.6666666666, ans=0.125 2023-10-11 17:00:57,409 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=773178.0, ans=0.0 2023-10-11 17:01:02,655 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=773178.0, ans=0.0 2023-10-11 17:01:31,137 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.44 vs. limit=22.5 2023-10-11 17:01:38,259 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=773318.0, ans=0.125 2023-10-11 17:01:58,709 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=773411.3333333334, ans=0.0 2023-10-11 17:02:03,798 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.95 vs. limit=10.0 2023-10-11 17:02:04,355 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=773458.0, ans=0.125 2023-10-11 17:02:22,090 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.681e+02 1.884e+02 2.076e+02 2.770e+02, threshold=3.767e+02, percent-clipped=0.0 2023-10-11 17:02:23,493 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=773551.3333333334, ans=0.125 2023-10-11 17:02:35,945 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.36 vs. 
limit=15.0 2023-10-11 17:02:46,432 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=773644.6666666666, ans=0.125 2023-10-11 17:02:54,144 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.05 vs. limit=15.0 2023-10-11 17:03:27,214 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.94 vs. limit=15.0 2023-10-11 17:03:58,294 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=773878.0, ans=0.125 2023-10-11 17:04:03,950 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=773878.0, ans=0.1 2023-10-11 17:04:06,188 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=773924.6666666666, ans=0.2 2023-10-11 17:04:22,092 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=773971.3333333334, ans=0.0 2023-10-11 17:04:27,512 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=773971.3333333334, ans=0.125 2023-10-11 17:04:28,689 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.369e+02 1.638e+02 1.768e+02 2.069e+02 2.955e+02, threshold=3.535e+02, percent-clipped=0.0 2023-10-11 17:04:30,858 INFO [train.py:1031] (0/4) Epoch 13, batch 2000, loss[loss=0.1913, simple_loss=0.2874, pruned_loss=0.04765, over 16065.00 frames. ], tot_loss[loss=0.1992, simple_loss=0.2889, pruned_loss=0.05477, over 20732807.53 frames. ], batch size: 43, lr: 2.76e-03, grad_scale: 32.0 2023-10-11 17:04:45,630 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=774064.6666666666, ans=0.1 2023-10-11 17:05:11,803 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=774158.0, ans=0.2 2023-10-11 17:05:28,352 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=774204.6666666666, ans=0.125 2023-10-11 17:05:30,633 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=774204.6666666666, ans=0.125 2023-10-11 17:05:30,798 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=774204.6666666666, ans=0.125 2023-10-11 17:05:41,074 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.58 vs. limit=22.5 2023-10-11 17:05:42,177 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=774251.3333333334, ans=0.125 2023-10-11 17:05:55,292 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=774298.0, ans=0.0 2023-10-11 17:06:10,564 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.98 vs. 
limit=6.0 2023-10-11 17:06:38,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=774438.0, ans=0.0 2023-10-11 17:06:38,845 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.49 vs. limit=15.0 2023-10-11 17:06:44,499 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.299e+02 1.680e+02 1.868e+02 2.098e+02 3.054e+02, threshold=3.736e+02, percent-clipped=0.0 2023-10-11 17:06:57,571 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=774484.6666666666, ans=0.1 2023-10-11 17:07:01,476 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=774484.6666666666, ans=0.2 2023-10-11 17:07:27,308 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=774578.0, ans=0.0 2023-10-11 17:07:45,235 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=774624.6666666666, ans=0.125 2023-10-11 17:08:26,574 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=774764.6666666666, ans=0.125 2023-10-11 17:08:40,029 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.44 vs. limit=22.5 2023-10-11 17:08:50,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=774858.0, ans=0.0 2023-10-11 17:08:56,512 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=774858.0, ans=10.0 2023-10-11 17:08:57,826 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.51 vs. limit=10.0 2023-10-11 17:09:09,197 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.42 vs. limit=15.0 2023-10-11 17:09:10,267 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.351e+02 1.747e+02 1.980e+02 2.145e+02 2.940e+02, threshold=3.961e+02, percent-clipped=0.0 2023-10-11 17:09:32,385 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=775044.6666666666, ans=0.0 2023-10-11 17:09:36,340 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=775044.6666666666, ans=0.0 2023-10-11 17:09:54,205 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=775138.0, ans=0.125 2023-10-11 17:09:54,445 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.54 vs. limit=22.5 2023-10-11 17:09:57,903 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=775138.0, ans=0.125 2023-10-11 17:09:58,269 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.60 vs. 
limit=22.5 2023-10-11 17:10:08,430 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.04 vs. limit=22.5 2023-10-11 17:10:39,338 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=775324.6666666666, ans=0.2 2023-10-11 17:10:49,587 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=775371.3333333334, ans=0.035 2023-10-11 17:10:53,974 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.53 vs. limit=15.0 2023-10-11 17:10:54,883 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=775371.3333333334, ans=0.015 2023-10-11 17:10:57,647 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.496e+02 1.726e+02 1.937e+02 2.179e+02 3.134e+02, threshold=3.874e+02, percent-clipped=0.0 2023-10-11 17:11:02,485 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=775418.0, ans=0.0 2023-10-11 17:11:18,748 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=775464.6666666666, ans=0.0 2023-10-11 17:11:27,035 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=775511.3333333334, ans=0.1 2023-10-11 17:12:06,686 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=775698.0, ans=0.0 2023-10-11 17:12:12,368 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=775698.0, ans=0.125 2023-10-11 17:12:32,357 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=775791.3333333334, ans=0.125 2023-10-11 17:12:34,882 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=775791.3333333334, ans=0.125 2023-10-11 17:12:37,729 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=775791.3333333334, ans=0.0 2023-10-11 17:12:50,714 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.735e+02 1.913e+02 2.191e+02 3.165e+02, threshold=3.825e+02, percent-clipped=0.0 2023-10-11 17:12:57,362 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=775884.6666666666, ans=0.0 2023-10-11 17:13:01,159 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=775884.6666666666, ans=0.1 2023-10-11 17:13:14,831 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=775978.0, ans=0.0 2023-10-11 17:13:22,065 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.24 vs. 
limit=22.5 2023-10-11 17:13:26,582 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=776024.6666666666, ans=0.0 2023-10-11 17:13:48,067 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.50 vs. limit=6.0 2023-10-11 17:13:57,346 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.54 vs. limit=22.5 2023-10-11 17:14:01,151 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.71 vs. limit=22.5 2023-10-11 17:14:08,615 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=776211.3333333334, ans=0.0 2023-10-11 17:14:11,958 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=776211.3333333334, ans=0.125 2023-10-11 17:14:22,078 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.58 vs. limit=15.0 2023-10-11 17:14:27,157 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=776258.0, ans=0.0 2023-10-11 17:14:36,395 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.56 vs. limit=15.0 2023-10-11 17:14:41,813 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.518e+02 1.730e+02 1.888e+02 2.296e+02 3.046e+02, threshold=3.776e+02, percent-clipped=0.0 2023-10-11 17:14:42,829 INFO [train.py:1031] (0/4) Epoch 13, batch 2500, loss[loss=0.2025, simple_loss=0.2951, pruned_loss=0.05492, over 16799.00 frames. ], tot_loss[loss=0.2001, simple_loss=0.2894, pruned_loss=0.05536, over 23381126.41 frames. ], batch size: 188, lr: 2.75e-03, grad_scale: 32.0 2023-10-11 17:14:49,848 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=776351.3333333334, ans=0.0 2023-10-11 17:14:58,744 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=776398.0, ans=0.125 2023-10-11 17:15:14,796 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.71 vs. limit=15.0 2023-10-11 17:15:18,146 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=776491.3333333334, ans=0.1 2023-10-11 17:15:49,999 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=776631.3333333334, ans=0.0 2023-10-11 17:15:56,220 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=776631.3333333334, ans=0.125 2023-10-11 17:15:59,408 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.04 vs. 
limit=10.0 2023-10-11 17:16:07,780 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=776678.0, ans=0.125 2023-10-11 17:16:20,012 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=776724.6666666666, ans=0.125 2023-10-11 17:16:31,694 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=776771.3333333334, ans=0.025 2023-10-11 17:16:32,239 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.751e+02 1.965e+02 2.216e+02 2.740e+02, threshold=3.931e+02, percent-clipped=0.0 2023-10-11 17:16:32,609 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=776818.0, ans=0.07 2023-10-11 17:16:54,440 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.06 vs. limit=15.0 2023-10-11 17:17:01,530 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=776911.3333333334, ans=0.0 2023-10-11 17:17:10,080 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=776958.0, ans=0.125 2023-10-11 17:17:13,277 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.15 vs. limit=15.0 2023-10-11 17:17:16,586 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.66 vs. limit=15.0 2023-10-11 17:17:28,144 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=777004.6666666666, ans=0.0 2023-10-11 17:17:30,834 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=777051.3333333334, ans=0.0 2023-10-11 17:17:31,787 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=777051.3333333334, ans=0.125 2023-10-11 17:17:50,606 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.26 vs. limit=15.0 2023-10-11 17:18:09,529 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=777191.3333333334, ans=0.2 2023-10-11 17:18:25,912 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=777238.0, ans=0.125 2023-10-11 17:18:33,157 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=777238.0, ans=0.125 2023-10-11 17:18:35,528 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.355e+02 1.672e+02 1.796e+02 2.047e+02 3.044e+02, threshold=3.593e+02, percent-clipped=0.0 2023-10-11 17:18:42,709 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.06 vs. 
2023-10-11 17:18:59,923 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-11 17:19:17,715 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.20 vs. limit=15.0
2023-10-11 17:19:18,529 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=777424.6666666666, ans=0.07
2023-10-11 17:19:49,269 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=777518.0, ans=0.0
2023-10-11 17:19:54,939 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=777564.6666666666, ans=0.125
2023-10-11 17:19:55,831 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=777564.6666666666, ans=0.015
2023-10-11 17:20:00,371 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=777564.6666666666, ans=0.125
2023-10-11 17:20:03,296 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=777611.3333333334, ans=0.125
2023-10-11 17:20:23,593 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=777658.0, ans=0.0
2023-10-11 17:20:24,873 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=777658.0, ans=0.035
2023-10-11 17:20:33,565 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=777704.6666666666, ans=0.1
2023-10-11 17:20:39,532 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.344e+02 1.609e+02 1.790e+02 2.044e+02 2.889e+02, threshold=3.579e+02, percent-clipped=0.0
2023-10-11 17:20:48,074 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.75 vs. limit=10.0
2023-10-11 17:20:49,841 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.15 vs. limit=15.0
2023-10-11 17:20:53,841 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.28 vs. limit=15.0
2023-10-11 17:20:54,697 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=777798.0, ans=0.2
2023-10-11 17:20:54,905 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.61 vs. limit=15.0
2023-10-11 17:21:05,154 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=777844.6666666666, ans=0.0
2023-10-11 17:21:06,717 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.76 vs. limit=22.5
2023-10-11 17:21:09,037 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=777844.6666666666, ans=0.1
2023-10-11 17:21:35,914 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=777938.0, ans=0.2
2023-10-11 17:21:38,298 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=777938.0, ans=0.125
2023-10-11 17:21:48,315 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.84 vs. limit=22.5
2023-10-11 17:21:57,129 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=778031.3333333334, ans=0.09899494936611666
2023-10-11 17:22:21,029 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=778124.6666666666, ans=0.02
2023-10-11 17:22:39,069 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten.whitening_limit, batch_count=778171.3333333334, ans=22.5
2023-10-11 17:22:44,541 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.725e+02 1.875e+02 2.079e+02 2.847e+02, threshold=3.750e+02, percent-clipped=0.0
2023-10-11 17:22:55,461 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.56 vs. limit=15.0
2023-10-11 17:23:06,070 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-11 17:23:06,178 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=778264.6666666666, ans=0.2
2023-10-11 17:23:17,429 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=778311.3333333334, ans=0.1
2023-10-11 17:23:21,340 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=778311.3333333334, ans=0.125
2023-10-11 17:23:22,926 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=778358.0, ans=0.1
2023-10-11 17:23:26,290 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=778358.0, ans=0.125
2023-10-11 17:23:30,893 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=778358.0, ans=0.1
2023-10-11 17:23:33,414 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=778358.0, ans=0.125
2023-10-11 17:23:33,687 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.89 vs. limit=15.0
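The [optim.py:471] lines report quartiles of recently observed gradient norms together with a clipping threshold derived from them, with Clipping_scale=2.0 acting as a multiplier on the recent distribution. A rough reconstruction of that bookkeeping, assuming a median-based threshold and a 100-step window (both assumptions, not the actual optim.py logic):

import torch

class GradNormClipper:
    # Sketch only: params must be a list (not a generator), since it is
    # traversed twice per call.
    def __init__(self, clipping_scale: float = 2.0, window: int = 100):
        self.clipping_scale = clipping_scale
        self.window = window
        self.norms: list = []

    def clip_(self, params: list):
        # max_norm=inf: measure the total grad norm without rescaling yet.
        norm = float(torch.nn.utils.clip_grad_norm_(params, float("inf")))
        self.norms = (self.norms + [norm])[-self.window:]
        median = sorted(self.norms)[len(self.norms) // 2]
        threshold = self.clipping_scale * median
        clipped = norm > threshold
        if clipped:
            for p in params:
                if p.grad is not None:
                    p.grad.mul_(threshold / norm)
        # Quartiles of self.norms are what such a log line would report.
        return threshold, clipped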
2023-10-11 17:23:40,819 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=778404.6666666666, ans=0.0
2023-10-11 17:23:46,162 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=778451.3333333334, ans=0.125
2023-10-11 17:24:19,804 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=778591.3333333334, ans=0.09899494936611666
2023-10-11 17:24:21,604 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.04 vs. limit=15.0
2023-10-11 17:24:33,729 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=778638.0, ans=0.2
2023-10-11 17:24:42,885 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.346e+02 1.677e+02 1.775e+02 1.937e+02 2.601e+02, threshold=3.551e+02, percent-clipped=0.0
2023-10-11 17:24:42,942 INFO [train.py:1031] (0/4) Epoch 13, batch 3000, loss[loss=0.1974, simple_loss=0.2867, pruned_loss=0.05409, over 15967.00 frames. ], tot_loss[loss=0.1998, simple_loss=0.2888, pruned_loss=0.05542, over 25445341.41 frames. ], batch size: 43, lr: 2.75e-03, grad_scale: 16.0
2023-10-11 17:24:57,321 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=778731.3333333334, ans=0.125
2023-10-11 17:25:14,917 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=778778.0, ans=0.1
2023-10-11 17:25:23,499 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=778824.6666666666, ans=0.125
2023-10-11 17:25:25,995 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.66 vs. limit=6.0
2023-10-11 17:25:36,846 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.26 vs. limit=15.0
2023-10-11 17:25:37,622 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=778871.3333333334, ans=0.0
2023-10-11 17:25:49,760 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.78 vs. limit=12.0
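In the [train.py:1031] batch summaries, loss[... over N frames] refers to the current batch while tot_loss[...] is a frame-weighted average over everything seen so far, which is why its frame count climbs from 25445341.41 here toward 31147070.43 by batch 6000. A minimal sketch of that running mean (field names are invented for illustration, not train.py's):

class RunningLoss:
    def __init__(self) -> None:
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, batch_loss: float, batch_frames: float) -> None:
        # Each batch contributes loss * frames, so longer batches weigh more.
        self.loss_sum += batch_loss * batch_frames
        self.frames += batch_frames

    @property
    def tot_loss(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)

avg = RunningLoss()
avg.update(0.1974, 15967.0)  # values from the batch 3000 summary above
print(f"tot_loss so far: {avg.tot_loss:.4f} over {avg.frames:.2f} frames")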
2023-10-11 17:25:50,418 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=778918.0, ans=0.0
2023-10-11 17:26:03,518 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=779011.3333333334, ans=0.0
2023-10-11 17:26:17,046 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=779058.0, ans=0.2
2023-10-11 17:26:41,563 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=779151.3333333334, ans=0.07
2023-10-11 17:26:42,059 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.738e+02 1.954e+02 2.228e+02 3.012e+02, threshold=3.909e+02, percent-clipped=0.0
2023-10-11 17:26:47,296 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=779151.3333333334, ans=0.125
2023-10-11 17:27:01,558 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=779198.0, ans=0.0
2023-10-11 17:27:03,475 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=779198.0, ans=0.04949747468305833
2023-10-11 17:27:05,697 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=779198.0, ans=0.0
2023-10-11 17:27:08,291 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=779244.6666666666, ans=0.09899494936611666
2023-10-11 17:27:13,826 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=779244.6666666666, ans=0.0
2023-10-11 17:27:13,903 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=779244.6666666666, ans=0.125
2023-10-11 17:27:17,430 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=779244.6666666666, ans=0.125
2023-10-11 17:27:17,432 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=779244.6666666666, ans=0.125
2023-10-11 17:27:45,452 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.99 vs. limit=15.0
2023-10-11 17:28:03,560 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=779478.0, ans=0.1
2023-10-11 17:28:03,841 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.02 vs. limit=15.0
2023-10-11 17:28:17,285 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=779524.6666666666, ans=0.1
2023-10-11 17:28:19,368 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=779524.6666666666, ans=0.0
2023-10-11 17:28:32,203 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.15 vs. limit=15.0
2023-10-11 17:28:34,270 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.00 vs. limit=15.0
2023-10-11 17:28:38,040 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=779571.3333333334, ans=0.07
2023-10-11 17:28:38,858 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=779571.3333333334, ans=0.0
2023-10-11 17:28:40,639 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.339e+02 1.642e+02 1.814e+02 2.078e+02 2.832e+02, threshold=3.627e+02, percent-clipped=0.0
2023-10-11 17:28:40,938 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=779618.0, ans=0.125
2023-10-11 17:28:46,259 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.08 vs. limit=15.0
2023-10-11 17:29:17,090 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=779711.3333333334, ans=10.0
2023-10-11 17:29:27,872 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=779758.0, ans=0.0
2023-10-11 17:29:51,350 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=779851.3333333334, ans=0.0
2023-10-11 17:29:59,657 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.65 vs. limit=6.0
2023-10-11 17:30:23,667 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=779944.6666666666, ans=0.0
2023-10-11 17:30:38,541 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-10-11 17:30:41,583 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=780038.0, ans=0.125
2023-10-11 17:30:49,199 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.301e+02 1.683e+02 1.870e+02 2.036e+02 2.917e+02, threshold=3.741e+02, percent-clipped=0.0
2023-10-11 17:30:55,111 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-11 17:30:57,826 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=780084.6666666666, ans=0.2
2023-10-11 17:31:16,971 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=780178.0, ans=0.0
2023-10-11 17:31:17,035 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=780178.0, ans=0.2
2023-10-11 17:31:17,058 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=780178.0, ans=0.0
2023-10-11 17:31:35,918 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=780271.3333333334, ans=6.0
2023-10-11 17:32:01,742 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=780364.6666666666, ans=0.0
2023-10-11 17:32:03,624 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=780364.6666666666, ans=0.125
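The Whitening lines compare a per-module covariance statistic against a scheduled limit; values above the limit indicate activations whose variance has collapsed into a few directions, which the whitening penalty then pushes back toward isotropy. The metric below, the mean squared covariance eigenvalue over the squared mean eigenvalue, is one plausible reading of such a statistic and is offered as an assumption, not as the scaling.py formula:

import torch

def whitening_metric(x: torch.Tensor) -> float:
    # x: (num_frames, num_channels) activations from one module.
    # A perfectly whitened feature map has equal covariance eigenvalues
    # and scores 1.0; the score grows as variance concentrates.
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.T @ x) / x.shape[0]            # (C, C) covariance estimate
    eigs = torch.linalg.eigvalsh(cov)       # real eigenvalues, ascending
    return float((eigs ** 2).mean() / (eigs.mean() ** 2 + 1e-20))

x = torch.randn(1000, 384)                  # roughly whitened input
print(whitening_metric(x))                  # close to 1.0
print(whitening_metric(x * torch.rand(384)))  # anisotropic -> larger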
2023-10-11 17:32:15,922 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=780411.3333333334, ans=0.05
2023-10-11 17:32:37,718 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=780504.6666666666, ans=0.125
2023-10-11 17:32:45,701 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.399e+02 1.734e+02 1.884e+02 2.059e+02 3.236e+02, threshold=3.768e+02, percent-clipped=0.0
2023-10-11 17:33:04,037 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.06 vs. limit=22.5
2023-10-11 17:33:06,256 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=780598.0, ans=0.07
2023-10-11 17:33:50,136 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=780784.6666666666, ans=22.5
2023-10-11 17:33:57,509 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=780831.3333333334, ans=0.0
2023-10-11 17:33:58,401 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-11 17:34:03,837 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2023-10-11 17:34:21,187 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=780924.6666666666, ans=0.125
2023-10-11 17:34:26,636 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.92 vs. limit=10.0
2023-10-11 17:34:34,030 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=780971.3333333334, ans=0.125
2023-10-11 17:34:37,113 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=780971.3333333334, ans=0.0
2023-10-11 17:34:41,134 INFO [train.py:1031] (0/4) Epoch 13, batch 3500, loss[loss=0.2603, simple_loss=0.3258, pruned_loss=0.09742, over 15796.00 frames. ], tot_loss[loss=0.1999, simple_loss=0.2889, pruned_loss=0.0555, over 27081674.45 frames. ], batch size: 350, lr: 2.74e-03, grad_scale: 16.0
2023-10-11 17:34:42,030 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.389e+02 1.705e+02 1.908e+02 2.146e+02 2.814e+02, threshold=3.816e+02, percent-clipped=0.0
2023-10-11 17:34:56,793 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=781064.6666666666, ans=0.0
2023-10-11 17:35:02,735 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=781111.3333333334, ans=0.2
2023-10-11 17:35:14,234 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=781111.3333333334, ans=0.2
2023-10-11 17:35:24,463 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=781158.0, ans=0.0
2023-10-11 17:35:45,574 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=781251.3333333334, ans=0.07
2023-10-11 17:36:01,713 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=781344.6666666666, ans=0.0
2023-10-11 17:36:03,412 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=781344.6666666666, ans=0.125
2023-10-11 17:36:03,496 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=781344.6666666666, ans=0.2
2023-10-11 17:36:07,422 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=781344.6666666666, ans=0.07
2023-10-11 17:36:46,051 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.417e+02 1.731e+02 1.931e+02 2.128e+02 3.547e+02, threshold=3.862e+02, percent-clipped=0.0
2023-10-11 17:37:11,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=781578.0, ans=0.125
2023-10-11 17:37:31,948 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=781624.6666666666, ans=0.125
2023-10-11 17:38:13,994 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=781811.3333333334, ans=0.125
2023-10-11 17:38:35,797 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-11 17:38:40,522 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.364e+02 1.666e+02 1.787e+02 1.979e+02 3.025e+02, threshold=3.575e+02, percent-clipped=0.0
2023-10-11 17:39:49,954 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=782184.6666666666, ans=0.1
2023-10-11 17:39:51,874 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=782184.6666666666, ans=0.1
2023-10-11 17:40:17,179 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=782278.0, ans=0.0
2023-10-11 17:40:19,386 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=782324.6666666666, ans=0.125
2023-10-11 17:40:43,443 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.700e+02 1.874e+02 2.110e+02 2.990e+02, threshold=3.747e+02, percent-clipped=0.0
2023-10-11 17:41:00,431 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=782464.6666666666, ans=0.025
2023-10-11 17:41:04,179 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=782464.6666666666, ans=0.0
2023-10-11 17:41:05,274 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=782511.3333333334, ans=0.125
2023-10-11 17:41:12,069 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=782511.3333333334, ans=0.125
2023-10-11 17:41:31,270 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.14 vs. limit=10.0
2023-10-11 17:41:43,944 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.00 vs. limit=22.5
2023-10-11 17:41:43,978 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.41 vs. limit=12.0
2023-10-11 17:42:02,283 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.77 vs. limit=6.0
2023-10-11 17:42:20,264 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=782838.0, ans=0.125
2023-10-11 17:42:24,067 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=782838.0, ans=0.125
2023-10-11 17:42:32,225 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.656e+02 1.815e+02 2.082e+02 3.300e+02, threshold=3.629e+02, percent-clipped=0.0
2023-10-11 17:42:54,418 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=782978.0, ans=0.125
2023-10-11 17:42:56,521 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=782978.0, ans=0.1
2023-10-11 17:42:57,583 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.31 vs. limit=22.5
2023-10-11 17:43:04,084 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=783024.6666666666, ans=0.1
2023-10-11 17:43:12,924 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=783071.3333333334, ans=0.0
2023-10-11 17:43:13,727 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=783071.3333333334, ans=0.125
2023-10-11 17:43:18,128 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.90 vs. limit=15.0
2023-10-11 17:43:56,711 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=783258.0, ans=0.1
2023-10-11 17:44:01,169 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=783258.0, ans=0.2
2023-10-11 17:44:05,718 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=783304.6666666666, ans=0.125
2023-10-11 17:44:13,439 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.01 vs. limit=15.0
2023-10-11 17:44:16,753 INFO [train.py:1031] (0/4) Epoch 13, batch 4000, loss[loss=0.1933, simple_loss=0.2904, pruned_loss=0.04806, over 16807.00 frames. ], tot_loss[loss=0.1996, simple_loss=0.2883, pruned_loss=0.05543, over 28332097.66 frames. ], batch size: 175, lr: 2.74e-03, grad_scale: 32.0
2023-10-11 17:44:18,381 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=783351.3333333334, ans=0.125
2023-10-11 17:44:18,939 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.384e+02 1.745e+02 1.949e+02 2.315e+02 3.708e+02, threshold=3.898e+02, percent-clipped=2.0
2023-10-11 17:44:24,630 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=783351.3333333334, ans=0.0
2023-10-11 17:44:51,004 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=783444.6666666666, ans=0.0
2023-10-11 17:45:10,949 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=783538.0, ans=0.2
2023-10-11 17:45:12,832 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=783584.6666666666, ans=0.125
2023-10-11 17:45:14,984 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=783584.6666666666, ans=0.1
2023-10-11 17:45:26,658 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=783631.3333333334, ans=0.125
2023-10-11 17:45:31,821 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=783631.3333333334, ans=15.0
2023-10-11 17:45:34,664 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=783678.0, ans=0.0
2023-10-11 17:45:36,746 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=783678.0, ans=0.125
2023-10-11 17:45:59,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=783771.3333333334, ans=0.125
2023-10-11 17:46:00,855 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=783771.3333333334, ans=0.125
2023-10-11 17:46:07,014 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=783818.0, ans=0.1
2023-10-11 17:46:09,535 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.483e+02 1.734e+02 1.882e+02 2.088e+02 2.775e+02, threshold=3.763e+02, percent-clipped=0.0
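grad_scale in the batch summaries (16.0 earlier, 32.0 from batch 4000 on) is the dynamic fp16 loss scale: it grows while steps stay finite and shrinks on overflow. The standard PyTorch AMP mechanism below shows that behavior; it is a generic sketch run on a CUDA device, not the project's training loop:

import torch

# Dynamic loss scaling with torch.cuda.amp; the scale reported per batch
# corresponds to scaler.get_scale().
scaler = torch.cuda.amp.GradScaler(init_scale=1.0, growth_interval=2000)

def train_step(model, optimizer, batch, loss_fn):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = loss_fn(model(batch["inputs"]), batch["targets"])
    scaler.scale(loss).backward()  # backprop on the scaled loss
    scaler.step(optimizer)         # unscales grads, skips step on inf/nan
    scaler.update()                # grows or shrinks the scale dynamically
    return loss.item(), scaler.get_scale()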
2023-10-11 17:46:15,333 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=783818.0, ans=0.1
2023-10-11 17:46:16,363 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=783818.0, ans=0.1
2023-10-11 17:46:26,936 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=783864.6666666666, ans=0.125
2023-10-11 17:46:35,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=783911.3333333334, ans=0.125
2023-10-11 17:46:47,889 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=783958.0, ans=0.125
2023-10-11 17:46:48,877 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=783958.0, ans=0.0
2023-10-11 17:46:53,015 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-168000.pt
2023-10-11 17:46:56,725 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=784004.6666666666, ans=0.0
2023-10-11 17:47:32,117 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.31 vs. limit=15.0
2023-10-11 17:48:20,321 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.349e+02 1.649e+02 1.822e+02 2.079e+02 2.698e+02, threshold=3.643e+02, percent-clipped=0.0
2023-10-11 17:48:44,553 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_positive, batch_count=784378.0, ans=0.05
2023-10-11 17:48:50,276 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=784378.0, ans=0.125
2023-10-11 17:48:58,659 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=784424.6666666666, ans=0.1
2023-10-11 17:49:00,709 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=12.0
2023-10-11 17:49:11,100 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=784471.3333333334, ans=0.1
2023-10-11 17:49:20,069 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.89 vs. limit=15.0
2023-10-11 17:49:24,469 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=784564.6666666666, ans=0.125
2023-10-11 17:49:40,904 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=784611.3333333334, ans=0.125
2023-10-11 17:50:08,725 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.512e+02 1.789e+02 2.022e+02 2.406e+02 3.663e+02, threshold=4.044e+02, percent-clipped=1.0
2023-10-11 17:50:12,614 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=784751.3333333334, ans=0.5
2023-10-11 17:50:21,729 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=784798.0, ans=0.125
2023-10-11 17:50:44,496 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=784891.3333333334, ans=0.1
2023-10-11 17:50:45,720 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.33 vs. limit=10.0
2023-10-11 17:50:49,493 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=784938.0, ans=0.0
2023-10-11 17:51:21,124 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=785031.3333333334, ans=0.1
2023-10-11 17:51:29,807 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=785078.0, ans=0.1
2023-10-11 17:51:39,354 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=785124.6666666666, ans=0.09899494936611666
2023-10-11 17:51:45,217 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=785124.6666666666, ans=0.0
2023-10-11 17:51:59,207 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=785218.0, ans=0.125
2023-10-11 17:52:02,534 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.459e+02 1.715e+02 1.902e+02 2.068e+02 2.692e+02, threshold=3.805e+02, percent-clipped=0.0
2023-10-11 17:52:03,897 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=785218.0, ans=0.125
2023-10-11 17:52:26,668 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=785311.3333333334, ans=0.125
2023-10-11 17:52:56,926 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=785404.6666666666, ans=0.0
2023-10-11 17:53:03,499 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-11 17:53:09,463 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=785451.3333333334, ans=0.1
2023-10-11 17:53:21,819 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=785498.0, ans=0.125
2023-10-11 17:54:01,886 INFO [train.py:1031] (0/4) Epoch 13, batch 4500, loss[loss=0.1968, simple_loss=0.2599, pruned_loss=0.06686, over 12720.00 frames. ], tot_loss[loss=0.1995, simple_loss=0.2886, pruned_loss=0.05521, over 29326166.63 frames. ], batch size: 440, lr: 2.74e-03, grad_scale: 32.0
2023-10-11 17:54:05,397 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.334e+02 1.714e+02 1.895e+02 2.126e+02 2.850e+02, threshold=3.789e+02, percent-clipped=0.0
2023-10-11 17:54:06,030 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.56 vs. limit=15.0
2023-10-11 17:55:05,115 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=785964.6666666666, ans=0.0
2023-10-11 17:55:11,415 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=785964.6666666666, ans=0.125
2023-10-11 17:55:11,432 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=785964.6666666666, ans=0.125
2023-10-11 17:55:12,353 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=786011.3333333334, ans=0.125
2023-10-11 17:55:14,073 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=786011.3333333334, ans=0.0
2023-10-11 17:55:19,611 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=786011.3333333334, ans=0.0
2023-10-11 17:55:23,285 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=786058.0, ans=0.0
2023-10-11 17:55:24,284 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=786058.0, ans=0.0
2023-10-11 17:55:43,916 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=786151.3333333334, ans=0.0
2023-10-11 17:55:45,775 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=786151.3333333334, ans=0.2
2023-10-11 17:55:46,816 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=786151.3333333334, ans=0.0
2023-10-11 17:55:47,372 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.365e+02 1.662e+02 1.817e+02 2.086e+02 2.666e+02, threshold=3.633e+02, percent-clipped=0.0
2023-10-11 17:55:47,937 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.93 vs. limit=12.0
2023-10-11 17:56:21,665 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=786291.3333333334, ans=0.2
2023-10-11 17:56:43,233 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=786384.6666666666, ans=0.0
2023-10-11 17:56:52,167 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=786431.3333333334, ans=0.0
2023-10-11 17:56:55,089 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=786431.3333333334, ans=0.125
2023-10-11 17:56:55,964 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=786431.3333333334, ans=0.0
2023-10-11 17:57:09,697 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=786524.6666666666, ans=0.2
2023-10-11 17:57:35,010 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.724e+02 1.862e+02 2.083e+02 2.841e+02, threshold=3.725e+02, percent-clipped=0.0
2023-10-11 17:58:42,481 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=786898.0, ans=0.1
2023-10-11 17:58:55,412 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=786991.3333333334, ans=0.0
2023-10-11 17:59:03,132 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=786991.3333333334, ans=0.2
2023-10-11 17:59:04,117 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.16 vs. limit=22.5
2023-10-11 17:59:13,811 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=787038.0, ans=0.125
2023-10-11 17:59:18,571 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=787038.0, ans=0.125
2023-10-11 17:59:23,346 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.415e+02 1.740e+02 1.867e+02 2.029e+02 2.533e+02, threshold=3.734e+02, percent-clipped=0.0
2023-10-11 17:59:25,613 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.99 vs. limit=6.0
2023-10-11 17:59:31,880 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.47 vs. limit=15.0
2023-10-11 18:00:17,733 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=787271.3333333334, ans=0.125
2023-10-11 18:00:21,361 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=787271.3333333334, ans=0.2
2023-10-11 18:00:25,703 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-11 18:00:32,605 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=787364.6666666666, ans=0.125
2023-10-11 18:00:34,157 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=787364.6666666666, ans=0.1
2023-10-11 18:00:49,631 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=787411.3333333334, ans=0.125
2023-10-11 18:01:02,524 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.67 vs. limit=15.0
2023-10-11 18:01:23,607 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.348e+02 1.645e+02 1.780e+02 1.964e+02 2.436e+02, threshold=3.560e+02, percent-clipped=0.0
2023-10-11 18:01:30,016 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.78 vs. limit=12.0
2023-10-11 18:01:40,570 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.57 vs. limit=22.5
2023-10-11 18:01:43,820 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=787644.6666666666, ans=0.125
2023-10-11 18:01:44,830 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=787644.6666666666, ans=0.0
2023-10-11 18:01:56,051 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=787691.3333333334, ans=0.125
2023-10-11 18:02:03,260 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=787691.3333333334, ans=0.0
2023-10-11 18:02:06,976 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-11 18:02:41,726 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=787878.0, ans=0.0
2023-10-11 18:02:47,643 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=787878.0, ans=0.125
2023-10-11 18:02:48,503 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=787878.0, ans=0.125
2023-10-11 18:02:56,339 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=787924.6666666666, ans=0.125
2023-10-11 18:03:16,786 INFO [train.py:1031] (0/4) Epoch 13, batch 5000, loss[loss=0.2, simple_loss=0.2904, pruned_loss=0.05483, over 16898.00 frames. ], tot_loss[loss=0.1995, simple_loss=0.2882, pruned_loss=0.05541, over 30037106.52 frames. ], batch size: 110, lr: 2.73e-03, grad_scale: 32.0
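The learning rate drifts down very slowly across the summaries (2.75e-03 at batch 3000, 2.73e-03 by batch 5000), consistent with a power-law schedule in both batch index and epoch, as in icefall's Eden scheduler. The version below paraphrases that shape; the exponents and reference constants are assumptions and only land in the right order of magnitude for the logged values:

def eden_lr(base_lr: float, batch: int, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 3.5) -> float:
    # Power-law decay in both the global batch index and the epoch;
    # constants here are illustrative, not the run's actual settings.
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor

# This deep into training the curve is nearly flat, matching the tiny
# 2.75e-03 -> 2.73e-03 drift between summaries (the value printed here is
# of the same order as the logged lr; the run's true constants differ):
print(eden_lr(0.045, batch=778685, epoch=13.0))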
2023-10-11 18:03:19,526 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.736e+02 1.980e+02 2.174e+02 3.063e+02, threshold=3.961e+02, percent-clipped=0.0
2023-10-11 18:03:23,708 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=788018.0, ans=0.0
2023-10-11 18:03:25,126 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.98 vs. limit=10.0
2023-10-11 18:03:27,806 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=788064.6666666666, ans=0.2
2023-10-11 18:03:33,071 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=788064.6666666666, ans=0.125
2023-10-11 18:03:35,773 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=788064.6666666666, ans=0.125
2023-10-11 18:03:38,807 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.72 vs. limit=22.5
2023-10-11 18:03:44,304 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=788111.3333333334, ans=0.0
2023-10-11 18:03:55,668 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=788158.0, ans=0.125
2023-10-11 18:03:59,296 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.98 vs. limit=22.5
2023-10-11 18:04:00,251 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.11 vs. limit=6.0
2023-10-11 18:04:04,405 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=788204.6666666666, ans=0.07
2023-10-11 18:04:17,519 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=788251.3333333334, ans=0.0
2023-10-11 18:04:21,146 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=788251.3333333334, ans=0.125
2023-10-11 18:04:23,259 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=788298.0, ans=0.0
2023-10-11 18:04:27,531 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=788298.0, ans=0.2
2023-10-11 18:04:33,221 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=788344.6666666666, ans=0.2
2023-10-11 18:04:53,001 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=788391.3333333334, ans=0.125
2023-10-11 18:04:56,685 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=788438.0, ans=0.125
2023-10-11 18:05:08,609 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=788438.0, ans=0.1
2023-10-11 18:05:15,313 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.725e+02 1.868e+02 2.180e+02 2.976e+02, threshold=3.736e+02, percent-clipped=0.0
2023-10-11 18:05:15,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=788484.6666666666, ans=0.125
2023-10-11 18:05:22,661 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.94 vs. limit=15.0
2023-10-11 18:05:24,956 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=788531.3333333334, ans=0.125
2023-10-11 18:05:45,957 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.26 vs. limit=12.0
2023-10-11 18:06:29,763 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=788811.3333333334, ans=0.1
2023-10-11 18:06:40,513 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=788858.0, ans=0.2
2023-10-11 18:06:40,528 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=788858.0, ans=0.125
2023-10-11 18:06:55,565 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=788904.6666666666, ans=0.125
2023-10-11 18:07:00,135 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=788904.6666666666, ans=0.0
2023-10-11 18:07:06,202 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.478e+02 1.676e+02 1.804e+02 1.948e+02 2.868e+02, threshold=3.608e+02, percent-clipped=0.0
2023-10-11 18:07:10,554 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=788951.3333333334, ans=0.09899494936611666
2023-10-11 18:07:21,534 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=788998.0, ans=0.125
2023-10-11 18:07:28,143 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.30 vs. limit=15.0
2023-10-11 18:07:34,150 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=789091.3333333334, ans=0.0
2023-10-11 18:07:35,586 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=789091.3333333334, ans=0.0
2023-10-11 18:07:42,680 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=789091.3333333334, ans=0.1
2023-10-11 18:07:54,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=789138.0, ans=0.125
2023-10-11 18:07:54,291 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=789138.0, ans=0.125
2023-10-11 18:07:59,285 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=789184.6666666666, ans=0.2
2023-10-11 18:08:17,100 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=789231.3333333334, ans=0.1
2023-10-11 18:08:18,667 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.36 vs. limit=15.0
2023-10-11 18:08:23,026 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=789278.0, ans=0.0
2023-10-11 18:08:24,101 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-11 18:08:33,262 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=789324.6666666666, ans=0.125
2023-10-11 18:08:36,173 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=789324.6666666666, ans=0.1
2023-10-11 18:08:36,191 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-11 18:08:36,213 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=789324.6666666666, ans=0.07
2023-10-11 18:08:57,742 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=789418.0, ans=0.1
2023-10-11 18:08:59,291 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 1.705e+02 1.859e+02 2.067e+02 3.041e+02, threshold=3.719e+02, percent-clipped=0.0
2023-10-11 18:09:15,875 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=789464.6666666666, ans=0.125
2023-10-11 18:09:20,245 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=789511.3333333334, ans=0.2
2023-10-11 18:09:38,624 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=789558.0, ans=0.0
2023-10-11 18:10:18,409 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=6.20 vs. limit=15.0
2023-10-11 18:10:19,879 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=789744.6666666666, ans=0.1
2023-10-11 18:10:23,587 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=789744.6666666666, ans=0.1
2023-10-11 18:10:26,160 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=789791.3333333334, ans=0.0
2023-10-11 18:10:52,116 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.252e+02 1.588e+02 1.755e+02 2.001e+02 3.148e+02, threshold=3.511e+02, percent-clipped=0.0
2023-10-11 18:10:53,958 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=789884.6666666666, ans=0.1
2023-10-11 18:11:19,182 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=790024.6666666666, ans=0.125
2023-10-11 18:11:51,295 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.50 vs. limit=15.0
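The WithLoss lines report the accumulated auxiliary penalty attached to a self_attn_weights module; loss-sum=0.000e+00 throughout this stretch means that regularizer is currently contributing nothing. A hypothetical module of that shape (invented names and penalty, not the scaling.py code):

import torch

class WithAuxLoss(torch.nn.Module):
    # Hypothetical sketch: wraps attention weights, accumulates a named
    # auxiliary penalty, and exposes the running sum for logging.
    def __init__(self, name: str, penalty_weight: float = 0.0):
        super().__init__()
        self.name = name
        self.penalty_weight = penalty_weight
        self.loss_sum = 0.0  # the value a "loss-sum=..." line would report

    def forward(self, attn_weights: torch.Tensor) -> torch.Tensor:
        if self.training and self.penalty_weight > 0.0:
            penalty = self.penalty_weight * attn_weights.pow(2).mean()
            self.loss_sum += float(penalty.detach())
            # In a real setup this penalty would also be added to the main
            # training loss; with weight 0.0 the sum stays at 0.000e+00.
        return attn_weights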
2023-10-11 18:12:00,059 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=790164.6666666666, ans=0.125
2023-10-11 18:12:35,871 INFO [train.py:1031] (0/4) Epoch 13, batch 5500, loss[loss=0.1823, simple_loss=0.2778, pruned_loss=0.04337, over 16842.00 frames. ], tot_loss[loss=0.1993, simple_loss=0.2882, pruned_loss=0.05521, over 30666514.41 frames. ], batch size: 175, lr: 2.73e-03, grad_scale: 16.0
2023-10-11 18:12:39,861 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.472e+02 1.709e+02 1.875e+02 2.088e+02 3.135e+02, threshold=3.751e+02, percent-clipped=0.0
2023-10-11 18:12:58,528 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=790444.6666666666, ans=0.0
2023-10-11 18:13:00,297 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=790444.6666666666, ans=0.2
2023-10-11 18:13:01,310 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=790444.6666666666, ans=0.0
2023-10-11 18:13:03,426 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=790444.6666666666, ans=0.125
2023-10-11 18:13:08,224 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.92 vs. limit=22.5
2023-10-11 18:13:13,586 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.58 vs. limit=15.0
2023-10-11 18:13:41,923 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=790631.3333333334, ans=0.0
2023-10-11 18:13:52,805 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.80 vs. limit=22.5
2023-10-11 18:13:57,247 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff3.min_abs, batch_count=790678.0, ans=0.2
2023-10-11 18:13:58,477 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.99 vs. limit=15.0
2023-10-11 18:14:01,312 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=790678.0, ans=0.2
2023-10-11 18:14:05,394 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=790724.6666666666, ans=0.0
2023-10-11 18:14:26,538 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=790818.0, ans=0.1
2023-10-11 18:14:28,787 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.379e+02 1.677e+02 1.810e+02 2.036e+02 3.466e+02, threshold=3.621e+02, percent-clipped=0.0
2023-10-11 18:14:35,208 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=790864.6666666666, ans=0.0
2023-10-11 18:14:55,847 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=790911.3333333334, ans=0.125
2023-10-11 18:15:33,498 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.93 vs. limit=10.0
2023-10-11 18:15:45,113 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=791144.6666666666, ans=0.0
2023-10-11 18:15:46,373 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-11 18:15:53,147 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=791144.6666666666, ans=0.125
2023-10-11 18:16:00,947 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=791191.3333333334, ans=0.0
2023-10-11 18:16:06,073 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=791191.3333333334, ans=0.2
2023-10-11 18:16:09,189 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=791238.0, ans=0.1
2023-10-11 18:16:11,039 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=791238.0, ans=0.0
2023-10-11 18:16:14,769 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=791238.0, ans=0.025
2023-10-11 18:16:15,754 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=791238.0, ans=0.125
2023-10-11 18:16:23,853 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=791284.6666666666, ans=0.125
2023-10-11 18:16:25,328 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.389e+02 1.740e+02 1.953e+02 2.271e+02 2.998e+02, threshold=3.907e+02, percent-clipped=0.0
2023-10-11 18:16:48,207 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.28 vs. limit=22.5
2023-10-11 18:17:02,585 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.93 vs. limit=22.5
2023-10-11 18:17:03,423 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.83 vs. limit=12.0
2023-10-11 18:17:17,708 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=791518.0, ans=0.125
2023-10-11 18:17:35,994 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=791564.6666666666, ans=0.0
2023-10-11 18:17:48,270 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=791658.0, ans=0.0
2023-10-11 18:17:57,897 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=791658.0, ans=0.125
2023-10-11 18:18:03,732 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.15 vs. limit=12.0
2023-10-11 18:18:11,044 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=791751.3333333334, ans=0.2
2023-10-11 18:18:11,462 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=15.0
2023-10-11 18:18:14,006 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=791751.3333333334, ans=0.0
2023-10-11 18:18:17,089 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.672e+02 1.808e+02 1.993e+02 3.066e+02, threshold=3.617e+02, percent-clipped=0.0
2023-10-11 18:18:22,008 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.15 vs. limit=15.0
2023-10-11 18:18:40,676 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=791844.6666666666, ans=0.1
2023-10-11 18:18:41,531 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=791844.6666666666, ans=0.0
2023-10-11 18:18:47,708 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=791891.3333333334, ans=0.1
2023-10-11 18:19:09,191 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.43 vs. limit=15.0
2023-10-11 18:19:10,258 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=791984.6666666666, ans=0.125
2023-10-11 18:19:12,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=791984.6666666666, ans=0.0
2023-10-11 18:19:47,699 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=792124.6666666666, ans=0.1
2023-10-11 18:20:00,663 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=792171.3333333334, ans=0.1
2023-10-11 18:20:11,207 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.445e+02 1.658e+02 1.820e+02 2.036e+02 3.402e+02, threshold=3.641e+02, percent-clipped=0.0
2023-10-11 18:20:22,797 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=792264.6666666666, ans=0.0
2023-10-11 18:20:28,503 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=792311.3333333334, ans=0.1
2023-10-11 18:20:41,391 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-11 18:20:44,931 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=792358.0, ans=0.0
2023-10-11 18:21:23,012 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=792544.6666666666, ans=0.1
2023-10-11 18:21:27,398 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=792544.6666666666, ans=0.125
2023-10-11 18:21:56,829 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=792638.0, ans=0.125
2023-10-11 18:21:59,007 INFO [train.py:1031] (0/4) Epoch 13, batch 6000, loss[loss=0.2219, simple_loss=0.3097, pruned_loss=0.06706, over 16393.00 frames. ], tot_loss[loss=0.1998, simple_loss=0.2887, pruned_loss=0.05547, over 31147070.43 frames.
], batch size: 50, lr: 2.72e-03, grad_scale: 32.0 2023-10-11 18:22:02,654 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=792684.6666666666, ans=0.125 2023-10-11 18:22:04,823 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.332e+02 1.726e+02 1.929e+02 2.173e+02 3.653e+02, threshold=3.859e+02, percent-clipped=1.0 2023-10-11 18:22:12,058 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=792731.3333333334, ans=0.1 2023-10-11 18:22:14,648 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=792731.3333333334, ans=0.07 2023-10-11 18:22:18,345 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=792731.3333333334, ans=0.04949747468305833 2023-10-11 18:22:29,403 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=792778.0, ans=10.0 2023-10-11 18:22:29,509 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=792778.0, ans=0.125 2023-10-11 18:22:32,341 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=792824.6666666666, ans=0.2 2023-10-11 18:22:38,266 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.95 vs. limit=12.0 2023-10-11 18:22:40,367 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=792824.6666666666, ans=0.5 2023-10-11 18:22:40,411 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=792824.6666666666, ans=0.125 2023-10-11 18:23:01,046 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=792918.0, ans=0.125 2023-10-11 18:23:08,155 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=792964.6666666666, ans=0.1 2023-10-11 18:23:11,040 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=792964.6666666666, ans=0.0 2023-10-11 18:23:14,564 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=792964.6666666666, ans=0.125 2023-10-11 18:23:27,745 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=793058.0, ans=0.04949747468305833 2023-10-11 18:23:32,181 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=793058.0, ans=0.125 2023-10-11 18:23:32,999 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=793058.0, ans=0.025 2023-10-11 18:23:45,268 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=793104.6666666666, ans=0.125 2023-10-11 18:23:49,993 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=793151.3333333334, ans=0.125 
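Editor's note: the [optim.py:471] entries above pair grad-norm quartiles (min, 25%, median, 75%, max) with a clipping threshold, and in every entry in this stretch the threshold is about 2.0 times the logged median (e.g. 2.0 × 1.875e+02 ≈ 3.751e+02 with Clipping_scale=2.0). Below is a minimal sketch of that bookkeeping, assuming the threshold really is clipping_scale × median over a window of recent norms; the function name and the windowing are illustrative, not the actual optimizer code.

```python
import torch

def clipping_report(grad_norms: torch.Tensor, clipping_scale: float = 2.0):
    """Summarize a window of recent gradient norms the way the log does.

    grad_norms: 1-D tensor of total gradient norms from recent batches.
    Returns (quartiles, threshold, percent_clipped). The threshold is taken
    as clipping_scale * median, which matches the logged numbers
    (2.0 * 1.875e+02 ~= 3.751e+02); the real optimizer may maintain its
    statistics differently.
    """
    q = torch.quantile(grad_norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * q[2]
    percent_clipped = 100.0 * (grad_norms > threshold).float().mean()
    return q, threshold, percent_clipped

# Five synthetic norms reproducing the shape of one log entry:
norms = torch.tensor([147.2, 170.9, 187.5, 208.8, 313.5])
quartiles, threshold, pct = clipping_report(norms)
print(quartiles, threshold, pct)  # threshold ~= 375.0, pct = 0.0
```

percent-clipped then counts how often a batch's norm exceeded that threshold, which is why it is 0.0 in most entries here and 1.0 only occasionally.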
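Editor's note: most of the [scaling.py:199] traffic consists of ScheduledFloat values (skip rates, balancer probabilities, dropout p) sampled at the current batch_count. The log only shows the resulting value ('ans') at each batch_count, not the schedule itself; a plausible minimal model is a piecewise-linear function of batch count, clamped at its endpoints. The class name and the breakpoints below are invented for illustration.

```python
import bisect

class PiecewiseLinearFloat:
    """Minimal stand-in for a ScheduledFloat-style value: a float that is a
    piecewise-linear function of the global batch count, clamped at the ends.
    """

    def __init__(self, *points):
        # points: (batch_count, value) pairs, sorted by batch_count.
        self.xs = [p[0] for p in points]
        self.ys = [p[1] for p in points]

    def value_at(self, batch_count: float) -> float:
        if batch_count <= self.xs[0]:
            return self.ys[0]
        if batch_count >= self.xs[-1]:
            return self.ys[-1]
        i = bisect.bisect_right(self.xs, batch_count)
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        t = (batch_count - x0) / (x1 - x0)
        return y0 + t * (y1 - y0)

# A skip rate that decays from 0.2 to 0.0 over the first 20k batches, then
# stays at 0.0 -- consistent with the many skip rates logged as ans=0.0
# this late in training (batch_count ~ 790k-800k).
skip_rate = PiecewiseLinearFloat((0, 0.2), (20000, 0.0))
print(skip_rate.value_at(0), skip_rate.value_at(790444))  # 0.2 0.0
```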
2023-10-11 18:23:50,896 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=793151.3333333334, ans=0.2 2023-10-11 18:23:52,797 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=793151.3333333334, ans=0.125 2023-10-11 18:23:54,255 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.447e+02 1.705e+02 1.851e+02 2.019e+02 3.224e+02, threshold=3.701e+02, percent-clipped=0.0 2023-10-11 18:23:57,386 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=793151.3333333334, ans=0.0 2023-10-11 18:24:14,399 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 18:24:14,423 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=793244.6666666666, ans=0.07 2023-10-11 18:24:26,599 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=793291.3333333334, ans=0.125 2023-10-11 18:24:46,913 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=793384.6666666666, ans=0.125 2023-10-11 18:25:01,450 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.42 vs. limit=12.0 2023-10-11 18:25:04,438 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=793478.0, ans=0.0 2023-10-11 18:25:15,930 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=793524.6666666666, ans=0.125 2023-10-11 18:25:20,336 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=793524.6666666666, ans=0.0 2023-10-11 18:25:25,011 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=793524.6666666666, ans=0.125 2023-10-11 18:25:30,635 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=793571.3333333334, ans=0.125 2023-10-11 18:25:43,148 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.388e+02 1.776e+02 1.920e+02 2.180e+02 2.938e+02, threshold=3.840e+02, percent-clipped=0.0 2023-10-11 18:25:44,564 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=793618.0, ans=0.125 2023-10-11 18:25:51,680 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.39 vs. 
limit=15.0 2023-10-11 18:26:16,416 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=793758.0, ans=0.0 2023-10-11 18:27:06,604 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=793944.6666666666, ans=0.0 2023-10-11 18:27:19,374 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=794038.0, ans=0.1 2023-10-11 18:27:36,070 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.362e+02 1.730e+02 2.013e+02 2.258e+02 3.283e+02, threshold=4.027e+02, percent-clipped=0.0 2023-10-11 18:27:41,337 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=794131.3333333334, ans=0.2 2023-10-11 18:27:42,345 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=794131.3333333334, ans=0.125 2023-10-11 18:28:19,252 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=794271.3333333334, ans=0.125 2023-10-11 18:28:21,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=794271.3333333334, ans=0.1 2023-10-11 18:29:06,460 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 18:29:27,696 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=794504.6666666666, ans=0.125 2023-10-11 18:29:29,981 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.41 vs. limit=15.0 2023-10-11 18:29:36,572 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=794551.3333333334, ans=0.125 2023-10-11 18:29:37,090 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.406e+02 1.694e+02 1.898e+02 2.109e+02 3.493e+02, threshold=3.797e+02, percent-clipped=0.0 2023-10-11 18:29:53,334 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.25 vs. 
limit=12.0 2023-10-11 18:29:55,603 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 18:30:00,292 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=794644.6666666666, ans=0.1 2023-10-11 18:30:04,299 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=794691.3333333334, ans=0.125 2023-10-11 18:30:06,208 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=794691.3333333334, ans=0.0 2023-10-11 18:30:10,258 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=794691.3333333334, ans=0.125 2023-10-11 18:30:17,386 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=794738.0, ans=0.2 2023-10-11 18:30:28,356 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=794784.6666666666, ans=0.125 2023-10-11 18:30:28,383 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=794784.6666666666, ans=0.0 2023-10-11 18:30:47,024 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=794878.0, ans=0.0 2023-10-11 18:30:57,485 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=794878.0, ans=0.0 2023-10-11 18:31:03,358 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.83 vs. limit=6.0 2023-10-11 18:31:05,416 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.78 vs. limit=12.0 2023-10-11 18:31:07,091 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=794924.6666666666, ans=0.125 2023-10-11 18:31:16,096 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=794971.3333333334, ans=0.125 2023-10-11 18:31:17,950 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=794971.3333333334, ans=0.125 2023-10-11 18:31:23,225 INFO [train.py:1031] (0/4) Epoch 13, batch 6500, loss[loss=0.2006, simple_loss=0.2917, pruned_loss=0.05478, over 16878.00 frames. ], tot_loss[loss=0.2003, simple_loss=0.2892, pruned_loss=0.05572, over 31512159.21 frames. 
], batch size: 72, lr: 2.72e-03, grad_scale: 32.0 2023-10-11 18:31:30,468 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.734e+02 1.911e+02 2.094e+02 2.626e+02, threshold=3.822e+02, percent-clipped=0.0 2023-10-11 18:31:58,785 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=795111.3333333334, ans=0.125 2023-10-11 18:32:02,268 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=795111.3333333334, ans=0.0 2023-10-11 18:32:15,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=795158.0, ans=0.125 2023-10-11 18:32:22,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=795204.6666666666, ans=0.125 2023-10-11 18:32:23,898 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=795204.6666666666, ans=0.1 2023-10-11 18:33:22,588 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=795438.0, ans=0.125 2023-10-11 18:33:22,892 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.87 vs. limit=22.5 2023-10-11 18:33:33,392 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.734e+02 1.871e+02 2.130e+02 2.932e+02, threshold=3.742e+02, percent-clipped=0.0 2023-10-11 18:33:49,675 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=795578.0, ans=0.125 2023-10-11 18:33:58,010 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=795624.6666666666, ans=0.125 2023-10-11 18:34:02,160 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=795624.6666666666, ans=0.2 2023-10-11 18:34:10,349 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.35 vs. limit=15.0 2023-10-11 18:34:20,725 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=795718.0, ans=0.0 2023-10-11 18:34:24,798 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 18:34:32,760 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=795764.6666666666, ans=0.125 2023-10-11 18:34:52,411 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=795858.0, ans=0.125 2023-10-11 18:34:54,510 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.41 vs. 
limit=10.0 2023-10-11 18:34:59,704 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=795858.0, ans=0.0 2023-10-11 18:35:13,842 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=795951.3333333334, ans=0.0 2023-10-11 18:35:16,264 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=795951.3333333334, ans=0.0 2023-10-11 18:35:16,512 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.81 vs. limit=10.0 2023-10-11 18:35:21,571 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.684e+02 1.938e+02 2.297e+02 3.586e+02, threshold=3.875e+02, percent-clipped=0.0 2023-10-11 18:35:44,897 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.73 vs. limit=15.0 2023-10-11 18:36:03,234 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.92 vs. limit=15.0 2023-10-11 18:36:15,503 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=796184.6666666666, ans=0.5 2023-10-11 18:36:23,173 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 18:36:25,556 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=796231.3333333334, ans=0.1 2023-10-11 18:36:28,684 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.19 vs. limit=10.0 2023-10-11 18:36:42,301 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=796278.0, ans=0.0 2023-10-11 18:36:47,024 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=796278.0, ans=0.125 2023-10-11 18:36:54,346 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=796324.6666666666, ans=0.2 2023-10-11 18:37:27,446 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.218e+02 1.607e+02 1.843e+02 2.150e+02 3.395e+02, threshold=3.685e+02, percent-clipped=0.0 2023-10-11 18:37:52,599 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=796511.3333333334, ans=0.0 2023-10-11 18:38:11,079 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.35 vs. limit=6.0 2023-10-11 18:38:51,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=796744.6666666666, ans=0.0 2023-10-11 18:39:12,688 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.00 vs. 
limit=15.0 2023-10-11 18:39:20,390 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=796838.0, ans=0.1 2023-10-11 18:39:26,572 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=796884.6666666666, ans=0.0 2023-10-11 18:39:28,219 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.621e+02 1.793e+02 2.030e+02 2.953e+02, threshold=3.585e+02, percent-clipped=0.0 2023-10-11 18:39:34,056 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=796931.3333333334, ans=0.125 2023-10-11 18:39:39,906 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=796931.3333333334, ans=0.0 2023-10-11 18:39:46,480 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=796978.0, ans=0.125 2023-10-11 18:39:48,839 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=796978.0, ans=0.0 2023-10-11 18:39:54,335 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.39 vs. limit=15.0 2023-10-11 18:40:00,822 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=797024.6666666666, ans=0.1 2023-10-11 18:40:10,134 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.92 vs. limit=6.0 2023-10-11 18:40:14,384 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=797071.3333333334, ans=0.0 2023-10-11 18:40:17,611 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten.whitening_limit, batch_count=797118.0, ans=15.0 2023-10-11 18:40:22,048 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.00 vs. limit=15.0 2023-10-11 18:40:34,476 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=797164.6666666666, ans=0.1 2023-10-11 18:41:08,973 INFO [train.py:1031] (0/4) Epoch 13, batch 7000, loss[loss=0.2161, simple_loss=0.3017, pruned_loss=0.06523, over 16960.00 frames. ], tot_loss[loss=0.2004, simple_loss=0.2897, pruned_loss=0.0556, over 31824904.42 frames. 
], batch size: 123, lr: 2.72e-03, grad_scale: 32.0 2023-10-11 18:41:15,201 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.471e+02 1.765e+02 1.918e+02 2.132e+02 3.036e+02, threshold=3.836e+02, percent-clipped=0.0 2023-10-11 18:41:28,515 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=797398.0, ans=0.0 2023-10-11 18:41:33,498 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=797398.0, ans=0.1 2023-10-11 18:41:50,076 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=797491.3333333334, ans=0.125 2023-10-11 18:41:51,179 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=797491.3333333334, ans=0.0 2023-10-11 18:42:08,867 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=797584.6666666666, ans=0.0 2023-10-11 18:42:09,709 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=797584.6666666666, ans=0.0 2023-10-11 18:42:10,975 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.07 vs. limit=15.0 2023-10-11 18:42:12,810 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.25 vs. limit=12.0 2023-10-11 18:42:27,708 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=6.24 vs. limit=15.0 2023-10-11 18:42:33,012 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.45 vs. limit=15.0 2023-10-11 18:42:41,072 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=797724.6666666666, ans=0.125 2023-10-11 18:42:42,028 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=797724.6666666666, ans=0.125 2023-10-11 18:42:51,802 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-11 18:43:05,762 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.737e+02 1.824e+02 2.090e+02 3.041e+02, threshold=3.649e+02, percent-clipped=0.0 2023-10-11 18:43:07,213 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.86 vs. limit=15.0 2023-10-11 18:43:48,868 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.40 vs. 
limit=15.0 2023-10-11 18:44:14,345 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=798098.0, ans=0.1 2023-10-11 18:44:38,319 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=798238.0, ans=0.0 2023-10-11 18:44:59,303 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.325e+02 1.711e+02 1.880e+02 2.094e+02 3.305e+02, threshold=3.760e+02, percent-clipped=0.0 2023-10-11 18:45:00,022 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=798284.6666666666, ans=0.125 2023-10-11 18:45:10,736 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=798331.3333333334, ans=0.0 2023-10-11 18:45:20,011 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.38 vs. limit=22.5 2023-10-11 18:45:34,047 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.01 vs. limit=10.0 2023-10-11 18:45:35,925 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=798424.6666666666, ans=0.0 2023-10-11 18:45:44,416 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=798424.6666666666, ans=0.0 2023-10-11 18:45:56,565 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=798471.3333333334, ans=0.0 2023-10-11 18:45:56,575 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=798471.3333333334, ans=0.2 2023-10-11 18:46:23,448 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=798564.6666666666, ans=15.0 2023-10-11 18:46:32,009 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=798611.3333333334, ans=0.125 2023-10-11 18:46:49,549 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.51 vs. limit=15.0 2023-10-11 18:47:02,920 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.353e+02 1.718e+02 1.864e+02 2.045e+02 2.728e+02, threshold=3.727e+02, percent-clipped=0.0 2023-10-11 18:47:04,206 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=798751.3333333334, ans=0.125 2023-10-11 18:47:34,635 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=798891.3333333334, ans=0.07 2023-10-11 18:48:54,567 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=799218.0, ans=0.125 2023-10-11 18:48:56,625 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.324e+02 1.665e+02 1.833e+02 2.031e+02 2.802e+02, threshold=3.666e+02, percent-clipped=0.0 2023-10-11 18:48:57,209 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.35 vs. 
limit=15.0 2023-10-11 18:49:02,815 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=799264.6666666666, ans=0.125 2023-10-11 18:49:09,944 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=799311.3333333334, ans=0.2 2023-10-11 18:49:09,947 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=799311.3333333334, ans=0.125 2023-10-11 18:49:11,084 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.53 vs. limit=15.0 2023-10-11 18:49:17,985 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=799311.3333333334, ans=0.125 2023-10-11 18:49:40,588 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.48 vs. limit=10.0 2023-10-11 18:49:42,200 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=799451.3333333334, ans=0.1 2023-10-11 18:50:07,879 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=799544.6666666666, ans=0.125 2023-10-11 18:50:24,765 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=799591.3333333334, ans=0.125 2023-10-11 18:50:28,752 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=799638.0, ans=0.1 2023-10-11 18:50:34,753 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=799638.0, ans=0.0 2023-10-11 18:50:37,720 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=799638.0, ans=0.09899494936611666 2023-10-11 18:50:39,608 INFO [train.py:1031] (0/4) Epoch 13, batch 7500, loss[loss=0.193, simple_loss=0.2843, pruned_loss=0.05087, over 16879.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.2896, pruned_loss=0.05571, over 32016931.66 frames. ], batch size: 98, lr: 2.71e-03, grad_scale: 32.0 2023-10-11 18:50:45,736 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.456e+02 1.793e+02 1.960e+02 2.204e+02 2.928e+02, threshold=3.920e+02, percent-clipped=0.0 2023-10-11 18:50:46,008 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=799684.6666666666, ans=0.125 2023-10-11 18:51:04,817 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=799778.0, ans=0.125 2023-10-11 18:51:14,270 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=799824.6666666666, ans=0.125 2023-10-11 18:51:50,551 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=799964.6666666666, ans=0.125 2023-10-11 18:51:53,274 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.47 vs. 
limit=15.0 2023-10-11 18:52:04,681 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=800011.3333333334, ans=0.04949747468305833 2023-10-11 18:52:18,682 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=800104.6666666666, ans=0.125 2023-10-11 18:52:19,617 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=800104.6666666666, ans=0.2 2023-10-11 18:52:35,229 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.370e+02 1.689e+02 1.840e+02 2.124e+02 2.979e+02, threshold=3.680e+02, percent-clipped=0.0 2023-10-11 18:53:49,783 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=800431.3333333334, ans=0.125 2023-10-11 18:54:01,660 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=800478.0, ans=0.2 2023-10-11 18:54:06,832 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.43 vs. limit=12.0 2023-10-11 18:54:21,291 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=800571.3333333334, ans=0.0 2023-10-11 18:54:25,187 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.99 vs. limit=10.0 2023-10-11 18:54:28,782 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.87 vs. limit=15.0 2023-10-11 18:54:39,557 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.314e+02 1.724e+02 1.841e+02 2.058e+02 3.080e+02, threshold=3.682e+02, percent-clipped=0.0 2023-10-11 18:54:40,043 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=800618.0, ans=0.0 2023-10-11 18:54:49,386 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=800664.6666666666, ans=0.0 2023-10-11 18:55:06,052 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=800758.0, ans=0.125 2023-10-11 18:55:13,542 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.25 vs. 
limit=15.0 2023-10-11 18:55:14,860 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn1.whiten.whitening_limit, batch_count=800758.0, ans=22.5 2023-10-11 18:55:23,512 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 18:55:27,989 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=800851.3333333334, ans=0.125 2023-10-11 18:55:33,344 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 18:55:39,258 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=800898.0, ans=0.07 2023-10-11 18:55:42,358 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.32 vs. limit=12.0 2023-10-11 18:55:52,686 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=800944.6666666666, ans=0.125 2023-10-11 18:55:55,317 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=800944.6666666666, ans=0.125 2023-10-11 18:55:57,190 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=800991.3333333334, ans=0.125 2023-10-11 18:56:23,252 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=801084.6666666666, ans=0.125 2023-10-11 18:56:27,639 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.691e+02 1.939e+02 2.196e+02 3.394e+02, threshold=3.878e+02, percent-clipped=0.0 2023-10-11 18:56:33,205 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=801131.3333333334, ans=0.1 2023-10-11 18:56:47,194 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=801178.0, ans=0.2 2023-10-11 18:56:49,800 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=801178.0, ans=0.0 2023-10-11 18:57:05,970 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=801224.6666666666, ans=0.125 2023-10-11 18:57:06,090 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=801224.6666666666, ans=0.2 2023-10-11 18:57:07,482 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=801224.6666666666, ans=0.125 2023-10-11 18:57:10,416 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.85 vs. limit=10.0 2023-10-11 18:57:14,372 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.11 vs. 
limit=15.0 2023-10-11 18:57:19,220 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=801271.3333333334, ans=0.125 2023-10-11 18:57:27,649 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=801318.0, ans=0.1 2023-10-11 18:57:34,052 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=801364.6666666666, ans=0.1 2023-10-11 18:57:41,970 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=801364.6666666666, ans=0.95 2023-10-11 18:57:42,896 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=801411.3333333334, ans=0.125 2023-10-11 18:57:46,498 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=801411.3333333334, ans=0.1 2023-10-11 18:57:52,935 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=801411.3333333334, ans=0.0 2023-10-11 18:57:56,761 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 18:58:12,386 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=801504.6666666666, ans=0.04949747468305833 2023-10-11 18:58:22,193 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=801551.3333333334, ans=10.0 2023-10-11 18:58:26,598 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=801551.3333333334, ans=0.2 2023-10-11 18:58:26,608 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=801551.3333333334, ans=0.1 2023-10-11 18:58:27,064 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.351e+02 1.689e+02 1.833e+02 2.023e+02 2.910e+02, threshold=3.665e+02, percent-clipped=0.0 2023-10-11 18:58:29,851 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=801598.0, ans=0.125 2023-10-11 18:58:29,903 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=801598.0, ans=10.0 2023-10-11 18:59:11,145 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.64 vs. 
limit=6.0 2023-10-11 18:59:15,069 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=801738.0, ans=0.09899494936611666 2023-10-11 18:59:27,277 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=801784.6666666666, ans=15.0 2023-10-11 18:59:41,384 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=801878.0, ans=0.0 2023-10-11 18:59:56,672 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=801924.6666666666, ans=0.0 2023-10-11 19:00:01,155 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.00 vs. limit=22.5 2023-10-11 19:00:12,758 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=801971.3333333334, ans=22.5 2023-10-11 19:00:14,144 INFO [train.py:1031] (0/4) Epoch 13, batch 8000, loss[loss=0.1847, simple_loss=0.278, pruned_loss=0.04571, over 16663.00 frames. ], tot_loss[loss=0.1994, simple_loss=0.2888, pruned_loss=0.05494, over 32204642.61 frames. ], batch size: 220, lr: 2.71e-03, grad_scale: 32.0 2023-10-11 19:00:21,526 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.297e+02 1.650e+02 1.768e+02 2.048e+02 3.506e+02, threshold=3.537e+02, percent-clipped=0.0 2023-10-11 19:00:28,740 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=802064.6666666666, ans=0.0 2023-10-11 19:00:31,809 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.64 vs. limit=15.0 2023-10-11 19:00:38,911 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=802111.3333333334, ans=0.125 2023-10-11 19:00:46,730 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=802158.0, ans=0.125 2023-10-11 19:00:46,960 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.30 vs. 
limit=15.0 2023-10-11 19:01:14,961 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=802251.3333333334, ans=0.2 2023-10-11 19:01:20,603 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=802298.0, ans=0.125 2023-10-11 19:01:31,136 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=802344.6666666666, ans=0.0 2023-10-11 19:01:32,081 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=802344.6666666666, ans=0.125 2023-10-11 19:01:53,654 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=802438.0, ans=0.1 2023-10-11 19:02:05,996 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.405e+02 1.652e+02 1.803e+02 2.042e+02 2.927e+02, threshold=3.605e+02, percent-clipped=0.0 2023-10-11 19:02:08,712 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.00 vs. limit=22.5 2023-10-11 19:02:23,965 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.30 vs. limit=22.5 2023-10-11 19:02:36,938 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=802624.6666666666, ans=0.05 2023-10-11 19:03:16,686 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=802764.6666666666, ans=0.0 2023-10-11 19:03:37,600 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=802811.3333333334, ans=0.0 2023-10-11 19:04:09,894 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.27 vs. limit=6.0 2023-10-11 19:04:13,668 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.405e+02 1.620e+02 1.804e+02 2.092e+02 2.694e+02, threshold=3.608e+02, percent-clipped=0.0 2023-10-11 19:04:14,860 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=802951.3333333334, ans=0.125 2023-10-11 19:04:28,868 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.60 vs. limit=15.0 2023-10-11 19:04:47,308 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.90 vs. 
limit=10.0 2023-10-11 19:04:52,747 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=803138.0, ans=0.0 2023-10-11 19:05:05,365 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=803184.6666666666, ans=0.05 2023-10-11 19:05:10,149 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=803184.6666666666, ans=0.1 2023-10-11 19:05:15,643 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=803231.3333333334, ans=0.125 2023-10-11 19:05:28,002 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=803278.0, ans=0.125 2023-10-11 19:05:29,419 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=803278.0, ans=0.04949747468305833 2023-10-11 19:05:42,418 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_ff2.min_abs, batch_count=803371.3333333334, ans=0.1 2023-10-11 19:05:43,505 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=803371.3333333334, ans=0.125 2023-10-11 19:05:55,153 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.82 vs. limit=15.0 2023-10-11 19:05:57,453 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.68 vs. limit=5.0 2023-10-11 19:06:03,653 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=803418.0, ans=0.1 2023-10-11 19:06:04,336 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.353e+02 1.709e+02 1.898e+02 2.148e+02 2.911e+02, threshold=3.796e+02, percent-clipped=0.0 2023-10-11 19:06:09,718 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=803464.6666666666, ans=0.125 2023-10-11 19:06:28,659 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=803511.3333333334, ans=0.125 2023-10-11 19:06:30,436 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=803558.0, ans=0.125 2023-10-11 19:06:32,463 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=803558.0, ans=0.125 2023-10-11 19:06:33,647 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.28 vs. 
limit=22.5 2023-10-11 19:06:35,378 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=803558.0, ans=0.125 2023-10-11 19:06:43,485 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=803604.6666666666, ans=0.125 2023-10-11 19:06:48,506 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=803604.6666666666, ans=0.2 2023-10-11 19:06:51,130 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=803604.6666666666, ans=0.125 2023-10-11 19:07:19,888 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=803744.6666666666, ans=0.125 2023-10-11 19:07:40,102 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=803838.0, ans=0.125 2023-10-11 19:07:45,795 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=803838.0, ans=0.1 2023-10-11 19:08:00,153 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.417e+02 1.699e+02 1.900e+02 2.073e+02 4.054e+02, threshold=3.800e+02, percent-clipped=1.0 2023-10-11 19:08:17,320 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=803978.0, ans=0.2 2023-10-11 19:08:21,590 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 19:08:25,764 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=804024.6666666666, ans=0.125 2023-10-11 19:08:25,816 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=804024.6666666666, ans=0.2 2023-10-11 19:09:07,835 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=804164.6666666666, ans=0.125 2023-10-11 19:09:19,659 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 19:09:36,463 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=804258.0, ans=0.0 2023-10-11 19:09:50,944 INFO [train.py:1031] (0/4) Epoch 13, batch 8500, loss[loss=0.1952, simple_loss=0.288, pruned_loss=0.05123, over 16906.00 frames. ], tot_loss[loss=0.1993, simple_loss=0.2889, pruned_loss=0.05483, over 32317627.09 frames. 
], batch size: 110, lr: 2.70e-03, grad_scale: 32.0 2023-10-11 19:09:59,764 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.403e+02 1.731e+02 1.912e+02 2.111e+02 2.937e+02, threshold=3.824e+02, percent-clipped=0.0 2023-10-11 19:10:20,153 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 19:10:42,637 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=804538.0, ans=0.0 2023-10-11 19:11:27,631 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=804724.6666666666, ans=0.125 2023-10-11 19:11:34,231 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=804771.3333333334, ans=0.0 2023-10-11 19:11:57,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=804818.0, ans=0.0 2023-10-11 19:12:02,165 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.478e+02 1.776e+02 2.022e+02 2.289e+02 3.942e+02, threshold=4.044e+02, percent-clipped=1.0 2023-10-11 19:12:10,756 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=804864.6666666666, ans=0.125 2023-10-11 19:12:14,029 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=804864.6666666666, ans=0.035 2023-10-11 19:12:17,731 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=804911.3333333334, ans=0.125 2023-10-11 19:12:22,707 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=804911.3333333334, ans=0.125 2023-10-11 19:12:41,247 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.07 vs. limit=22.5 2023-10-11 19:12:46,266 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=805004.6666666666, ans=0.125 2023-10-11 19:12:57,785 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=805051.3333333334, ans=0.07 2023-10-11 19:12:58,850 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=805051.3333333334, ans=0.125 2023-10-11 19:13:04,435 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=805098.0, ans=0.0 2023-10-11 19:13:18,730 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.54 vs. limit=22.5 2023-10-11 19:13:20,919 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=805144.6666666666, ans=0.125 2023-10-11 19:13:26,313 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.37 vs. 
limit=15.0 2023-10-11 19:13:31,920 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=805191.3333333334, ans=0.125 2023-10-11 19:14:03,905 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=805284.6666666666, ans=0.0 2023-10-11 19:14:05,486 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.308e+02 1.566e+02 1.709e+02 1.961e+02 2.849e+02, threshold=3.417e+02, percent-clipped=0.0 2023-10-11 19:14:07,012 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=805331.3333333334, ans=0.0 2023-10-11 19:14:14,900 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=805331.3333333334, ans=0.125 2023-10-11 19:14:36,305 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=805424.6666666666, ans=0.125 2023-10-11 19:14:41,087 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=805424.6666666666, ans=0.125 2023-10-11 19:14:46,646 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=805471.3333333334, ans=0.125 2023-10-11 19:15:06,189 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=805518.0, ans=0.1 2023-10-11 19:15:15,646 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=805564.6666666666, ans=0.05 2023-10-11 19:15:28,436 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=805611.3333333334, ans=0.1 2023-10-11 19:15:29,968 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=805611.3333333334, ans=0.125 2023-10-11 19:15:32,948 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=805611.3333333334, ans=0.2 2023-10-11 19:15:58,275 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.08 vs. limit=10.0 2023-10-11 19:16:08,880 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.363e+02 1.635e+02 1.806e+02 2.047e+02 3.085e+02, threshold=3.611e+02, percent-clipped=0.0 2023-10-11 19:16:42,623 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=805938.0, ans=0.125 2023-10-11 19:16:52,495 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=805984.6666666666, ans=0.125 2023-10-11 19:16:57,580 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.92 vs. limit=15.0 2023-10-11 19:17:11,223 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.41 vs. 
limit=6.0 2023-10-11 19:17:18,059 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-10-11 19:17:43,701 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=806171.3333333334, ans=10.0 2023-10-11 19:17:49,743 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=806218.0, ans=0.1 2023-10-11 19:17:49,872 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=806218.0, ans=0.0 2023-10-11 19:17:57,527 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.389e+02 1.688e+02 1.885e+02 2.093e+02 3.354e+02, threshold=3.771e+02, percent-clipped=0.0 2023-10-11 19:18:00,107 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.03 vs. limit=22.5 2023-10-11 19:18:11,090 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=806311.3333333334, ans=0.125 2023-10-11 19:18:27,035 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.07 vs. limit=15.0 2023-10-11 19:18:30,776 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=806404.6666666666, ans=0.2 2023-10-11 19:18:36,045 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=806404.6666666666, ans=0.125 2023-10-11 19:18:36,425 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.25 vs. limit=15.0 2023-10-11 19:18:42,030 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.10 vs. limit=15.0 2023-10-11 19:18:44,400 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=806451.3333333334, ans=0.07 2023-10-11 19:18:44,424 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=806451.3333333334, ans=0.0 2023-10-11 19:18:45,567 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.29 vs. limit=12.0 2023-10-11 19:19:01,022 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 19:19:02,063 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=806498.0, ans=0.0 2023-10-11 19:19:31,977 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.43 vs. limit=15.0 2023-10-11 19:19:39,337 INFO [train.py:1031] (0/4) Epoch 13, batch 9000, loss[loss=0.2209, simple_loss=0.3044, pruned_loss=0.06865, over 16531.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.2884, pruned_loss=0.05464, over 32435800.00 frames. 
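
Each optim.py:471 record above gives a five-number summary (minimum, first quartile, median, third quartile, maximum) of recent gradient norms. In every such record in this excerpt the printed threshold equals Clipping_scale times the median, for example 2.0 * 1.900e+02 = 3.800e+02 in the first such record of the excerpt, with the others agreeing up to rounding. The following is a hedged sketch of that style of adaptive clipping; it is not icefall's optim.py, and the sliding-window size is an assumption.

import torch

def clip_grad_by_median(params, norm_history, clipping_scale=2.0):
    """Clip the global grad norm at clipping_scale times the median of
    recently observed norms, printing quartile statistics as in the log."""
    grads = [p.grad for p in params if p.grad is not None]
    total = torch.norm(torch.stack([g.detach().norm() for g in grads]))
    norm_history.append(total.item())
    hist = torch.tensor(norm_history[-128:])  # sliding window (assumed size)
    q = [torch.quantile(hist, p).item() for p in (0.0, 0.25, 0.5, 0.75, 1.0)]
    threshold = clipping_scale * q[2]
    if total.item() > threshold:
        # Rescale all gradients so the global norm equals the threshold.
        for g in grads:
            g.mul_(threshold / (total.item() + 1e-20))
    print("Clipping_scale=%s, grad-norm quartiles " % clipping_scale
          + " ".join("%.3e" % v for v in q) + ", threshold=%.3e" % threshold)

# Usage with any model, keeping `history` alive across steps:
#   history = []
#   clip_grad_by_median(model.parameters(), history, clipping_scale=2.0)
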
], batch size: 56, lr: 2.70e-03, grad_scale: 32.0 2023-10-11 19:19:47,909 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=806684.6666666666, ans=0.1 2023-10-11 19:19:48,475 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.448e+02 1.703e+02 1.868e+02 2.088e+02 3.549e+02, threshold=3.736e+02, percent-clipped=0.0 2023-10-11 19:20:05,757 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=806778.0, ans=0.0 2023-10-11 19:20:07,597 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=806778.0, ans=0.125 2023-10-11 19:20:24,867 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=806871.3333333334, ans=0.0 2023-10-11 19:20:44,325 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=806918.0, ans=0.0 2023-10-11 19:20:56,232 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.68 vs. limit=15.0 2023-10-11 19:21:00,731 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.25 vs. limit=15.0 2023-10-11 19:21:26,646 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.39 vs. limit=15.0 2023-10-11 19:21:38,332 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.326e+02 1.616e+02 1.805e+02 2.034e+02 2.671e+02, threshold=3.610e+02, percent-clipped=0.0 2023-10-11 19:21:41,797 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=4.08 vs. limit=12.0 2023-10-11 19:21:45,466 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.14 vs. limit=10.0 2023-10-11 19:21:52,618 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=807244.6666666666, ans=0.125 2023-10-11 19:21:53,132 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.66 vs. limit=5.0 2023-10-11 19:22:18,559 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=807338.0, ans=0.0 2023-10-11 19:22:23,288 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=807338.0, ans=0.125 2023-10-11 19:22:24,214 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=807384.6666666666, ans=0.125 2023-10-11 19:22:26,136 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=807384.6666666666, ans=0.125 2023-10-11 19:22:41,588 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.05 vs. limit=12.0 2023-10-11 19:22:43,282 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.26 vs. 
limit=15.0 2023-10-11 19:22:59,819 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.05 vs. limit=12.0 2023-10-11 19:23:13,167 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=807571.3333333334, ans=0.125 2023-10-11 19:23:19,493 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=807618.0, ans=0.5 2023-10-11 19:23:27,012 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.522e+02 1.775e+02 1.946e+02 2.169e+02 3.183e+02, threshold=3.891e+02, percent-clipped=0.0 2023-10-11 19:23:55,999 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.02 vs. limit=15.0 2023-10-11 19:24:07,052 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.03 vs. limit=15.0 2023-10-11 19:24:15,817 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=807851.3333333334, ans=0.1 2023-10-11 19:24:19,606 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.95 vs. limit=15.0 2023-10-11 19:24:33,508 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=807944.6666666666, ans=0.125 2023-10-11 19:24:41,050 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=807944.6666666666, ans=0.0 2023-10-11 19:24:47,422 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.04 vs. limit=12.0 2023-10-11 19:24:53,288 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.32 vs. limit=10.0 2023-10-11 19:25:01,288 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=808038.0, ans=0.1 2023-10-11 19:25:06,386 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=808084.6666666666, ans=0.07 2023-10-11 19:25:07,555 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.98 vs. limit=15.0 2023-10-11 19:25:13,907 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.746e+02 1.947e+02 2.206e+02 3.240e+02, threshold=3.894e+02, percent-clipped=0.0 2023-10-11 19:25:21,873 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.70 vs. limit=15.0 2023-10-11 19:25:22,808 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=808131.3333333334, ans=0.125 2023-10-11 19:25:25,168 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.03 vs. 
limit=15.0 2023-10-11 19:25:52,501 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=808271.3333333334, ans=0.125 2023-10-11 19:25:54,582 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=808271.3333333334, ans=0.125 2023-10-11 19:25:57,878 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.92 vs. limit=15.0 2023-10-11 19:26:34,948 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.19 vs. limit=12.0 2023-10-11 19:26:36,238 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=808411.3333333334, ans=0.125 2023-10-11 19:26:40,285 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.69 vs. limit=15.0 2023-10-11 19:26:41,691 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=808411.3333333334, ans=0.125 2023-10-11 19:27:04,317 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=808504.6666666666, ans=0.0 2023-10-11 19:27:14,886 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.558e+02 1.891e+02 2.171e+02 2.514e+02 3.430e+02, threshold=4.342e+02, percent-clipped=0.0 2023-10-11 19:27:35,967 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=808644.6666666666, ans=0.0 2023-10-11 19:27:48,373 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=808691.3333333334, ans=0.0 2023-10-11 19:27:49,750 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=808691.3333333334, ans=0.125 2023-10-11 19:27:52,488 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=808691.3333333334, ans=0.125 2023-10-11 19:28:07,413 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=808784.6666666666, ans=0.125 2023-10-11 19:28:07,609 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.83 vs. 
limit=15.0 2023-10-11 19:28:16,390 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=808831.3333333334, ans=0.1 2023-10-11 19:28:18,976 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=808831.3333333334, ans=0.125 2023-10-11 19:28:30,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=808878.0, ans=0.2 2023-10-11 19:28:47,308 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=808924.6666666666, ans=0.0 2023-10-11 19:28:49,868 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=808924.6666666666, ans=0.125 2023-10-11 19:29:00,155 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=808971.3333333334, ans=0.04949747468305833 2023-10-11 19:29:06,676 INFO [train.py:1031] (0/4) Epoch 13, batch 9500, loss[loss=0.1955, simple_loss=0.29, pruned_loss=0.0505, over 16964.00 frames. ], tot_loss[loss=0.1993, simple_loss=0.289, pruned_loss=0.05485, over 32505399.76 frames. ], batch size: 77, lr: 2.70e-03, grad_scale: 32.0 2023-10-11 19:29:15,208 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.709e+02 1.826e+02 1.980e+02 2.687e+02, threshold=3.652e+02, percent-clipped=0.0 2023-10-11 19:29:24,486 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. limit=6.0 2023-10-11 19:29:31,496 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=809111.3333333334, ans=0.1 2023-10-11 19:29:34,259 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=809111.3333333334, ans=0.125 2023-10-11 19:30:02,403 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=809251.3333333334, ans=0.0 2023-10-11 19:30:16,598 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=809298.0, ans=0.125 2023-10-11 19:30:29,113 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=809344.6666666666, ans=0.125 2023-10-11 19:30:31,174 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=809344.6666666666, ans=0.0 2023-10-11 19:30:33,119 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.89 vs. 
limit=12.0 2023-10-11 19:30:46,077 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=809391.3333333334, ans=0.125 2023-10-11 19:30:55,035 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=809438.0, ans=0.0 2023-10-11 19:31:08,461 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.359e+02 1.694e+02 1.835e+02 2.088e+02 3.171e+02, threshold=3.669e+02, percent-clipped=0.0 2023-10-11 19:31:20,512 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=809531.3333333334, ans=0.0 2023-10-11 19:31:32,757 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=809624.6666666666, ans=0.09899494936611666 2023-10-11 19:31:45,709 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 19:32:00,771 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=809718.0, ans=0.1 2023-10-11 19:32:05,680 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=809718.0, ans=0.1 2023-10-11 19:32:12,975 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.81 vs. limit=15.0 2023-10-11 19:32:20,509 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=809811.3333333334, ans=0.125 2023-10-11 19:32:36,606 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=809858.0, ans=0.0 2023-10-11 19:32:56,518 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 19:33:03,148 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.719e+02 1.977e+02 2.268e+02 3.748e+02, threshold=3.954e+02, percent-clipped=1.0 2023-10-11 19:33:06,869 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 19:33:10,745 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=809998.0, ans=0.125 2023-10-11 19:33:22,710 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=810044.6666666666, ans=0.125 2023-10-11 19:33:24,876 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=810044.6666666666, ans=0.0 2023-10-11 19:33:49,750 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.94 vs. 
limit=22.5 2023-10-11 19:34:00,485 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=810231.3333333334, ans=0.125 2023-10-11 19:34:03,358 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=810231.3333333334, ans=0.125 2023-10-11 19:34:12,769 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=810278.0, ans=0.125 2023-10-11 19:34:14,107 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=810278.0, ans=0.0 2023-10-11 19:34:32,076 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=810324.6666666666, ans=0.125 2023-10-11 19:34:54,802 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.763e+02 1.991e+02 2.263e+02 3.655e+02, threshold=3.981e+02, percent-clipped=0.0 2023-10-11 19:34:58,242 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=810464.6666666666, ans=0.125 2023-10-11 19:35:18,495 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 19:35:24,971 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=810558.0, ans=0.2 2023-10-11 19:35:50,164 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=810651.3333333334, ans=0.125 2023-10-11 19:35:55,777 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=810698.0, ans=0.0 2023-10-11 19:36:05,303 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.19 vs. limit=12.0 2023-10-11 19:36:07,014 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=810744.6666666666, ans=0.125 2023-10-11 19:36:09,955 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=810744.6666666666, ans=0.07 2023-10-11 19:36:11,515 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=810744.6666666666, ans=0.125 2023-10-11 19:36:16,149 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.27 vs. 
limit=15.0 2023-10-11 19:36:47,871 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.678e+02 1.805e+02 1.974e+02 2.361e+02, threshold=3.609e+02, percent-clipped=0.0 2023-10-11 19:37:02,002 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=810978.0, ans=0.1 2023-10-11 19:37:14,791 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=811024.6666666666, ans=0.0 2023-10-11 19:37:44,507 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=811118.0, ans=0.2 2023-10-11 19:37:54,809 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=811164.6666666666, ans=0.125 2023-10-11 19:38:05,905 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.47 vs. limit=22.5 2023-10-11 19:38:10,440 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=811258.0, ans=0.0 2023-10-11 19:38:13,884 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=811258.0, ans=0.0 2023-10-11 19:38:13,894 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=811258.0, ans=0.5 2023-10-11 19:38:14,793 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=811258.0, ans=0.125 2023-10-11 19:38:16,911 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.93 vs. limit=15.0 2023-10-11 19:38:23,898 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=811304.6666666666, ans=15.0 2023-10-11 19:38:28,959 INFO [train.py:1031] (0/4) Epoch 13, batch 10000, loss[loss=0.1913, simple_loss=0.2893, pruned_loss=0.0467, over 16956.00 frames. ], tot_loss[loss=0.1988, simple_loss=0.2882, pruned_loss=0.05467, over 32544104.37 frames. ], batch size: 104, lr: 2.69e-03, grad_scale: 32.0 2023-10-11 19:38:31,282 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.50 vs. limit=15.0 2023-10-11 19:38:31,315 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.58 vs. limit=15.0 2023-10-11 19:38:37,989 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.370e+02 1.719e+02 1.922e+02 2.121e+02 2.805e+02, threshold=3.844e+02, percent-clipped=0.0 2023-10-11 19:38:38,345 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=811398.0, ans=0.0 2023-10-11 19:38:41,264 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=5.61 vs. 
limit=15.0 2023-10-11 19:38:50,024 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=811444.6666666666, ans=0.07 2023-10-11 19:38:53,912 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=811444.6666666666, ans=0.1 2023-10-11 19:38:58,168 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=811444.6666666666, ans=0.0 2023-10-11 19:39:01,304 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=811491.3333333334, ans=10.0 2023-10-11 19:39:17,749 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=811538.0, ans=0.0 2023-10-11 19:39:24,821 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=811584.6666666666, ans=0.07 2023-10-11 19:39:30,354 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=811584.6666666666, ans=0.05 2023-10-11 19:40:00,647 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=811724.6666666666, ans=0.125 2023-10-11 19:40:00,694 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=811724.6666666666, ans=0.125 2023-10-11 19:40:05,814 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=811724.6666666666, ans=0.2 2023-10-11 19:40:08,024 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=811771.3333333334, ans=0.05 2023-10-11 19:40:26,602 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=811818.0, ans=0.0 2023-10-11 19:40:28,258 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.88 vs. limit=15.0 2023-10-11 19:40:29,962 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.354e+02 1.678e+02 1.841e+02 2.048e+02 3.171e+02, threshold=3.681e+02, percent-clipped=0.0 2023-10-11 19:40:31,856 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.94 vs. 
limit=15.0 2023-10-11 19:40:39,362 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=811864.6666666666, ans=0.0 2023-10-11 19:40:44,207 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=811911.3333333334, ans=0.0 2023-10-11 19:40:50,189 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=811911.3333333334, ans=0.1 2023-10-11 19:40:58,096 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=811958.0, ans=0.125 2023-10-11 19:40:58,882 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=811958.0, ans=0.125 2023-10-11 19:41:16,802 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.24 vs. limit=15.0 2023-10-11 19:41:17,454 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=812051.3333333334, ans=0.0 2023-10-11 19:41:33,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=812098.0, ans=0.0 2023-10-11 19:41:42,597 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=812144.6666666666, ans=0.0 2023-10-11 19:41:44,543 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 19:41:58,758 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.05 vs. 
limit=10.0 2023-10-11 19:42:06,002 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=812238.0, ans=0.2 2023-10-11 19:42:12,482 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=812284.6666666666, ans=0.0 2023-10-11 19:42:27,079 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.334e+02 1.714e+02 1.932e+02 2.217e+02 2.985e+02, threshold=3.864e+02, percent-clipped=0.0 2023-10-11 19:42:38,350 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=812331.3333333334, ans=0.2 2023-10-11 19:42:48,534 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=812378.0, ans=0.0 2023-10-11 19:43:10,337 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=812471.3333333334, ans=0.125 2023-10-11 19:43:27,162 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=812564.6666666666, ans=0.125 2023-10-11 19:43:52,680 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=812658.0, ans=0.125 2023-10-11 19:43:54,656 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=812658.0, ans=0.125 2023-10-11 19:44:23,531 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.672e+02 1.839e+02 2.092e+02 3.152e+02, threshold=3.678e+02, percent-clipped=0.0 2023-10-11 19:44:47,989 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=812891.3333333334, ans=0.1 2023-10-11 19:45:06,676 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.39 vs. limit=15.0 2023-10-11 19:45:24,286 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=813031.3333333334, ans=0.125 2023-10-11 19:45:40,788 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=813078.0, ans=0.125 2023-10-11 19:46:08,739 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.20 vs. limit=15.0 2023-10-11 19:46:11,305 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 19:46:12,431 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.64 vs. 
limit=15.0 2023-10-11 19:46:19,787 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.363e+02 1.638e+02 1.760e+02 2.053e+02 3.312e+02, threshold=3.521e+02, percent-clipped=0.0 2023-10-11 19:46:30,690 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=813311.3333333334, ans=0.1 2023-10-11 19:46:31,568 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=813311.3333333334, ans=0.1 2023-10-11 19:46:48,964 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=813358.0, ans=0.0 2023-10-11 19:47:04,808 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=813404.6666666666, ans=0.125 2023-10-11 19:47:17,157 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.89 vs. limit=15.0 2023-10-11 19:47:20,511 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=813498.0, ans=0.125 2023-10-11 19:47:41,106 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=813591.3333333334, ans=0.0 2023-10-11 19:47:52,825 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=813638.0, ans=0.125 2023-10-11 19:47:57,722 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=813638.0, ans=0.1 2023-10-11 19:48:01,151 INFO [train.py:1031] (0/4) Epoch 13, batch 10500, loss[loss=0.1957, simple_loss=0.2875, pruned_loss=0.052, over 15690.00 frames. ], tot_loss[loss=0.1991, simple_loss=0.2887, pruned_loss=0.05478, over 32603433.63 frames. ], batch size: 35, lr: 2.69e-03, grad_scale: 32.0 2023-10-11 19:48:05,685 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.93 vs. 
limit=15.0 2023-10-11 19:48:11,508 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.723e+02 1.882e+02 2.233e+02 3.328e+02, threshold=3.764e+02, percent-clipped=0.0 2023-10-11 19:48:12,766 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=813731.3333333334, ans=0.0 2023-10-11 19:48:13,820 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 19:48:36,258 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=813824.6666666666, ans=0.125 2023-10-11 19:48:47,785 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=813871.3333333334, ans=0.1 2023-10-11 19:48:54,736 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=813871.3333333334, ans=0.125 2023-10-11 19:48:55,827 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=813918.0, ans=0.125 2023-10-11 19:49:48,875 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=814058.0, ans=0.1 2023-10-11 19:49:53,748 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.03 vs. limit=22.5 2023-10-11 19:50:12,165 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=814151.3333333334, ans=0.2 2023-10-11 19:50:16,045 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.703e+02 1.945e+02 2.213e+02 3.175e+02, threshold=3.890e+02, percent-clipped=0.0 2023-10-11 19:50:16,339 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=814198.0, ans=0.025 2023-10-11 19:50:27,796 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=814244.6666666666, ans=0.125 2023-10-11 19:50:29,932 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 19:50:40,243 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.86 vs. 
limit=15.0 2023-10-11 19:50:47,546 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=814291.3333333334, ans=0.2 2023-10-11 19:51:08,887 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=814384.6666666666, ans=0.125 2023-10-11 19:51:10,855 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=814431.3333333334, ans=0.125 2023-10-11 19:51:12,299 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=814431.3333333334, ans=0.0 2023-10-11 19:51:32,523 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=814478.0, ans=0.0 2023-10-11 19:51:45,308 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=814524.6666666666, ans=0.125 2023-10-11 19:51:51,828 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=814571.3333333334, ans=0.125 2023-10-11 19:52:02,848 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=814618.0, ans=0.125 2023-10-11 19:52:12,301 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.288e+02 1.712e+02 1.994e+02 2.267e+02 2.917e+02, threshold=3.989e+02, percent-clipped=0.0 2023-10-11 19:52:18,625 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=814664.6666666666, ans=0.0 2023-10-11 19:52:22,718 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=814664.6666666666, ans=0.1 2023-10-11 19:52:27,757 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=814711.3333333334, ans=0.0 2023-10-11 19:52:59,081 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.whiten.whitening_limit, batch_count=814851.3333333334, ans=12.0 2023-10-11 19:53:03,816 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.00 vs. 
limit=15.0 2023-10-11 19:53:06,758 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=814851.3333333334, ans=0.1 2023-10-11 19:53:06,762 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=814851.3333333334, ans=0.1 2023-10-11 19:53:18,326 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=814898.0, ans=0.125 2023-10-11 19:53:28,906 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=814944.6666666666, ans=0.0 2023-10-11 19:53:40,083 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=814991.3333333334, ans=0.0 2023-10-11 19:53:41,062 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=814991.3333333334, ans=0.125 2023-10-11 19:54:07,879 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.801e+02 2.122e+02 2.356e+02 4.342e+02, threshold=4.243e+02, percent-clipped=1.0 2023-10-11 19:54:38,354 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=815224.6666666666, ans=0.1 2023-10-11 19:54:47,785 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=815271.3333333334, ans=0.2 2023-10-11 19:54:57,555 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.26 vs. limit=6.0 2023-10-11 19:55:02,991 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=815364.6666666666, ans=0.125 2023-10-11 19:55:06,116 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=815364.6666666666, ans=0.2 2023-10-11 19:55:14,179 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.52 vs. 
limit=6.0 2023-10-11 19:55:15,018 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=815411.3333333334, ans=0.125 2023-10-11 19:55:16,782 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=815411.3333333334, ans=0.07 2023-10-11 19:56:03,819 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.311e+02 1.624e+02 1.779e+02 1.972e+02 2.884e+02, threshold=3.557e+02, percent-clipped=0.0 2023-10-11 19:56:08,214 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=815598.0, ans=0.125 2023-10-11 19:56:08,310 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=815598.0, ans=0.125 2023-10-11 19:56:46,927 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=815738.0, ans=0.125 2023-10-11 19:56:54,705 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=815784.6666666666, ans=0.0 2023-10-11 19:57:45,403 INFO [train.py:1031] (0/4) Epoch 13, batch 11000, loss[loss=0.1959, simple_loss=0.2944, pruned_loss=0.04869, over 16592.00 frames. ], tot_loss[loss=0.1992, simple_loss=0.2888, pruned_loss=0.05477, over 32659902.56 frames. ], batch size: 241, lr: 2.68e-03, grad_scale: 32.0 2023-10-11 19:57:49,894 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.63 vs. limit=22.5 2023-10-11 19:57:56,154 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.753e+02 1.916e+02 2.204e+02 3.323e+02, threshold=3.832e+02, percent-clipped=0.0 2023-10-11 19:58:09,959 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=816111.3333333334, ans=0.125 2023-10-11 19:58:18,049 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 19:58:37,901 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=816204.6666666666, ans=0.1 2023-10-11 19:58:45,319 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=816251.3333333334, ans=0.2 2023-10-11 19:58:46,471 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=816251.3333333334, ans=0.125 2023-10-11 19:58:48,801 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.29 vs. limit=15.0 2023-10-11 19:59:27,370 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.59 vs. 
limit=12.0 2023-10-11 19:59:29,451 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=816438.0, ans=0.2 2023-10-11 19:59:50,435 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=816484.6666666666, ans=0.0 2023-10-11 19:59:54,280 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.665e+02 1.857e+02 2.037e+02 2.752e+02, threshold=3.714e+02, percent-clipped=0.0 2023-10-11 20:00:30,058 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=816624.6666666666, ans=0.125 2023-10-11 20:00:57,861 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.11 vs. limit=22.5 2023-10-11 20:01:07,938 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.78 vs. limit=15.0 2023-10-11 20:01:21,315 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=816858.0, ans=0.2 2023-10-11 20:01:32,034 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=816904.6666666666, ans=0.2 2023-10-11 20:01:35,457 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.28 vs. limit=22.5 2023-10-11 20:01:39,660 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=816904.6666666666, ans=15.0 2023-10-11 20:01:43,698 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=816951.3333333334, ans=0.2 2023-10-11 20:01:46,542 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=816951.3333333334, ans=0.125 2023-10-11 20:01:52,971 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.302e+02 1.619e+02 1.826e+02 2.052e+02 2.830e+02, threshold=3.652e+02, percent-clipped=0.0 2023-10-11 20:01:57,562 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=816998.0, ans=0.125 2023-10-11 20:01:57,604 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=816998.0, ans=0.1 2023-10-11 20:02:31,292 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=817138.0, ans=0.125 2023-10-11 20:02:33,300 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=817138.0, ans=0.1 2023-10-11 20:02:58,368 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=817231.3333333334, ans=0.125 2023-10-11 20:03:11,047 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.84 vs. 
limit=6.0 2023-10-11 20:03:13,490 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=817278.0, ans=0.125 2023-10-11 20:03:28,686 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=817371.3333333334, ans=0.125 2023-10-11 20:03:31,866 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=817371.3333333334, ans=0.2 2023-10-11 20:03:42,952 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0 2023-10-11 20:03:51,724 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.403e+02 1.668e+02 1.823e+02 2.086e+02 3.030e+02, threshold=3.646e+02, percent-clipped=0.0 2023-10-11 20:04:02,748 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.80 vs. limit=12.0 2023-10-11 20:04:11,168 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=817511.3333333334, ans=0.0 2023-10-11 20:04:22,874 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.06 vs. limit=22.5 2023-10-11 20:04:50,170 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=817698.0, ans=0.2 2023-10-11 20:04:54,430 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=817698.0, ans=0.1 2023-10-11 20:05:17,964 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=817791.3333333334, ans=0.015 2023-10-11 20:05:31,477 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=817838.0, ans=0.0 2023-10-11 20:05:45,531 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.750e+02 1.871e+02 2.135e+02 2.722e+02, threshold=3.741e+02, percent-clipped=0.0 2023-10-11 20:05:48,691 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=817931.3333333334, ans=0.2 2023-10-11 20:05:49,703 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=817931.3333333334, ans=0.125 2023-10-11 20:06:04,668 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=817978.0, ans=0.125 2023-10-11 20:06:15,269 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=818024.6666666666, ans=0.0 2023-10-11 20:06:28,871 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=818071.3333333334, ans=0.125 2023-10-11 20:06:35,059 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.56 vs. 
limit=15.0 2023-10-11 20:06:43,702 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=818118.0, ans=0.125 2023-10-11 20:07:06,874 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.82 vs. limit=22.5 2023-10-11 20:07:10,216 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=818258.0, ans=0.0 2023-10-11 20:07:10,223 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-11 20:07:14,361 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=818258.0, ans=0.04949747468305833 2023-10-11 20:07:30,168 INFO [train.py:1031] (0/4) Epoch 13, batch 11500, loss[loss=0.2108, simple_loss=0.3071, pruned_loss=0.05732, over 16866.00 frames. ], tot_loss[loss=0.1987, simple_loss=0.2883, pruned_loss=0.05456, over 32670637.72 frames. ], batch size: 155, lr: 2.68e-03, grad_scale: 32.0 2023-10-11 20:07:31,611 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.83 vs. limit=22.5 2023-10-11 20:07:33,037 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=818351.3333333334, ans=0.125 2023-10-11 20:07:40,683 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.398e+02 1.851e+02 2.098e+02 2.486e+02 3.679e+02, threshold=4.195e+02, percent-clipped=0.0 2023-10-11 20:07:59,766 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.38 vs. limit=15.0 2023-10-11 20:08:14,632 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=818538.0, ans=0.125 2023-10-11 20:08:48,713 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=818631.3333333334, ans=0.125 2023-10-11 20:08:54,235 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.05 vs. limit=22.5 2023-10-11 20:09:09,005 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=818724.6666666666, ans=0.125 2023-10-11 20:09:10,392 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=818724.6666666666, ans=0.2 2023-10-11 20:09:16,478 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=818724.6666666666, ans=0.0 2023-10-11 20:09:31,002 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.62 vs. limit=22.5 2023-10-11 20:09:45,068 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.366e+02 1.646e+02 1.816e+02 2.069e+02 3.193e+02, threshold=3.633e+02, percent-clipped=0.0 2023-10-11 20:09:48,701 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=5.18 vs. 
limit=12.0 2023-10-11 20:09:58,358 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=818911.3333333334, ans=0.1 2023-10-11 20:09:58,481 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=818911.3333333334, ans=0.0 2023-10-11 20:10:03,918 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=818911.3333333334, ans=0.2 2023-10-11 20:10:13,765 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.79 vs. limit=10.0 2023-10-11 20:10:25,880 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=819004.6666666666, ans=0.0 2023-10-11 20:10:39,891 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=819051.3333333334, ans=0.125 2023-10-11 20:10:45,654 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 20:10:54,075 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=819144.6666666666, ans=0.125 2023-10-11 20:10:59,067 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=819144.6666666666, ans=0.125 2023-10-11 20:11:01,126 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=819191.3333333334, ans=0.2 2023-10-11 20:11:04,824 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 20:11:05,103 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.81 vs. limit=15.0 2023-10-11 20:11:33,924 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.286e+02 1.663e+02 1.794e+02 2.026e+02 2.722e+02, threshold=3.588e+02, percent-clipped=0.0 2023-10-11 20:11:56,880 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.61 vs. limit=15.0 2023-10-11 20:12:01,631 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=819424.6666666666, ans=0.05 2023-10-11 20:12:07,907 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=819471.3333333334, ans=0.0 2023-10-11 20:12:08,128 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.65 vs. 
limit=22.5 2023-10-11 20:12:30,139 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=819518.0, ans=0.1 2023-10-11 20:12:42,353 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=819564.6666666666, ans=0.125 2023-10-11 20:12:42,362 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=819564.6666666666, ans=0.0 2023-10-11 20:12:42,377 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=819564.6666666666, ans=0.1 2023-10-11 20:13:19,343 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=819704.6666666666, ans=0.07 2023-10-11 20:13:25,900 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-11 20:13:37,198 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.639e+02 1.828e+02 2.073e+02 2.770e+02, threshold=3.657e+02, percent-clipped=0.0 2023-10-11 20:13:41,042 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=819798.0, ans=0.125 2023-10-11 20:13:42,077 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=819798.0, ans=0.125 2023-10-11 20:13:48,598 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.97 vs. limit=15.0 2023-10-11 20:14:00,658 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.97 vs. limit=15.0 2023-10-11 20:14:01,589 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=819891.3333333334, ans=0.0 2023-10-11 20:14:07,369 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=819891.3333333334, ans=0.1 2023-10-11 20:14:11,181 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.71 vs. 
limit=12.0 2023-10-11 20:14:40,745 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=820031.3333333334, ans=0.09899494936611666 2023-10-11 20:14:40,794 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=820031.3333333334, ans=0.0 2023-10-11 20:14:51,953 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 20:14:56,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=820078.0, ans=0.0 2023-10-11 20:15:20,693 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=820171.3333333334, ans=0.0 2023-10-11 20:15:29,729 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=820218.0, ans=0.125 2023-10-11 20:15:34,621 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 20:15:39,350 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.371e+02 1.751e+02 1.878e+02 2.168e+02 3.461e+02, threshold=3.757e+02, percent-clipped=0.0 2023-10-11 20:15:40,154 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.94 vs. limit=15.0 2023-10-11 20:15:47,181 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=820264.6666666666, ans=0.125 2023-10-11 20:16:00,045 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=820358.0, ans=0.125 2023-10-11 20:16:12,769 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=820404.6666666666, ans=0.125 2023-10-11 20:16:16,834 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=820404.6666666666, ans=0.0 2023-10-11 20:16:21,980 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.96 vs. limit=15.0 2023-10-11 20:16:26,000 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=820451.3333333334, ans=0.125 2023-10-11 20:16:57,019 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=820544.6666666666, ans=0.125 2023-10-11 20:17:21,797 INFO [train.py:1031] (0/4) Epoch 13, batch 12000, loss[loss=0.2108, simple_loss=0.3016, pruned_loss=0.06004, over 16500.00 frames. ], tot_loss[loss=0.1983, simple_loss=0.2883, pruned_loss=0.05418, over 32715573.02 frames. ], batch size: 266, lr: 2.68e-03, grad_scale: 32.0 2023-10-11 20:17:34,162 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.358e+02 1.652e+02 1.869e+02 2.166e+02 3.234e+02, threshold=3.739e+02, percent-clipped=0.0 2023-10-11 20:17:37,809 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=820731.3333333334, ans=0.0 2023-10-11 20:17:55,793 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.61 vs. 
limit=15.0 2023-10-11 20:18:17,034 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=820871.3333333334, ans=0.125 2023-10-11 20:18:30,764 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=820918.0, ans=0.1 2023-10-11 20:18:49,235 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=821011.3333333334, ans=0.125 2023-10-11 20:19:25,717 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.77 vs. limit=22.5 2023-10-11 20:19:29,363 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=821151.3333333334, ans=0.05 2023-10-11 20:19:29,493 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=821151.3333333334, ans=0.125 2023-10-11 20:19:36,017 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.372e+02 1.593e+02 1.840e+02 2.000e+02 3.453e+02, threshold=3.680e+02, percent-clipped=0.0 2023-10-11 20:20:03,039 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=821291.3333333334, ans=0.125 2023-10-11 20:20:05,742 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-176000.pt 2023-10-11 20:20:27,390 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=821384.6666666666, ans=0.0 2023-10-11 20:20:27,429 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=821384.6666666666, ans=0.125 2023-10-11 20:20:30,193 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=821431.3333333334, ans=0.125 2023-10-11 20:20:34,356 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=821431.3333333334, ans=0.0 2023-10-11 20:20:43,073 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=821478.0, ans=0.125 2023-10-11 20:20:45,348 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.77 vs. 
limit=15.0 2023-10-11 20:20:47,940 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=821478.0, ans=0.125 2023-10-11 20:20:54,413 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=821524.6666666666, ans=0.125 2023-10-11 20:20:57,468 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=821524.6666666666, ans=0.1 2023-10-11 20:21:03,695 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=821571.3333333334, ans=0.125 2023-10-11 20:21:26,264 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.334e+02 1.655e+02 1.843e+02 2.001e+02 2.814e+02, threshold=3.685e+02, percent-clipped=0.0 2023-10-11 20:21:35,012 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=821664.6666666666, ans=0.125 2023-10-11 20:21:49,402 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.65 vs. limit=15.0 2023-10-11 20:22:08,749 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.18 vs. limit=15.0 2023-10-11 20:22:14,977 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=821851.3333333334, ans=0.0 2023-10-11 20:22:20,106 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-11 20:22:28,447 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=821898.0, ans=0.0 2023-10-11 20:23:08,361 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=822084.6666666666, ans=0.015 2023-10-11 20:23:08,430 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=822084.6666666666, ans=0.125 2023-10-11 20:23:10,705 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=822084.6666666666, ans=0.125 2023-10-11 20:23:18,764 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.441e+02 1.759e+02 2.000e+02 2.189e+02 2.781e+02, threshold=4.000e+02, percent-clipped=0.0 2023-10-11 20:23:41,222 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=822224.6666666666, ans=0.1 2023-10-11 20:23:44,076 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=822224.6666666666, ans=0.125 2023-10-11 20:24:21,001 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.65 vs. limit=15.0 2023-10-11 20:24:25,860 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=822364.6666666666, ans=0.125 2023-10-11 20:24:26,190 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.75 vs. 
limit=15.0 2023-10-11 20:24:28,828 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=822364.6666666666, ans=0.125 2023-10-11 20:24:31,129 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 20:24:42,454 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=822458.0, ans=0.125 2023-10-11 20:24:58,879 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.21 vs. limit=15.0 2023-10-11 20:25:08,352 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=822551.3333333334, ans=0.2 2023-10-11 20:25:17,302 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.436e+02 1.712e+02 1.845e+02 2.064e+02 2.666e+02, threshold=3.690e+02, percent-clipped=0.0 2023-10-11 20:25:26,751 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=822598.0, ans=0.125 2023-10-11 20:25:33,581 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.13 vs. limit=12.0 2023-10-11 20:25:39,640 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=822691.3333333334, ans=0.0 2023-10-11 20:25:39,643 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=822691.3333333334, ans=0.0 2023-10-11 20:25:59,291 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=822738.0, ans=0.2 2023-10-11 20:26:02,685 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=822738.0, ans=0.125 2023-10-11 20:26:23,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=822831.3333333334, ans=0.125 2023-10-11 20:26:52,401 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=822971.3333333334, ans=0.125 2023-10-11 20:26:54,548 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.54 vs. limit=12.0 2023-10-11 20:27:02,398 INFO [train.py:1031] (0/4) Epoch 13, batch 12500, loss[loss=0.2051, simple_loss=0.3007, pruned_loss=0.05472, over 16948.00 frames. ], tot_loss[loss=0.1983, simple_loss=0.2881, pruned_loss=0.05425, over 32751490.14 frames. ], batch size: 156, lr: 2.67e-03, grad_scale: 64.0 2023-10-11 20:27:13,842 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.68 vs. 
limit=10.0 2023-10-11 20:27:14,130 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.406e+02 1.690e+02 1.886e+02 2.076e+02 3.499e+02, threshold=3.772e+02, percent-clipped=0.0 2023-10-11 20:27:17,434 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=823064.6666666666, ans=0.0 2023-10-11 20:27:18,493 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=823064.6666666666, ans=0.125 2023-10-11 20:27:30,508 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 20:27:40,623 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=823158.0, ans=0.0 2023-10-11 20:28:04,669 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=823251.3333333334, ans=0.1 2023-10-11 20:28:40,481 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=823391.3333333334, ans=0.125 2023-10-11 20:28:55,785 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=823484.6666666666, ans=0.09899494936611666 2023-10-11 20:29:06,397 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.454e+02 1.659e+02 1.791e+02 2.035e+02 4.092e+02, threshold=3.582e+02, percent-clipped=1.0 2023-10-11 20:29:07,545 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=823531.3333333334, ans=0.1 2023-10-11 20:29:49,649 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-11 20:30:18,291 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=823811.3333333334, ans=0.0 2023-10-11 20:30:21,393 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=823811.3333333334, ans=0.0 2023-10-11 20:30:24,424 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=823811.3333333334, ans=0.0 2023-10-11 20:30:28,162 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=823858.0, ans=0.1 2023-10-11 20:30:37,391 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=823904.6666666666, ans=0.125 2023-10-11 20:30:44,235 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=823904.6666666666, ans=0.5 2023-10-11 20:31:03,082 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.375e+02 1.656e+02 1.828e+02 2.081e+02 3.632e+02, threshold=3.655e+02, percent-clipped=1.0 2023-10-11 20:31:04,388 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=823998.0, ans=0.1 2023-10-11 20:31:07,738 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.61 vs. 
limit=15.0 2023-10-11 20:31:25,974 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=824091.3333333334, ans=0.125 2023-10-11 20:32:00,626 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.61 vs. limit=15.0 2023-10-11 20:32:01,164 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=824231.3333333334, ans=0.125 2023-10-11 20:32:07,104 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=824231.3333333334, ans=0.125 2023-10-11 20:32:10,494 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.81 vs. limit=6.0 2023-10-11 20:32:19,397 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.44 vs. limit=22.5 2023-10-11 20:32:21,440 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.48 vs. limit=6.0 2023-10-11 20:32:23,050 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=824324.6666666666, ans=0.2 2023-10-11 20:32:40,531 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=824371.3333333334, ans=0.125 2023-10-11 20:32:53,850 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=824464.6666666666, ans=0.125 2023-10-11 20:32:55,040 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.415e+02 1.729e+02 1.860e+02 2.054e+02 3.071e+02, threshold=3.720e+02, percent-clipped=0.0 2023-10-11 20:33:13,242 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=824558.0, ans=0.125 2023-10-11 20:33:25,153 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=824604.6666666666, ans=0.04949747468305833 2023-10-11 20:33:29,835 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=824604.6666666666, ans=0.015 2023-10-11 20:34:15,128 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=824791.3333333334, ans=0.07 2023-10-11 20:34:28,126 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.24 vs. limit=10.0 2023-10-11 20:34:48,623 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.326e+02 1.658e+02 1.901e+02 2.250e+02 3.227e+02, threshold=3.802e+02, percent-clipped=0.0 2023-10-11 20:34:57,155 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.01 vs. 
limit=12.0 2023-10-11 20:34:59,805 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=824978.0, ans=0.025 2023-10-11 20:35:01,557 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=824978.0, ans=0.0 2023-10-11 20:35:15,926 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=825024.6666666666, ans=0.09899494936611666 2023-10-11 20:35:35,548 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.41 vs. limit=15.0 2023-10-11 20:35:41,088 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=825164.6666666666, ans=0.1 2023-10-11 20:35:56,598 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=825211.3333333334, ans=0.0 2023-10-11 20:36:23,809 INFO [train.py:1031] (0/4) Epoch 13, batch 13000, loss[loss=0.1994, simple_loss=0.2851, pruned_loss=0.05683, over 16679.00 frames. ], tot_loss[loss=0.1987, simple_loss=0.2886, pruned_loss=0.05439, over 32751967.49 frames. ], batch size: 61, lr: 2.67e-03, grad_scale: 16.0 2023-10-11 20:36:29,703 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.04 vs. limit=15.0 2023-10-11 20:36:31,548 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.56 vs. limit=15.0 2023-10-11 20:36:38,355 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.422e+02 1.701e+02 1.901e+02 2.210e+02 3.263e+02, threshold=3.802e+02, percent-clipped=0.0 2023-10-11 20:36:58,544 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=825444.6666666666, ans=0.125 2023-10-11 20:37:27,571 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=825584.6666666666, ans=0.0 2023-10-11 20:37:35,015 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=825584.6666666666, ans=0.0 2023-10-11 20:37:36,531 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.86 vs. limit=15.0 2023-10-11 20:37:37,691 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.92 vs. limit=12.0 2023-10-11 20:38:12,875 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=825724.6666666666, ans=0.0 2023-10-11 20:38:40,951 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.354e+02 1.691e+02 1.888e+02 2.203e+02 3.476e+02, threshold=3.775e+02, percent-clipped=0.0 2023-10-11 20:38:45,945 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.69 vs. 
limit=15.0 2023-10-11 20:38:47,126 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=825864.6666666666, ans=0.0 2023-10-11 20:38:52,183 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=825911.3333333334, ans=0.0 2023-10-11 20:38:53,058 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=825911.3333333334, ans=0.125 2023-10-11 20:39:11,761 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.05 vs. limit=15.0 2023-10-11 20:39:24,097 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=826051.3333333334, ans=0.125 2023-10-11 20:39:31,012 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=826051.3333333334, ans=0.125 2023-10-11 20:39:36,825 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=826098.0, ans=0.125 2023-10-11 20:40:06,916 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=826191.3333333334, ans=0.0 2023-10-11 20:40:18,482 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=826238.0, ans=0.125 2023-10-11 20:40:32,456 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.78 vs. limit=10.0 2023-10-11 20:40:37,768 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.410e+02 1.735e+02 1.902e+02 2.150e+02 2.822e+02, threshold=3.805e+02, percent-clipped=0.0 2023-10-11 20:40:47,962 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.96 vs. limit=10.0 2023-10-11 20:40:51,773 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=826378.0, ans=0.125 2023-10-11 20:40:51,915 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=826378.0, ans=0.0 2023-10-11 20:40:58,325 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.64 vs. 
limit=6.0 2023-10-11 20:41:02,562 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=826424.6666666666, ans=0.125 2023-10-11 20:41:04,690 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=826424.6666666666, ans=0.0 2023-10-11 20:41:16,374 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=826471.3333333334, ans=0.035 2023-10-11 20:41:21,337 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=826518.0, ans=0.125 2023-10-11 20:41:24,854 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=826518.0, ans=0.0 2023-10-11 20:41:33,387 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=826564.6666666666, ans=0.0 2023-10-11 20:41:44,018 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=826611.3333333334, ans=15.0 2023-10-11 20:41:50,657 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.53 vs. limit=10.0 2023-10-11 20:42:26,544 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=826798.0, ans=0.125 2023-10-11 20:42:30,534 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.695e+02 1.849e+02 2.048e+02 2.577e+02, threshold=3.698e+02, percent-clipped=0.0 2023-10-11 20:42:32,703 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=826798.0, ans=0.07 2023-10-11 20:42:35,787 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=826798.0, ans=0.125 2023-10-11 20:43:06,793 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=826938.0, ans=0.125 2023-10-11 20:43:24,370 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=827031.3333333334, ans=0.0 2023-10-11 20:43:54,353 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=827171.3333333334, ans=0.2 2023-10-11 20:44:17,819 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=827264.6666666666, ans=0.125 2023-10-11 20:44:22,115 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.447e+02 1.721e+02 1.914e+02 2.155e+02 2.948e+02, threshold=3.827e+02, percent-clipped=0.0 2023-10-11 20:44:55,713 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=827404.6666666666, ans=0.125 2023-10-11 20:45:18,634 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=827498.0, ans=0.125 2023-10-11 20:45:27,615 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=827544.6666666666, ans=0.0 2023-10-11 20:45:27,638 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, 
batch_count=827544.6666666666, ans=0.125 2023-10-11 20:45:32,594 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.64 vs. limit=15.0 2023-10-11 20:45:36,599 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=827591.3333333334, ans=6.0 2023-10-11 20:45:47,632 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=827638.0, ans=0.125 2023-10-11 20:45:56,092 INFO [train.py:1031] (0/4) Epoch 13, batch 13500, loss[loss=0.1961, simple_loss=0.2934, pruned_loss=0.04944, over 16929.00 frames. ], tot_loss[loss=0.1984, simple_loss=0.2881, pruned_loss=0.05432, over 32752709.52 frames. ], batch size: 138, lr: 2.67e-03, grad_scale: 32.0 2023-10-11 20:46:09,033 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.358e+02 1.744e+02 1.989e+02 2.497e+02 3.903e+02, threshold=3.978e+02, percent-clipped=1.0 2023-10-11 20:46:16,795 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=827778.0, ans=0.0 2023-10-11 20:46:16,881 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=827778.0, ans=0.5 2023-10-11 20:46:26,543 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=827778.0, ans=0.0 2023-10-11 20:46:33,356 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.77 vs. limit=15.0 2023-10-11 20:46:37,258 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.08 vs. limit=15.0 2023-10-11 20:46:38,287 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=827824.6666666666, ans=0.0 2023-10-11 20:47:04,667 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=827964.6666666666, ans=0.1 2023-10-11 20:47:06,472 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=827964.6666666666, ans=0.125 2023-10-11 20:47:08,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=827964.6666666666, ans=0.125 2023-10-11 20:47:08,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=827964.6666666666, ans=0.1 2023-10-11 20:47:10,481 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=827964.6666666666, ans=0.04949747468305833 2023-10-11 20:47:23,656 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=828058.0, ans=0.125 2023-10-11 20:47:28,296 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.97 vs. 
limit=22.5 2023-10-11 20:47:30,884 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=828058.0, ans=0.1 2023-10-11 20:47:41,941 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=828104.6666666666, ans=0.0 2023-10-11 20:47:56,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=828198.0, ans=0.2 2023-10-11 20:47:59,521 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.693e+02 1.889e+02 2.273e+02 3.403e+02, threshold=3.778e+02, percent-clipped=0.0 2023-10-11 20:48:13,915 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 20:48:20,146 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=828291.3333333334, ans=0.1 2023-10-11 20:48:39,563 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/epoch-13.pt 2023-10-11 20:49:11,103 INFO [train.py:1031] (0/4) Epoch 14, batch 0, loss[loss=0.1784, simple_loss=0.2657, pruned_loss=0.04558, over 16822.00 frames. ], tot_loss[loss=0.1784, simple_loss=0.2657, pruned_loss=0.04558, over 16822.00 frames. ], batch size: 175, lr: 2.56e-03, grad_scale: 32.0 2023-10-11 20:49:11,104 INFO [train.py:1054] (0/4) Computing validation loss 2023-10-11 20:49:19,449 INFO [train.py:1063] (0/4) Epoch 14, validation: loss=0.2166, simple_loss=0.3041, pruned_loss=0.06458, over 1020973.00 frames. 2023-10-11 20:49:19,449 INFO [train.py:1064] (0/4) Maximum memory allocated so far is 17165MB 2023-10-11 20:49:23,108 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.93 vs. limit=15.0 2023-10-11 20:49:29,913 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=828408.0, ans=0.2 2023-10-11 20:49:30,836 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=828454.6666666666, ans=0.1 2023-10-11 20:49:51,733 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=828501.3333333334, ans=0.125 2023-10-11 20:49:59,266 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=828548.0, ans=0.0 2023-10-11 20:50:09,416 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.01 vs. 
limit=15.0 2023-10-11 20:50:14,452 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=828594.6666666666, ans=0.125 2023-10-11 20:50:23,765 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=828641.3333333334, ans=0.125 2023-10-11 20:50:26,572 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.350e+02 1.672e+02 1.844e+02 2.142e+02 3.501e+02, threshold=3.689e+02, percent-clipped=0.0 2023-10-11 20:50:31,735 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=828688.0, ans=0.0 2023-10-11 20:50:40,017 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=828688.0, ans=0.07 2023-10-11 20:50:50,109 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=828734.6666666666, ans=0.0 2023-10-11 20:51:07,052 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=828828.0, ans=0.2 2023-10-11 20:51:13,492 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 20:51:26,421 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=828921.3333333334, ans=0.0 2023-10-11 20:51:35,589 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=828921.3333333334, ans=0.125 2023-10-11 20:51:42,432 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.19 vs. limit=22.5 2023-10-11 20:52:05,341 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.48 vs. 
limit=22.5 2023-10-11 20:52:18,958 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=829108.0, ans=0.125 2023-10-11 20:52:19,888 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=829108.0, ans=10.0 2023-10-11 20:52:22,310 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.358e+02 1.671e+02 1.824e+02 1.992e+02 2.679e+02, threshold=3.647e+02, percent-clipped=0.0 2023-10-11 20:52:36,530 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=829201.3333333334, ans=0.125 2023-10-11 20:52:37,531 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=829201.3333333334, ans=0.0 2023-10-11 20:53:00,348 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=829294.6666666666, ans=0.0 2023-10-11 20:53:02,161 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=829294.6666666666, ans=0.0 2023-10-11 20:53:04,108 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=829294.6666666666, ans=0.125 2023-10-11 20:53:07,443 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.27 vs. limit=15.0 2023-10-11 20:53:10,356 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=829341.3333333334, ans=0.125 2023-10-11 20:53:20,575 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=829388.0, ans=0.125 2023-10-11 20:53:29,015 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=829388.0, ans=0.0 2023-10-11 20:53:30,406 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=829388.0, ans=0.125 2023-10-11 20:53:33,197 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=829434.6666666666, ans=0.2 2023-10-11 20:53:34,444 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.10 vs. limit=15.0 2023-10-11 20:53:50,896 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.78 vs. limit=10.0 2023-10-11 20:54:19,298 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.356e+02 1.662e+02 1.850e+02 2.053e+02 2.754e+02, threshold=3.701e+02, percent-clipped=0.0 2023-10-11 20:54:20,589 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=829621.3333333334, ans=0.125 2023-10-11 20:54:25,980 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.79 vs. 
limit=15.0 2023-10-11 20:54:56,836 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=829761.3333333334, ans=0.125 2023-10-11 20:55:08,837 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=829808.0, ans=0.125 2023-10-11 20:55:15,672 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=829808.0, ans=0.2 2023-10-11 20:55:21,510 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=829854.6666666666, ans=0.07 2023-10-11 20:55:23,325 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=829854.6666666666, ans=0.1 2023-10-11 20:55:45,362 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=829948.0, ans=0.125 2023-10-11 20:55:58,543 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=829994.6666666666, ans=0.125 2023-10-11 20:56:11,336 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.434e+02 1.687e+02 1.850e+02 2.133e+02 2.932e+02, threshold=3.699e+02, percent-clipped=0.0 2023-10-11 20:56:15,713 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.39 vs. limit=15.0 2023-10-11 20:56:21,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=830088.0, ans=0.125 2023-10-11 20:56:46,543 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=830228.0, ans=0.125 2023-10-11 20:56:47,928 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=830228.0, ans=0.2 2023-10-11 20:56:49,736 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=830228.0, ans=0.0 2023-10-11 20:57:18,273 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.71 vs. 
limit=15.0 2023-10-11 20:57:24,570 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=830368.0, ans=0.1 2023-10-11 20:57:37,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=830414.6666666666, ans=0.1 2023-10-11 20:57:55,926 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=830508.0, ans=0.125 2023-10-11 20:57:57,326 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=830508.0, ans=0.1 2023-10-11 20:58:02,387 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.401e+02 1.794e+02 2.063e+02 2.429e+02 3.388e+02, threshold=4.125e+02, percent-clipped=0.0 2023-10-11 20:58:07,703 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=830554.6666666666, ans=0.0 2023-10-11 20:58:08,528 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=830554.6666666666, ans=0.125 2023-10-11 20:58:13,444 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 20:58:18,255 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=830601.3333333334, ans=0.125 2023-10-11 20:58:22,466 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.32 vs. limit=22.5 2023-10-11 20:58:28,791 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=830648.0, ans=0.125 2023-10-11 20:58:38,589 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.51 vs. limit=15.0 2023-10-11 20:58:51,641 INFO [train.py:1031] (0/4) Epoch 14, batch 500, loss[loss=0.1975, simple_loss=0.2889, pruned_loss=0.053, over 16725.00 frames. ], tot_loss[loss=0.1995, simple_loss=0.2893, pruned_loss=0.05485, over 7296232.65 frames. ], batch size: 61, lr: 2.56e-03, grad_scale: 32.0 2023-10-11 20:58:56,680 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=830741.3333333334, ans=0.125 2023-10-11 20:59:01,113 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=830788.0, ans=0.015 2023-10-11 20:59:03,629 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=830788.0, ans=0.125 2023-10-11 20:59:11,810 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=830788.0, ans=0.125 2023-10-11 20:59:25,661 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.26 vs. 
limit=15.0 2023-10-11 20:59:52,981 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=830974.6666666666, ans=0.04949747468305833 2023-10-11 20:59:54,572 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 1.703e+02 1.898e+02 2.181e+02 3.303e+02, threshold=3.795e+02, percent-clipped=0.0 2023-10-11 20:59:56,313 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=831021.3333333334, ans=0.125 2023-10-11 20:59:57,249 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=831021.3333333334, ans=0.5 2023-10-11 21:00:17,997 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.58 vs. limit=6.0 2023-10-11 21:00:24,018 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_na.min_abs, batch_count=831114.6666666666, ans=0.02 2023-10-11 21:00:58,312 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=831254.6666666666, ans=0.2 2023-10-11 21:01:07,109 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=831301.3333333334, ans=0.0 2023-10-11 21:01:14,375 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=831301.3333333334, ans=0.0 2023-10-11 21:01:16,649 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys.whitening_limit, batch_count=831348.0, ans=6.0 2023-10-11 21:01:23,272 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=831348.0, ans=0.0 2023-10-11 21:01:31,686 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=831394.6666666666, ans=0.04949747468305833 2023-10-11 21:01:33,393 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=831394.6666666666, ans=0.125 2023-10-11 21:01:49,418 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.462e+02 1.712e+02 1.886e+02 2.064e+02 3.139e+02, threshold=3.772e+02, percent-clipped=0.0 2023-10-11 21:02:07,770 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=831534.6666666666, ans=0.1 2023-10-11 21:02:51,774 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=831721.3333333334, ans=0.125 2023-10-11 21:02:55,414 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=831768.0, ans=0.09899494936611666 2023-10-11 21:03:17,109 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=831814.6666666666, ans=0.0 2023-10-11 21:03:30,739 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=831861.3333333334, ans=0.1 2023-10-11 21:03:40,877 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=831908.0, ans=0.125 2023-10-11 21:03:41,597 
INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.735e+02 1.856e+02 2.090e+02 2.929e+02, threshold=3.713e+02, percent-clipped=0.0 2023-10-11 21:03:42,994 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=831954.6666666666, ans=0.0 2023-10-11 21:03:46,719 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.70 vs. limit=15.0 2023-10-11 21:03:58,609 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=832001.3333333334, ans=0.125 2023-10-11 21:04:11,296 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=832048.0, ans=0.1 2023-10-11 21:04:13,214 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=832048.0, ans=0.1 2023-10-11 21:04:15,387 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=832094.6666666666, ans=0.125 2023-10-11 21:04:45,375 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.78 vs. limit=10.0 2023-10-11 21:04:48,969 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=832234.6666666666, ans=0.2 2023-10-11 21:04:50,204 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=832234.6666666666, ans=0.1 2023-10-11 21:05:38,068 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.488e+02 1.717e+02 1.854e+02 2.077e+02 2.627e+02, threshold=3.708e+02, percent-clipped=0.0 2023-10-11 21:05:44,829 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=832421.3333333334, ans=0.1 2023-10-11 21:06:10,850 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=832514.6666666666, ans=0.125 2023-10-11 21:06:16,055 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.79 vs. 
limit=15.0 2023-10-11 21:06:33,969 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=832608.0, ans=0.1 2023-10-11 21:06:44,506 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=832654.6666666666, ans=0.2 2023-10-11 21:06:58,558 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=832748.0, ans=0.2 2023-10-11 21:07:09,222 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 21:07:18,199 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=832794.6666666666, ans=0.125 2023-10-11 21:07:34,080 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.344e+02 1.619e+02 1.755e+02 1.906e+02 3.111e+02, threshold=3.510e+02, percent-clipped=0.0 2023-10-11 21:07:58,780 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=832934.6666666666, ans=0.125 2023-10-11 21:08:02,297 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.10 vs. limit=15.0 2023-10-11 21:08:02,873 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=832981.3333333334, ans=0.125 2023-10-11 21:08:21,743 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=833028.0, ans=0.0 2023-10-11 21:08:24,596 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 21:08:25,184 INFO [train.py:1031] (0/4) Epoch 14, batch 1000, loss[loss=0.2202, simple_loss=0.3038, pruned_loss=0.06827, over 16723.00 frames. ], tot_loss[loss=0.1994, simple_loss=0.2892, pruned_loss=0.05477, over 12960235.51 frames. ], batch size: 56, lr: 2.55e-03, grad_scale: 32.0 2023-10-11 21:08:27,456 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=833074.6666666666, ans=0.0 2023-10-11 21:08:40,671 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=833121.3333333334, ans=0.2 2023-10-11 21:08:43,147 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=833121.3333333334, ans=0.1 2023-10-11 21:08:44,081 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=833121.3333333334, ans=0.125 2023-10-11 21:08:55,914 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=833168.0, ans=0.125 2023-10-11 21:09:05,169 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=833214.6666666666, ans=0.0 2023-10-11 21:09:05,913 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=833214.6666666666, ans=0.125 2023-10-11 21:09:09,051 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.47 vs. 
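
The scaling.py:979 "Whitening" records compare a per-module metric against a limit; the metric is 1.0 for perfectly whitened (isotropic) features and grows as the covariance becomes more anisotropic. The sketch below computes one standard dispersion ratio of this kind, d * tr(C^2) / tr(C)^2 of the per-group covariance C; it illustrates the quantity being compared and is not claimed to be the exact formula in scaling.py.

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
        """Dispersion ratio d * tr(C^2) / tr(C)^2 of the per-group feature
        covariance C: 1.0 when C is a multiple of the identity (fully
        whitened features), growing with anisotropy."""
        n, c = x.shape
        d = c // num_groups
        x = x.reshape(n, num_groups, d).transpose(0, 1)      # (groups, n, d)
        x = x - x.mean(dim=1, keepdim=True)
        cov = x.transpose(1, 2) @ x / n                      # (groups, d, d)
        tr = cov.diagonal(dim1=1, dim2=2).sum(dim=1)         # tr(C)
        tr2 = (cov * cov).sum(dim=(1, 2))                    # tr(C @ C); C symmetric
        return (d * tr2 / (tr * tr + 1e-20)).mean().item()

    feats = torch.randn(1000, 384)                # nearly white
    print(whitening_metric(feats), "vs. limit=15.0")
    feats[:, 0] *= 30.0                           # one dominant channel
    print(whitening_metric(feats), "vs. limit=15.0")
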
limit=12.0 2023-10-11 21:09:15,117 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=833261.3333333334, ans=0.0 2023-10-11 21:09:24,075 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=833308.0, ans=0.1 2023-10-11 21:09:29,398 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.406e+02 1.660e+02 1.812e+02 2.035e+02 2.588e+02, threshold=3.625e+02, percent-clipped=0.0 2023-10-11 21:09:38,477 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=833354.6666666666, ans=0.0 2023-10-11 21:09:52,050 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=833448.0, ans=0.125 2023-10-11 21:10:21,458 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.11 vs. limit=22.5 2023-10-11 21:10:35,902 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=833588.0, ans=0.0 2023-10-11 21:10:54,448 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=833681.3333333334, ans=0.1 2023-10-11 21:10:54,470 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=833681.3333333334, ans=0.0 2023-10-11 21:11:18,819 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=833774.6666666666, ans=0.125 2023-10-11 21:11:24,176 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.334e+02 1.786e+02 2.012e+02 2.341e+02 3.564e+02, threshold=4.023e+02, percent-clipped=0.0 2023-10-11 21:12:01,043 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=833914.6666666666, ans=0.125 2023-10-11 21:12:04,769 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=833914.6666666666, ans=0.125 2023-10-11 21:12:39,440 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.93 vs. limit=6.0 2023-10-11 21:12:44,338 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=834101.3333333334, ans=0.02 2023-10-11 21:12:50,808 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 21:12:56,092 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.45 vs. 
limit=12.0 2023-10-11 21:13:28,370 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.648e+02 1.837e+02 2.021e+02 2.785e+02, threshold=3.675e+02, percent-clipped=0.0 2023-10-11 21:13:34,371 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=834288.0, ans=0.0 2023-10-11 21:13:43,616 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten.whitening_limit, batch_count=834334.6666666666, ans=22.5 2023-10-11 21:13:48,690 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=834334.6666666666, ans=0.125 2023-10-11 21:13:50,642 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=834381.3333333334, ans=0.2 2023-10-11 21:13:50,820 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.92 vs. limit=15.0 2023-10-11 21:13:51,536 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=834381.3333333334, ans=0.125 2023-10-11 21:13:56,817 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=834381.3333333334, ans=0.2 2023-10-11 21:14:00,093 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.23 vs. limit=15.0 2023-10-11 21:14:12,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=834474.6666666666, ans=0.0 2023-10-11 21:14:22,529 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=834521.3333333334, ans=0.125 2023-10-11 21:14:26,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=834521.3333333334, ans=0.125 2023-10-11 21:14:55,154 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=834661.3333333334, ans=0.125 2023-10-11 21:15:01,554 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=834661.3333333334, ans=0.0 2023-10-11 21:15:11,309 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=834708.0, ans=0.125 2023-10-11 21:15:14,636 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.403e+02 1.786e+02 2.004e+02 2.395e+02 3.657e+02, threshold=4.007e+02, percent-clipped=0.0 2023-10-11 21:15:15,845 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=834754.6666666666, ans=0.0 2023-10-11 21:15:15,884 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=834754.6666666666, ans=0.2 2023-10-11 21:15:30,057 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=834801.3333333334, ans=0.125 2023-10-11 21:15:33,104 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=834801.3333333334, ans=0.0 2023-10-11 21:15:47,767 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=834848.0, ans=0.125 2023-10-11 21:16:03,624 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=834894.6666666666, ans=0.0 2023-10-11 21:16:18,782 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=834941.3333333334, ans=0.1 2023-10-11 21:16:18,926 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=834941.3333333334, ans=0.125 2023-10-11 21:16:27,709 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=834988.0, ans=0.5 2023-10-11 21:16:27,950 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.86 vs. limit=22.5 2023-10-11 21:16:37,319 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=835034.6666666666, ans=0.0 2023-10-11 21:16:42,151 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.08 vs. limit=15.0 2023-10-11 21:16:52,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=835081.3333333334, ans=0.125 2023-10-11 21:17:22,481 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.397e+02 1.722e+02 1.882e+02 2.150e+02 2.876e+02, threshold=3.764e+02, percent-clipped=0.0 2023-10-11 21:17:31,088 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=835221.3333333334, ans=0.125 2023-10-11 21:17:48,339 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 21:18:07,263 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=835361.3333333334, ans=0.1 2023-10-11 21:18:07,584 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.84 vs. limit=15.0 2023-10-11 21:18:08,837 INFO [train.py:1031] (0/4) Epoch 14, batch 1500, loss[loss=0.1923, simple_loss=0.2885, pruned_loss=0.04808, over 16636.00 frames. ], tot_loss[loss=0.1976, simple_loss=0.2874, pruned_loss=0.05391, over 17352532.00 frames. ], batch size: 202, lr: 2.55e-03, grad_scale: 16.0 2023-10-11 21:18:10,157 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=835408.0, ans=0.0 2023-10-11 21:18:48,837 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=835548.0, ans=0.025 2023-10-11 21:19:00,448 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=835594.6666666666, ans=0.125 2023-10-11 21:19:13,033 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.93 vs. 
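
The optim.py:471 records print five order statistics of recent gradient norms (min, the three quartiles, max), a clipping threshold, and the fraction of steps clipped. In the records above the threshold sits at Clipping_scale (2.0) times the logged median, e.g. 3.795e+02 = 2 x 1.898e+02 and 3.772e+02 = 2 x 1.886e+02, which suggests median-based clipping. The sketch below implements that scheme under this assumption; it is illustrative bookkeeping, not the optimizer's actual code.

    from collections import deque

    import torch

    class GradNormClipperSketch:
        """Median-based gradient clipping, matching the logged pattern
        threshold ~= Clipping_scale * median of recent grad norms."""

        def __init__(self, clipping_scale: float = 2.0, window: int = 128):
            self.scale = clipping_scale
            self.norms = deque(maxlen=window)
            self.clipped = 0
            self.steps = 0

        def step(self, params) -> None:
            grads = [p.grad.flatten() for p in params if p.grad is not None]
            norm = torch.linalg.vector_norm(torch.cat(grads)).item()
            self.norms.append(norm)
            self.steps += 1
            threshold = self.scale * torch.tensor(list(self.norms)).median().item()
            if norm > threshold:
                self.clipped += 1
                for p in params:
                    if p.grad is not None:
                        p.grad.mul_(threshold / norm)

        def report(self) -> str:
            t = torch.tensor(list(self.norms))
            qs = [t.min().item()]
            qs += [torch.quantile(t, q).item() for q in (0.25, 0.5, 0.75)]
            qs += [t.max().item()]
            pct = 100.0 * self.clipped / max(1, self.steps)
            quart = " ".join(f"{v:.3e}" for v in qs)
            return (f"grad-norm quartiles {quart}, "
                    f"threshold={self.scale * qs[2]:.3e}, percent-clipped={pct}")
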
limit=22.5 2023-10-11 21:19:15,060 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=835641.3333333334, ans=0.125 2023-10-11 21:19:16,543 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.325e+02 1.664e+02 1.873e+02 2.074e+02 2.666e+02, threshold=3.747e+02, percent-clipped=0.0 2023-10-11 21:19:23,823 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=835688.0, ans=0.09899494936611666 2023-10-11 21:19:25,037 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=835688.0, ans=0.0 2023-10-11 21:19:26,722 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=835734.6666666666, ans=0.0 2023-10-11 21:19:38,385 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=835781.3333333334, ans=0.95 2023-10-11 21:19:43,321 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=835781.3333333334, ans=0.125 2023-10-11 21:19:53,187 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=835828.0, ans=0.0 2023-10-11 21:20:00,942 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=835874.6666666666, ans=0.125 2023-10-11 21:20:27,588 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=835968.0, ans=0.1 2023-10-11 21:20:39,454 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=836014.6666666666, ans=0.125 2023-10-11 21:21:00,031 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.06 vs. 
limit=15.0 2023-10-11 21:21:02,739 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=836061.3333333334, ans=0.125 2023-10-11 21:21:06,286 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=836108.0, ans=0.125 2023-10-11 21:21:12,597 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=836108.0, ans=0.125 2023-10-11 21:21:15,609 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.476e+02 1.688e+02 1.815e+02 1.975e+02 2.958e+02, threshold=3.630e+02, percent-clipped=0.0 2023-10-11 21:21:40,695 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=836248.0, ans=0.0 2023-10-11 21:21:41,781 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=836248.0, ans=0.2 2023-10-11 21:21:53,896 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 21:22:30,503 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=836434.6666666666, ans=0.125 2023-10-11 21:22:31,497 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=836481.3333333334, ans=0.125 2023-10-11 21:23:05,217 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.689e+02 1.892e+02 2.084e+02 3.451e+02, threshold=3.785e+02, percent-clipped=0.0 2023-10-11 21:23:07,327 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=836621.3333333334, ans=0.0 2023-10-11 21:23:27,159 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=836714.6666666666, ans=0.125 2023-10-11 21:23:39,866 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.32 vs. limit=15.0 2023-10-11 21:23:47,059 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.64 vs. limit=10.0 2023-10-11 21:23:55,672 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=836808.0, ans=0.125 2023-10-11 21:24:06,556 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=836854.6666666666, ans=0.2 2023-10-11 21:24:08,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=836854.6666666666, ans=0.125 2023-10-11 21:24:19,608 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=836901.3333333334, ans=0.1 2023-10-11 21:24:34,743 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=836948.0, ans=0.125 2023-10-11 21:24:47,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=836994.6666666666, ans=0.1 2023-10-11 21:25:01,160 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.80 vs. 
limit=12.0 2023-10-11 21:25:01,402 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.447e+02 1.709e+02 1.885e+02 2.107e+02 3.233e+02, threshold=3.770e+02, percent-clipped=0.0 2023-10-11 21:25:48,988 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=837274.6666666666, ans=0.125 2023-10-11 21:26:00,669 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=837321.3333333334, ans=0.0 2023-10-11 21:26:10,470 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=837368.0, ans=0.125 2023-10-11 21:26:11,454 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=837368.0, ans=0.125 2023-10-11 21:26:37,870 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=837461.3333333334, ans=0.125 2023-10-11 21:27:00,745 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.714e+02 1.865e+02 2.110e+02 3.389e+02, threshold=3.730e+02, percent-clipped=0.0 2023-10-11 21:27:01,945 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.87 vs. limit=5.0 2023-10-11 21:27:29,642 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=837648.0, ans=0.0 2023-10-11 21:27:29,687 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=837648.0, ans=0.04949747468305833 2023-10-11 21:27:30,611 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=837648.0, ans=0.125 2023-10-11 21:27:35,438 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=837694.6666666666, ans=0.05 2023-10-11 21:27:47,619 INFO [train.py:1031] (0/4) Epoch 14, batch 2000, loss[loss=0.2057, simple_loss=0.2917, pruned_loss=0.05984, over 16523.00 frames. ], tot_loss[loss=0.1979, simple_loss=0.2879, pruned_loss=0.05388, over 20813256.64 frames. 
], batch size: 56, lr: 2.55e-03, grad_scale: 32.0 2023-10-11 21:27:55,360 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=837741.3333333334, ans=0.1 2023-10-11 21:27:59,537 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=837788.0, ans=0.0 2023-10-11 21:28:19,003 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=837834.6666666666, ans=0.0 2023-10-11 21:28:37,896 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=837881.3333333334, ans=0.0 2023-10-11 21:28:48,764 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=837928.0, ans=0.1 2023-10-11 21:28:49,770 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=837928.0, ans=0.125 2023-10-11 21:28:52,009 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=3.87 vs. limit=15.0 2023-10-11 21:29:07,255 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.309e+02 1.698e+02 1.844e+02 2.072e+02 3.190e+02, threshold=3.687e+02, percent-clipped=0.0 2023-10-11 21:29:07,561 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=838021.3333333334, ans=0.0 2023-10-11 21:29:20,352 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=838068.0, ans=0.125 2023-10-11 21:29:42,549 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=838161.3333333334, ans=0.2 2023-10-11 21:30:09,465 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=838208.0, ans=0.0 2023-10-11 21:30:12,000 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.47 vs. limit=10.0 2023-10-11 21:30:43,529 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.19 vs. 
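
In the train.py:1031 progress records, loss[...] is the current batch and tot_loss[...] is a running aggregate weighted by frame counts, which is why its frame total keeps growing (12,960,235 -> 17,352,532 -> 20,813,256 across batches 1000/1500/2000). The fractional totals (e.g. 20813256.64) suggest a decay factor slightly below 1 rather than a plain sum. A sketch of that bookkeeping follows; the decay constant is an assumption.

    class TotLossSketch:
        """Frame-weighted running loss in the style of tot_loss[...];
        illustrative, with an assumed decay constant."""

        def __init__(self, decay: float = 0.999):
            self.decay = decay
            self.loss_sum = 0.0
            self.frames = 0.0

        def update(self, batch_loss: float, batch_frames: float) -> None:
            # older batches fade out; weighting by frames makes batches
            # with more audio count proportionally more
            self.loss_sum = self.decay * self.loss_sum + batch_loss * batch_frames
            self.frames = self.decay * self.frames + batch_frames

        @property
        def value(self) -> float:
            return self.loss_sum / max(self.frames, 1.0)

    agg = TotLossSketch()
    agg.update(0.2202, 16723.0)  # numbers in the style of the records above
    print(f"tot_loss={agg.value:.4f}, over {agg.frames:.2f} frames")
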
limit=10.0 2023-10-11 21:31:03,788 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=838394.6666666666, ans=0.125 2023-10-11 21:31:03,836 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=838394.6666666666, ans=0.0 2023-10-11 21:31:20,459 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=838441.3333333334, ans=0.125 2023-10-11 21:31:25,829 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.341e+02 1.686e+02 1.873e+02 2.143e+02 2.836e+02, threshold=3.747e+02, percent-clipped=0.0 2023-10-11 21:31:43,963 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 21:32:05,685 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=838628.0, ans=0.125 2023-10-11 21:32:18,348 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.06 vs. limit=15.0 2023-10-11 21:32:18,915 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=838674.6666666666, ans=0.125 2023-10-11 21:32:25,056 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.18 vs. limit=15.0 2023-10-11 21:32:33,337 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=838721.3333333334, ans=0.125 2023-10-11 21:33:03,855 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=838861.3333333334, ans=0.2 2023-10-11 21:33:11,197 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=838908.0, ans=0.09899494936611666 2023-10-11 21:33:14,525 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.36 vs. limit=22.5 2023-10-11 21:33:15,936 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=838908.0, ans=0.1 2023-10-11 21:33:20,832 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.842e+02 1.984e+02 2.247e+02 3.185e+02, threshold=3.968e+02, percent-clipped=0.0 2023-10-11 21:33:29,660 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=839001.3333333334, ans=0.2 2023-10-11 21:33:35,319 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=839001.3333333334, ans=0.0 2023-10-11 21:33:54,134 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=839094.6666666666, ans=0.125 2023-10-11 21:33:57,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=839094.6666666666, ans=0.125 2023-10-11 21:34:07,517 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.84 vs. 
limit=15.0 2023-10-11 21:34:10,883 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=839141.3333333334, ans=0.125 2023-10-11 21:34:21,845 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=839188.0, ans=0.125 2023-10-11 21:34:52,670 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=839328.0, ans=0.125 2023-10-11 21:34:59,142 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=839374.6666666666, ans=0.0 2023-10-11 21:35:09,903 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.800e+02 1.980e+02 2.176e+02 3.241e+02, threshold=3.961e+02, percent-clipped=0.0 2023-10-11 21:35:14,391 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=839421.3333333334, ans=0.125 2023-10-11 21:35:33,807 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=839514.6666666666, ans=0.2 2023-10-11 21:35:36,324 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=839514.6666666666, ans=0.0 2023-10-11 21:35:40,352 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.95 vs. limit=6.0 2023-10-11 21:36:05,096 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=839654.6666666666, ans=0.125 2023-10-11 21:36:36,652 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=839794.6666666666, ans=0.1 2023-10-11 21:36:56,477 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.786e+02 1.939e+02 2.166e+02 2.700e+02, threshold=3.879e+02, percent-clipped=0.0 2023-10-11 21:37:01,037 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.47 vs. limit=15.0 2023-10-11 21:37:23,205 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.62 vs. limit=15.0 2023-10-11 21:37:38,730 INFO [train.py:1031] (0/4) Epoch 14, batch 2500, loss[loss=0.1955, simple_loss=0.2873, pruned_loss=0.05184, over 16275.00 frames. ], tot_loss[loss=0.198, simple_loss=0.2881, pruned_loss=0.05394, over 23488105.89 frames. 
], batch size: 50, lr: 2.54e-03, grad_scale: 32.0 2023-10-11 21:37:40,798 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=840074.6666666666, ans=0.125 2023-10-11 21:37:45,041 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=840074.6666666666, ans=0.0 2023-10-11 21:37:47,342 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn1.whiten.whitening_limit, batch_count=840074.6666666666, ans=22.5 2023-10-11 21:37:47,897 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=840121.3333333334, ans=0.0 2023-10-11 21:37:55,675 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=840121.3333333334, ans=0.95 2023-10-11 21:38:09,916 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=840214.6666666666, ans=0.05 2023-10-11 21:38:10,364 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.12 vs. limit=15.0 2023-10-11 21:38:18,677 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.66 vs. limit=15.0 2023-10-11 21:38:19,655 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.83 vs. limit=6.0 2023-10-11 21:38:27,633 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.18 vs. limit=15.0 2023-10-11 21:38:37,459 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.61 vs. limit=15.0 2023-10-11 21:38:44,591 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.785e+02 1.981e+02 2.360e+02 2.983e+02, threshold=3.962e+02, percent-clipped=0.0 2023-10-11 21:38:48,337 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=840354.6666666666, ans=0.04949747468305833 2023-10-11 21:38:56,829 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=840401.3333333334, ans=0.0 2023-10-11 21:38:58,017 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.09 vs. limit=10.0 2023-10-11 21:39:55,097 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=840634.6666666666, ans=0.125 2023-10-11 21:39:58,106 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.93 vs. 
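
The three logged loss fields are consistent with a pruned-transducer objective in which the reported loss is a weighted sum of the "simple" and pruned terms: for the batch-2500 record, 0.5 x 0.2873 + 0.05184 = 0.1955, matching the logged loss (and the tot_loss fields satisfy the same relation). The helper below encodes that relation; the 0.5 scale is inferred from the numbers, not quoted from a config.

    def combined_loss(simple_loss: float, pruned_loss: float,
                      simple_loss_scale: float = 0.5) -> float:
        """Weighted sum of the two pruned-transducer loss terms; the
        0.5 scale is inferred from the logged values."""
        return simple_loss_scale * simple_loss + pruned_loss

    # batch-2500 record: loss=0.1955, simple_loss=0.2873, pruned_loss=0.05184
    assert abs(combined_loss(0.2873, 0.05184) - 0.1955) < 1e-3
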
limit=12.0 2023-10-11 21:40:05,093 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=840681.3333333334, ans=0.125 2023-10-11 21:40:06,928 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=840681.3333333334, ans=0.0 2023-10-11 21:40:21,517 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=840774.6666666666, ans=0.125 2023-10-11 21:40:24,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=840774.6666666666, ans=0.1 2023-10-11 21:40:34,347 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.460e+02 1.734e+02 1.918e+02 2.136e+02 3.051e+02, threshold=3.835e+02, percent-clipped=0.0 2023-10-11 21:40:38,303 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=840821.3333333334, ans=0.125 2023-10-11 21:40:39,938 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=840821.3333333334, ans=0.125 2023-10-11 21:40:40,756 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=840821.3333333334, ans=0.025 2023-10-11 21:40:44,574 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=840868.0, ans=0.1 2023-10-11 21:40:58,376 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=840914.6666666666, ans=0.0 2023-10-11 21:41:15,908 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=841008.0, ans=0.125 2023-10-11 21:41:18,142 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=841008.0, ans=0.2 2023-10-11 21:41:38,956 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=841054.6666666666, ans=0.125 2023-10-11 21:41:49,129 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=841101.3333333334, ans=0.125 2023-10-11 21:41:55,797 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=841148.0, ans=0.125 2023-10-11 21:41:58,032 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=841148.0, ans=0.05 2023-10-11 21:42:07,288 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=841194.6666666666, ans=0.05 2023-10-11 21:42:13,332 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=841194.6666666666, ans=0.125 2023-10-11 21:42:20,715 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=841241.3333333334, ans=0.125 2023-10-11 21:42:37,343 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.00 vs. 
limit=15.0 2023-10-11 21:42:37,574 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 1.707e+02 1.867e+02 2.076e+02 2.903e+02, threshold=3.734e+02, percent-clipped=0.0 2023-10-11 21:42:54,387 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=841334.6666666666, ans=0.2 2023-10-11 21:42:57,436 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=841334.6666666666, ans=0.125 2023-10-11 21:43:03,557 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=841381.3333333334, ans=0.0 2023-10-11 21:43:04,461 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=841381.3333333334, ans=0.125 2023-10-11 21:43:04,616 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=841381.3333333334, ans=0.125 2023-10-11 21:43:06,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=841381.3333333334, ans=0.1 2023-10-11 21:43:22,148 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=841428.0, ans=0.125 2023-10-11 21:43:31,886 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=841474.6666666666, ans=0.0 2023-10-11 21:44:04,858 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=841614.6666666666, ans=0.125 2023-10-11 21:44:22,784 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=841661.3333333334, ans=0.2 2023-10-11 21:44:28,790 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=841708.0, ans=0.2 2023-10-11 21:44:39,589 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.326e+02 1.663e+02 1.846e+02 2.146e+02 2.837e+02, threshold=3.692e+02, percent-clipped=0.0 2023-10-11 21:44:47,237 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.83 vs. limit=15.0 2023-10-11 21:45:08,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=841848.0, ans=0.0 2023-10-11 21:45:09,339 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.01 vs. 
limit=10.0 2023-10-11 21:46:04,670 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=842081.3333333334, ans=0.125 2023-10-11 21:46:14,044 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=842081.3333333334, ans=0.2 2023-10-11 21:46:17,839 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=842128.0, ans=0.125 2023-10-11 21:46:18,792 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=842128.0, ans=0.2 2023-10-11 21:46:41,294 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.327e+02 1.644e+02 1.822e+02 2.006e+02 2.815e+02, threshold=3.644e+02, percent-clipped=0.0 2023-10-11 21:46:47,980 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=842221.3333333334, ans=0.1 2023-10-11 21:46:53,205 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=842268.0, ans=0.125 2023-10-11 21:46:58,992 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=842268.0, ans=0.125 2023-10-11 21:47:09,513 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.79 vs. limit=10.0 2023-10-11 21:47:14,062 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=842361.3333333334, ans=0.07 2023-10-11 21:47:24,914 INFO [train.py:1031] (0/4) Epoch 14, batch 3000, loss[loss=0.1849, simple_loss=0.2776, pruned_loss=0.04613, over 16691.00 frames. ], tot_loss[loss=0.1973, simple_loss=0.2872, pruned_loss=0.0537, over 25551515.46 frames. ], batch size: 220, lr: 2.54e-03, grad_scale: 32.0 2023-10-11 21:47:29,130 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.34 vs. limit=15.0 2023-10-11 21:47:31,240 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=842408.0, ans=0.125 2023-10-11 21:47:43,073 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=842454.6666666666, ans=0.125 2023-10-11 21:47:49,685 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=842501.3333333334, ans=0.1 2023-10-11 21:48:00,466 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.56 vs. 
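
The lr field decays very slowly at this point in training (2.55e-03 -> 2.54e-03 -> 2.53e-03 over a few thousand batches of epoch 14). icefall's Zipformer recipes typically drive ScaledAdam with the Eden scheduler, whose published rule is sketched below; the constants here are placeholders, and the logged values may include additional factors (warmup, reference-duration scaling) not modeled in this sketch.

    def eden_lr(base_lr: float, batch: int, epoch: float,
                lr_batches: float = 5000.0, lr_epochs: float = 3.5) -> float:
        """Eden-style rule: inverse-quartic-root decay in both the batch
        and epoch axes, nearly flat early on. Constants are placeholders."""
        batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
        epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
        return base_lr * batch_factor * epoch_factor
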
limit=15.0 2023-10-11 21:48:26,538 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=842641.3333333334, ans=0.0 2023-10-11 21:48:27,696 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=842641.3333333334, ans=0.2 2023-10-11 21:48:34,394 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=842688.0, ans=0.04949747468305833 2023-10-11 21:48:35,553 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.421e+02 1.842e+02 2.010e+02 2.218e+02 3.047e+02, threshold=4.019e+02, percent-clipped=0.0 2023-10-11 21:48:40,663 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=842688.0, ans=0.07 2023-10-11 21:48:46,773 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=842734.6666666666, ans=0.125 2023-10-11 21:48:52,140 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.18 vs. limit=15.0 2023-10-11 21:50:07,232 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=843014.6666666666, ans=0.1 2023-10-11 21:50:24,645 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=843108.0, ans=0.125 2023-10-11 21:50:33,159 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.699e+02 1.854e+02 2.054e+02 2.717e+02, threshold=3.708e+02, percent-clipped=0.0 2023-10-11 21:50:46,489 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=843201.3333333334, ans=0.0 2023-10-11 21:50:57,173 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=843248.0, ans=0.125 2023-10-11 21:51:13,162 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=843294.6666666666, ans=0.125 2023-10-11 21:51:58,745 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.50 vs. limit=15.0 2023-10-11 21:52:07,783 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=843481.3333333334, ans=0.1 2023-10-11 21:52:21,412 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=843574.6666666666, ans=0.125 2023-10-11 21:52:22,965 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=843574.6666666666, ans=0.0 2023-10-11 21:52:40,085 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.815e+02 2.048e+02 2.381e+02 3.871e+02, threshold=4.096e+02, percent-clipped=1.0 2023-10-11 21:52:42,653 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.75 vs. 
limit=6.0 2023-10-11 21:52:46,832 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.max_abs, batch_count=843621.3333333334, ans=10.0 2023-10-11 21:53:08,570 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 21:53:21,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=843761.3333333334, ans=0.125 2023-10-11 21:53:25,201 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=843808.0, ans=0.1 2023-10-11 21:53:33,230 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=843808.0, ans=0.0 2023-10-11 21:53:47,126 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.42 vs. limit=22.5 2023-10-11 21:54:10,478 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=843994.6666666666, ans=0.0 2023-10-11 21:54:32,988 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.383e+02 1.721e+02 1.955e+02 2.202e+02 3.303e+02, threshold=3.911e+02, percent-clipped=0.0 2023-10-11 21:54:34,977 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 21:55:06,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=844181.3333333334, ans=0.125 2023-10-11 21:55:17,175 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=844228.0, ans=0.125 2023-10-11 21:55:27,653 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=844274.6666666666, ans=0.125 2023-10-11 21:55:32,959 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=844321.3333333334, ans=0.1 2023-10-11 21:55:48,044 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=844368.0, ans=0.025 2023-10-11 21:55:58,295 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.49 vs. limit=15.0 2023-10-11 21:56:08,952 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=844461.3333333334, ans=0.125 2023-10-11 21:56:29,037 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.713e+02 1.863e+02 2.076e+02 2.541e+02, threshold=3.725e+02, percent-clipped=0.0 2023-10-11 21:56:31,415 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.83 vs. limit=10.0 2023-10-11 21:56:33,925 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=844554.6666666666, ans=0.1 2023-10-11 21:56:35,303 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.28 vs. 
limit=5.0 2023-10-11 21:56:35,999 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=844554.6666666666, ans=0.125 2023-10-11 21:56:51,516 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=844648.0, ans=0.0 2023-10-11 21:57:05,130 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=844694.6666666666, ans=0.0 2023-10-11 21:57:05,178 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=844694.6666666666, ans=0.125 2023-10-11 21:57:14,516 INFO [train.py:1031] (0/4) Epoch 14, batch 3500, loss[loss=0.2254, simple_loss=0.308, pruned_loss=0.07145, over 16633.00 frames. ], tot_loss[loss=0.1974, simple_loss=0.287, pruned_loss=0.0539, over 27142630.11 frames. ], batch size: 241, lr: 2.54e-03, grad_scale: 16.0 2023-10-11 21:57:17,715 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 21:57:18,936 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=844741.3333333334, ans=0.0 2023-10-11 21:57:42,612 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.34 vs. limit=6.0 2023-10-11 21:57:45,748 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=844881.3333333334, ans=0.125 2023-10-11 21:57:56,165 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=844881.3333333334, ans=0.125 2023-10-11 21:57:59,061 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=844928.0, ans=0.0 2023-10-11 21:58:19,210 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=845021.3333333334, ans=0.125 2023-10-11 21:58:21,519 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.770e+02 1.936e+02 2.082e+02 2.703e+02, threshold=3.872e+02, percent-clipped=0.0 2023-10-11 21:58:29,393 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=845021.3333333334, ans=0.0 2023-10-11 21:58:46,383 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=845068.0, ans=0.1 2023-10-11 21:58:54,686 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.31 vs. limit=15.0 2023-10-11 21:59:04,438 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=845161.3333333334, ans=0.125 2023-10-11 21:59:04,496 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=845161.3333333334, ans=0.125 2023-10-11 21:59:07,534 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=845161.3333333334, ans=0.1 2023-10-11 21:59:10,523 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.47 vs. 
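
Many of the ScheduledFloat names above are balancer parameters (balancer*.prob, min_positive, max_positive, min_abs, max_abs): bounds on per-channel activation statistics that a balancer nudges values toward, applied with probability prob on a given step. The diagnostic below computes the statistics being bounded; the limit values are illustrative, and this is a sketch of the check, not the Balancer module itself.

    import torch

    def balancer_violations(x: torch.Tensor,
                            min_positive: float = 0.05,
                            max_positive: float = 0.95,
                            min_abs: float = 0.02,
                            max_abs: float = 10.0):
        """Flag channels whose fraction of positive values or mean
        absolute value falls outside the configured band.
        x: (num_frames, num_channels)."""
        frac_positive = (x > 0).float().mean(dim=0)
        mean_abs = x.abs().mean(dim=0)
        bad_sign = (frac_positive < min_positive) | (frac_positive > max_positive)
        bad_scale = (mean_abs < min_abs) | (mean_abs > max_abs)
        return bad_sign, bad_scale

    x = torch.randn(1000, 256)
    bad_sign, bad_scale = balancer_violations(x)
    print(int(bad_sign.sum()), "channels outside the sign band;",
          int(bad_scale.sum()), "outside the magnitude band")
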
limit=15.0 2023-10-11 21:59:19,955 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=845254.6666666666, ans=0.125 2023-10-11 21:59:20,317 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.08 vs. limit=10.0 2023-10-11 21:59:56,589 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=845394.6666666666, ans=0.1 2023-10-11 22:00:12,005 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.03 vs. limit=15.0 2023-10-11 22:00:17,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=845488.0, ans=0.0 2023-10-11 22:00:17,766 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=845488.0, ans=0.1 2023-10-11 22:00:20,243 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.338e+02 1.754e+02 1.892e+02 2.190e+02 3.191e+02, threshold=3.785e+02, percent-clipped=0.0 2023-10-11 22:00:23,815 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.05 vs. limit=10.0 2023-10-11 22:00:43,153 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=845581.3333333334, ans=0.125 2023-10-11 22:00:51,471 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.07 vs. limit=15.0 2023-10-11 22:01:00,858 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.98 vs. limit=15.0 2023-10-11 22:01:23,029 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=845721.3333333334, ans=0.125 2023-10-11 22:01:39,491 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=845768.0, ans=0.0 2023-10-11 22:01:52,523 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.84 vs. 
limit=15.0 2023-10-11 22:02:04,117 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=845861.3333333334, ans=0.125 2023-10-11 22:02:19,587 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.777e+02 1.932e+02 2.184e+02 3.257e+02, threshold=3.864e+02, percent-clipped=0.0 2023-10-11 22:02:24,435 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=845954.6666666666, ans=0.0 2023-10-11 22:02:27,523 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 22:02:31,117 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=846001.3333333334, ans=0.0 2023-10-11 22:02:46,741 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=846048.0, ans=0.0 2023-10-11 22:02:58,868 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=846094.6666666666, ans=0.07 2023-10-11 22:03:09,966 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=846141.3333333334, ans=0.125 2023-10-11 22:03:18,111 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=846188.0, ans=0.035 2023-10-11 22:03:24,913 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 22:03:26,626 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.92 vs. limit=22.5 2023-10-11 22:03:31,969 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=846234.6666666666, ans=0.125 2023-10-11 22:03:45,935 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=846281.3333333334, ans=0.125 2023-10-11 22:04:04,439 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=846374.6666666666, ans=0.0 2023-10-11 22:04:07,777 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=846374.6666666666, ans=0.125 2023-10-11 22:04:12,408 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=846421.3333333334, ans=0.2 2023-10-11 22:04:13,999 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.367e+02 1.673e+02 1.827e+02 2.027e+02 3.473e+02, threshold=3.654e+02, percent-clipped=0.0 2023-10-11 22:04:20,330 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.70 vs. 
limit=6.0 2023-10-11 22:04:41,180 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=846514.6666666666, ans=0.2 2023-10-11 22:04:54,247 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=846561.3333333334, ans=0.125 2023-10-11 22:04:58,292 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.08 vs. limit=15.0 2023-10-11 22:05:15,711 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=846654.6666666666, ans=0.1 2023-10-11 22:05:52,662 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=846841.3333333334, ans=0.125 2023-10-11 22:05:59,719 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=846888.0, ans=0.0 2023-10-11 22:06:02,011 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.683e+02 1.826e+02 2.205e+02 2.978e+02, threshold=3.653e+02, percent-clipped=0.0 2023-10-11 22:06:11,649 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.whiten.whitening_limit, batch_count=846934.6666666666, ans=15.0 2023-10-11 22:06:15,004 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=846934.6666666666, ans=0.125 2023-10-11 22:06:42,653 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=847028.0, ans=0.125 2023-10-11 22:06:44,789 INFO [train.py:1031] (0/4) Epoch 14, batch 4000, loss[loss=0.195, simple_loss=0.2884, pruned_loss=0.05086, over 16796.00 frames. ], tot_loss[loss=0.1974, simple_loss=0.2868, pruned_loss=0.05395, over 28414831.07 frames. ], batch size: 146, lr: 2.53e-03, grad_scale: 32.0 2023-10-11 22:06:57,726 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=847121.3333333334, ans=0.2 2023-10-11 22:07:00,496 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=847121.3333333334, ans=0.0 2023-10-11 22:07:06,695 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=847121.3333333334, ans=0.2 2023-10-11 22:07:20,768 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=847214.6666666666, ans=0.1 2023-10-11 22:07:29,605 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=847214.6666666666, ans=0.125 2023-10-11 22:07:46,666 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=847308.0, ans=15.0 2023-10-11 22:07:58,751 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.798e+02 1.954e+02 2.171e+02 2.909e+02, threshold=3.908e+02, percent-clipped=0.0 2023-10-11 22:08:09,880 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.90 vs. 
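
Each train.py:1031 entry (like the Epoch 14, batch 4000 line above) prints both a per-batch and a smoothed loss as triples (loss, simple_loss, pruned_loss), and in every entry in this section they satisfy loss ~= 0.5 * simple_loss + pruned_loss (0.5 * 0.2884 + 0.05086 = 0.1950 for the batch above): the cheap full-sum "simple" transducer loss is half-weighted against the pruned RNN-T loss. A sketch of that post-warm-up combination, with the parameter names assumed:

def combine_transducer_losses(simple_loss: float, pruned_loss: float,
                              simple_loss_scale: float = 0.5,
                              pruned_loss_scale: float = 1.0) -> float:
    """Weighting consistent with every loss triple printed in this section."""
    return simple_loss_scale * simple_loss + pruned_loss_scale * pruned_loss

# Reproduces the smoothed tot_loss of the batch-4000 entry above:
assert abs(combine_transducer_losses(0.2868, 0.05395) - 0.1974) < 1e-3
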
limit=15.0 2023-10-11 22:08:27,015 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.83 vs. limit=15.0 2023-10-11 22:08:40,553 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=847541.3333333334, ans=0.125 2023-10-11 22:08:59,169 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.93 vs. limit=10.0 2023-10-11 22:09:50,317 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.705e+02 1.890e+02 2.091e+02 3.406e+02, threshold=3.779e+02, percent-clipped=0.0 2023-10-11 22:09:54,070 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=847821.3333333334, ans=0.1 2023-10-11 22:10:08,833 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=847868.0, ans=0.125 2023-10-11 22:10:11,602 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=847868.0, ans=0.0 2023-10-11 22:10:16,581 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=847868.0, ans=0.95 2023-10-11 22:10:37,989 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 22:10:43,708 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=847961.3333333334, ans=0.0 2023-10-11 22:10:54,684 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=848008.0, ans=0.0 2023-10-11 22:10:54,806 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=848008.0, ans=0.0 2023-10-11 22:11:02,530 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=848054.6666666666, ans=0.125 2023-10-11 22:11:26,037 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=848148.0, ans=0.07 2023-10-11 22:11:28,000 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=848148.0, ans=0.07 2023-10-11 22:11:33,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=848194.6666666666, ans=0.2 2023-10-11 22:11:43,697 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=848194.6666666666, ans=0.125 2023-10-11 22:11:58,018 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.308e+02 1.606e+02 1.787e+02 2.023e+02 2.818e+02, threshold=3.574e+02, percent-clipped=0.0 2023-10-11 22:11:59,695 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.84 vs. 
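
The scaling.py:979 "Whitening" entries are diagnostics from modules that penalize badly conditioned activations: the metric is 1.0 when the channel covariance of a layer's output is a multiple of the identity (fully "white") and approaches num_channels when the activations collapse onto a single direction, and only when it exceeds the printed limit (itself scheduled, cf. the *.whitening_limit ScheduledFloat entries elsewhere in this log) does the module apply a corrective gradient. One plausible formulation of such a metric, computed here for a single group although the log's num_groups shows the real module works per channel group; an illustration, not the exact expression in scaling.py:

import torch

def whitening_metric(x: torch.Tensor) -> torch.Tensor:
    """x: (num_frames, num_channels).  Returns ~1.0 for white (identity-like)
    covariance, up to num_channels for fully collapsed activations."""
    x = x - x.mean(dim=0)
    cov = (x.t() @ x) / x.shape[0]
    return (cov ** 2).mean() * cov.shape[0] / cov.diagonal().mean() ** 2

x = torch.randn(10000, 384)
print(whitening_metric(x))                         # ~1.0: nothing to fix
print(whitening_metric(x[:, :1].expand(-1, 384)))  # ~384.0: far above any limit here
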
limit=15.0 2023-10-11 22:12:27,357 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 22:12:36,398 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.06 vs. limit=15.0 2023-10-11 22:12:38,430 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=848474.6666666666, ans=0.0 2023-10-11 22:12:52,814 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=848521.3333333334, ans=0.125 2023-10-11 22:12:59,168 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=848521.3333333334, ans=0.125 2023-10-11 22:13:12,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=848614.6666666666, ans=0.0 2023-10-11 22:13:31,910 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=848661.3333333334, ans=0.0 2023-10-11 22:13:45,645 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=848754.6666666666, ans=0.0 2023-10-11 22:13:48,523 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.773e+02 1.935e+02 2.158e+02 3.067e+02, threshold=3.870e+02, percent-clipped=0.0 2023-10-11 22:13:54,167 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 22:14:09,178 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.24 vs. limit=6.0 2023-10-11 22:14:10,383 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.27 vs. limit=10.0 2023-10-11 22:14:48,108 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=848988.0, ans=0.125 2023-10-11 22:14:51,985 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=848988.0, ans=0.07 2023-10-11 22:14:59,623 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=849034.6666666666, ans=0.0 2023-10-11 22:15:33,169 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=849128.0, ans=0.125 2023-10-11 22:15:54,226 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.750e+02 1.895e+02 2.116e+02 3.117e+02, threshold=3.791e+02, percent-clipped=0.0 2023-10-11 22:15:55,535 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 22:16:12,694 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=849314.6666666666, ans=0.125 2023-10-11 22:16:35,065 INFO [train.py:1031] (0/4) Epoch 14, batch 4500, loss[loss=0.1904, simple_loss=0.2838, pruned_loss=0.04845, over 16971.00 frames. ], tot_loss[loss=0.1975, simple_loss=0.2873, pruned_loss=0.05384, over 29411757.06 frames. 
], batch size: 82, lr: 2.53e-03, grad_scale: 32.0 2023-10-11 22:16:41,507 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 22:17:00,062 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=849501.3333333334, ans=0.125 2023-10-11 22:17:20,352 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=849594.6666666666, ans=0.0 2023-10-11 22:17:26,199 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=849594.6666666666, ans=0.125 2023-10-11 22:17:27,573 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=849594.6666666666, ans=0.125 2023-10-11 22:17:41,909 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.417e+02 1.694e+02 1.877e+02 2.131e+02 3.021e+02, threshold=3.754e+02, percent-clipped=0.0 2023-10-11 22:17:45,192 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=849688.0, ans=0.2 2023-10-11 22:18:39,501 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=849921.3333333334, ans=0.125 2023-10-11 22:18:43,592 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.17 vs. limit=15.0 2023-10-11 22:18:49,217 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.25 vs. limit=15.0 2023-10-11 22:18:55,139 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.01 vs. 
limit=15.0 2023-10-11 22:19:16,027 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=850108.0, ans=0.0 2023-10-11 22:19:21,726 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=850108.0, ans=10.0 2023-10-11 22:19:21,998 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=850108.0, ans=22.5 2023-10-11 22:19:28,012 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.756e+02 1.950e+02 2.181e+02 2.976e+02, threshold=3.900e+02, percent-clipped=0.0 2023-10-11 22:20:01,027 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=850294.6666666666, ans=0.125 2023-10-11 22:20:09,660 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=850341.3333333334, ans=0.125 2023-10-11 22:20:10,761 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=850341.3333333334, ans=0.1 2023-10-11 22:20:16,551 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=850341.3333333334, ans=0.0 2023-10-11 22:20:43,549 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 22:20:59,377 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=850528.0, ans=0.125 2023-10-11 22:21:10,047 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.74 vs. limit=22.5 2023-10-11 22:21:15,281 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=850621.3333333334, ans=0.04949747468305833 2023-10-11 22:21:16,131 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=850621.3333333334, ans=0.125 2023-10-11 22:21:18,588 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.349e+02 1.738e+02 1.942e+02 2.238e+02 3.168e+02, threshold=3.885e+02, percent-clipped=0.0 2023-10-11 22:21:31,329 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=850668.0, ans=0.125 2023-10-11 22:21:36,328 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=850714.6666666666, ans=0.0 2023-10-11 22:21:43,776 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.81 vs. 
limit=22.5 2023-10-11 22:21:44,432 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=850714.6666666666, ans=0.0 2023-10-11 22:22:04,144 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=850808.0, ans=0.0 2023-10-11 22:22:15,053 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=850854.6666666666, ans=0.125 2023-10-11 22:22:32,685 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=850948.0, ans=0.125 2023-10-11 22:22:50,581 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-11 22:23:08,188 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=851088.0, ans=0.125 2023-10-11 22:23:11,743 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=851088.0, ans=0.0 2023-10-11 22:23:12,217 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.774e+02 1.995e+02 2.118e+02 3.240e+02, threshold=3.989e+02, percent-clipped=0.0 2023-10-11 22:23:16,294 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.83 vs. limit=12.0 2023-10-11 22:23:21,027 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=851134.6666666666, ans=0.07 2023-10-11 22:23:22,930 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=851134.6666666666, ans=0.2 2023-10-11 22:23:23,413 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.35 vs. limit=15.0 2023-10-11 22:23:29,350 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.63 vs. 
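
The scaling.py:1069 "WithLoss" entries report an auxiliary penalty attached directly to the self-attention weight tensors; loss-sum=0.000e+00 throughout this section, so the penalty is currently contributing nothing. A common way to implement this kind of attach-a-loss-to-a-tensor hook is an autograd function that is the identity in the forward pass and injects the penalty's gradient in the backward pass. The sketch below uses a placeholder penalty and assumed names; it illustrates the mechanism, not scaling.py's actual penalty:

import torch

class AttachLoss(torch.autograd.Function):
    """Identity on `x` in forward; adds d(scale * penalty(x))/dx in backward."""

    @staticmethod
    def forward(ctx, x, scale):
        ctx.save_for_backward(x)
        ctx.scale = scale
        return x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        with torch.enable_grad():
            xd = x.detach().requires_grad_(True)
            penalty = ctx.scale * xd.pow(2).mean()   # placeholder penalty
            (g,) = torch.autograd.grad(penalty, xd)
        return grad_output + g, None

attn = torch.rand(4, 8, 8, requires_grad=True)
out = AttachLoss.apply(attn, 0.1)   # training sees the same values, plus a pull
out.sum().backward()
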
limit=22.5 2023-10-11 22:23:51,658 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=851274.6666666666, ans=0.0 2023-10-11 22:24:40,841 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=851461.3333333334, ans=0.1 2023-10-11 22:24:58,321 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=851508.0, ans=0.125 2023-10-11 22:25:07,262 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.352e+02 1.698e+02 1.886e+02 2.130e+02 2.962e+02, threshold=3.771e+02, percent-clipped=0.0 2023-10-11 22:25:24,711 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=851601.3333333334, ans=0.0 2023-10-11 22:25:24,800 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=851601.3333333334, ans=0.0 2023-10-11 22:25:44,707 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=851694.6666666666, ans=0.125 2023-10-11 22:25:48,271 INFO [train.py:1031] (0/4) Epoch 14, batch 5000, loss[loss=0.1968, simple_loss=0.2553, pruned_loss=0.06914, over 12448.00 frames. ], tot_loss[loss=0.1974, simple_loss=0.287, pruned_loss=0.05385, over 30192657.91 frames. ], batch size: 440, lr: 2.53e-03, grad_scale: 32.0 2023-10-11 22:26:01,159 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=851788.0, ans=0.125 2023-10-11 22:26:06,175 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=851788.0, ans=0.1 2023-10-11 22:26:17,347 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 22:26:23,301 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=851881.3333333334, ans=0.125 2023-10-11 22:26:34,976 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=851928.0, ans=0.2 2023-10-11 22:26:37,800 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=851928.0, ans=0.125 2023-10-11 22:26:47,879 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=851974.6666666666, ans=10.0 2023-10-11 22:26:51,522 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.83 vs. limit=10.0 2023-10-11 22:26:57,073 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.782e+02 2.029e+02 2.371e+02 3.765e+02, threshold=4.058e+02, percent-clipped=0.0 2023-10-11 22:27:07,294 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=852068.0, ans=0.1 2023-10-11 22:27:10,538 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.04 vs. 
limit=15.0 2023-10-11 22:27:17,099 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=852114.6666666666, ans=0.1 2023-10-11 22:27:25,032 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=852161.3333333334, ans=0.05 2023-10-11 22:27:25,165 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=852161.3333333334, ans=0.125 2023-10-11 22:27:30,868 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=852161.3333333334, ans=0.2 2023-10-11 22:27:32,423 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.68 vs. limit=6.0 2023-10-11 22:27:35,660 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=852161.3333333334, ans=0.125 2023-10-11 22:28:14,843 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=852301.3333333334, ans=0.0 2023-10-11 22:28:30,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=852394.6666666666, ans=0.0 2023-10-11 22:28:39,553 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=852394.6666666666, ans=0.125 2023-10-11 22:28:55,822 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.749e+02 1.932e+02 2.163e+02 2.902e+02, threshold=3.865e+02, percent-clipped=0.0 2023-10-11 22:28:58,859 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.76 vs. 
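
With tens of thousands of ScheduledFloat entries like the above, the practical way to inspect any one schedule is to scrape (name, batch_count, ans) triples out of the log and plot them per parameter. A small parser matching the exact format of the scaling.py:199 lines in this file (the helper name is ours):

import re

SCHED = re.compile(
    r"ScheduledFloat: name=([^,]+), batch_count=([0-9.]+), ans=(\S+)")

def parse_scheduled_floats(log_text: str):
    """Yield (name, batch_count, value) from scaling.py:199 log entries."""
    for m in SCHED.finditer(log_text):
        yield m.group(1), float(m.group(2)), float(m.group(3))

line = ("ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention."
        "balancer.min_positive, batch_count=852161.3333333334, ans=0.05")
print(next(parse_scheduled_floats(line)))
# ('encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive',
#  852161.3333333334, 0.05)
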
limit=15.0 2023-10-11 22:29:16,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=852581.3333333334, ans=0.125 2023-10-11 22:29:23,217 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=852581.3333333334, ans=0.1 2023-10-11 22:29:36,631 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=852674.6666666666, ans=0.1 2023-10-11 22:29:59,677 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=852768.0, ans=0.125 2023-10-11 22:30:05,198 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=852768.0, ans=0.2 2023-10-11 22:30:08,174 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=852814.6666666666, ans=0.125 2023-10-11 22:30:09,181 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=852814.6666666666, ans=0.1 2023-10-11 22:30:13,170 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=852814.6666666666, ans=0.125 2023-10-11 22:30:16,784 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=852814.6666666666, ans=0.0 2023-10-11 22:30:27,010 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=852861.3333333334, ans=0.0 2023-10-11 22:30:47,088 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=852954.6666666666, ans=0.125 2023-10-11 22:30:49,741 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.352e+02 1.718e+02 1.892e+02 2.094e+02 3.758e+02, threshold=3.784e+02, percent-clipped=0.0 2023-10-11 22:30:52,122 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=852954.6666666666, ans=0.125 2023-10-11 22:31:43,610 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=853188.0, ans=0.2 2023-10-11 22:31:48,223 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-11 22:31:49,153 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=853188.0, ans=0.125 2023-10-11 22:31:55,072 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten.whitening_limit, batch_count=853234.6666666666, ans=22.5 2023-10-11 22:31:56,898 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=853234.6666666666, ans=0.1 2023-10-11 22:32:07,534 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=853281.3333333334, ans=0.0 2023-10-11 22:32:09,025 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=853281.3333333334, ans=0.2 2023-10-11 22:32:27,987 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=853328.0, 
ans=0.125 2023-10-11 22:32:36,019 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=853374.6666666666, ans=0.125 2023-10-11 22:32:39,482 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=853374.6666666666, ans=10.0 2023-10-11 22:32:44,223 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.361e+02 1.688e+02 1.879e+02 2.172e+02 3.902e+02, threshold=3.758e+02, percent-clipped=1.0 2023-10-11 22:33:09,626 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=853514.6666666666, ans=0.0 2023-10-11 22:33:23,406 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=853608.0, ans=0.1 2023-10-11 22:33:28,468 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=853608.0, ans=0.125 2023-10-11 22:33:38,739 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=853654.6666666666, ans=0.125 2023-10-11 22:34:07,059 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.03 vs. limit=22.5 2023-10-11 22:34:09,299 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.76 vs. limit=8.0 2023-10-11 22:34:18,241 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=853841.3333333334, ans=0.125 2023-10-11 22:34:30,915 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=853888.0, ans=0.07 2023-10-11 22:34:31,461 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.366e+02 1.678e+02 1.838e+02 2.112e+02 2.920e+02, threshold=3.676e+02, percent-clipped=0.0 2023-10-11 22:34:31,712 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=853888.0, ans=0.125 2023-10-11 22:34:36,296 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=853888.0, ans=0.2 2023-10-11 22:34:41,188 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=853934.6666666666, ans=0.125 2023-10-11 22:34:41,311 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=853934.6666666666, ans=0.125 2023-10-11 22:34:50,117 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=853981.3333333334, ans=0.1 2023-10-11 22:35:01,761 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=854028.0, ans=0.125 2023-10-11 22:35:02,851 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=854028.0, ans=0.0 2023-10-11 22:35:11,691 INFO [train.py:1031] (0/4) Epoch 14, batch 5500, loss[loss=0.2122, simple_loss=0.3033, pruned_loss=0.06051, over 16577.00 frames. 
], tot_loss[loss=0.1974, simple_loss=0.2869, pruned_loss=0.05396, over 30752667.06 frames. ], batch size: 219, lr: 2.52e-03, grad_scale: 32.0 2023-10-11 22:35:31,854 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.48 vs. limit=15.0 2023-10-11 22:35:32,512 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=854168.0, ans=0.125 2023-10-11 22:35:42,614 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.76 vs. limit=22.5 2023-10-11 22:36:10,298 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=854308.0, ans=0.125 2023-10-11 22:36:11,477 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=854308.0, ans=0.0 2023-10-11 22:36:13,914 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=854308.0, ans=0.0 2023-10-11 22:36:19,723 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.357e+02 1.657e+02 1.796e+02 1.994e+02 2.634e+02, threshold=3.592e+02, percent-clipped=0.0 2023-10-11 22:36:22,751 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=854354.6666666666, ans=0.125 2023-10-11 22:36:25,654 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=854401.3333333334, ans=0.125 2023-10-11 22:36:35,461 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=854401.3333333334, ans=0.2 2023-10-11 22:36:41,118 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=854448.0, ans=0.125 2023-10-11 22:36:44,284 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=854448.0, ans=0.0 2023-10-11 22:36:49,141 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=854494.6666666666, ans=0.125 2023-10-11 22:36:50,004 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=854494.6666666666, ans=0.0 2023-10-11 22:37:16,264 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=854588.0, ans=0.1 2023-10-11 22:37:19,252 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=854588.0, ans=0.125 2023-10-11 22:37:20,180 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=854588.0, ans=0.09899494936611666 2023-10-11 22:37:31,935 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=854634.6666666666, ans=0.125 2023-10-11 22:37:32,056 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.47 vs. 
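
The learning rate printed with these entries decays very slowly at this depth into training (2.53e-03 at batch 4000, 2.52e-03 by the batch-5500 line above), which is the shape of the Eden schedule that icefall pairs with its optimizer in optim.py: flat while batch << lr_batches and epoch << lr_epochs, then decaying roughly as the inverse square root of each. A sketch of the core formula, with this run's constants not restated and the warm-up and batch-size correction factors of newer variants omitted:

def eden_lr(base_lr: float, batch: int, epoch: float,
            lr_batches: float, lr_epochs: float) -> float:
    """Eden: ~flat early on, then ~(batch * epoch)^-0.5 decay overall."""
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor
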
limit=10.0 2023-10-11 22:37:35,590 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=854681.3333333334, ans=0.0 2023-10-11 22:37:40,010 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=854681.3333333334, ans=0.2 2023-10-11 22:37:57,043 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.25 vs. limit=15.0 2023-10-11 22:37:59,309 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=854774.6666666666, ans=0.125 2023-10-11 22:38:11,025 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 1.747e+02 1.902e+02 2.265e+02 3.147e+02, threshold=3.804e+02, percent-clipped=0.0 2023-10-11 22:38:30,459 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=854914.6666666666, ans=0.0 2023-10-11 22:38:36,784 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=854914.6666666666, ans=0.125 2023-10-11 22:38:38,578 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=854914.6666666666, ans=0.125 2023-10-11 22:38:41,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=854961.3333333334, ans=0.125 2023-10-11 22:38:51,436 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=854961.3333333334, ans=0.2 2023-10-11 22:39:06,019 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=855054.6666666666, ans=0.0 2023-10-11 22:39:10,210 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=855054.6666666666, ans=0.2 2023-10-11 22:39:10,213 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=855054.6666666666, ans=0.125 2023-10-11 22:39:18,531 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=855101.3333333334, ans=0.125 2023-10-11 22:39:31,822 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.91 vs. 
limit=15.0 2023-10-11 22:39:39,483 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=855194.6666666666, ans=0.1 2023-10-11 22:39:52,602 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=855241.3333333334, ans=0.125 2023-10-11 22:40:04,376 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.318e+02 1.689e+02 1.843e+02 2.037e+02 2.671e+02, threshold=3.686e+02, percent-clipped=0.0 2023-10-11 22:40:11,759 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=855334.6666666666, ans=0.015 2023-10-11 22:40:33,612 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=855428.0, ans=0.2 2023-10-11 22:40:36,612 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.01 vs. limit=15.0 2023-10-11 22:40:40,306 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=855428.0, ans=0.0 2023-10-11 22:40:52,945 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=855474.6666666666, ans=0.0 2023-10-11 22:40:53,243 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.00 vs. limit=15.0 2023-10-11 22:40:58,396 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=855521.3333333334, ans=0.125 2023-10-11 22:41:01,845 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=855521.3333333334, ans=0.0 2023-10-11 22:41:14,536 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=855568.0, ans=0.1 2023-10-11 22:41:26,345 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=855614.6666666666, ans=0.125 2023-10-11 22:41:33,580 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=855661.3333333334, ans=0.125 2023-10-11 22:41:55,811 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.441e+02 1.697e+02 1.864e+02 2.077e+02 2.805e+02, threshold=3.727e+02, percent-clipped=0.0 2023-10-11 22:42:02,604 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=855801.3333333334, ans=0.2 2023-10-11 22:42:42,218 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_na.min_abs, batch_count=855941.3333333334, ans=0.02 2023-10-11 22:42:54,424 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=855988.0, ans=0.125 2023-10-11 22:43:17,375 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.57 vs. 
limit=15.0 2023-10-11 22:43:28,608 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=856128.0, ans=0.0 2023-10-11 22:43:50,715 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.719e+02 1.856e+02 2.017e+02 2.807e+02, threshold=3.712e+02, percent-clipped=0.0 2023-10-11 22:43:52,114 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=5.45 vs. limit=12.0 2023-10-11 22:44:02,061 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=856268.0, ans=0.0 2023-10-11 22:44:03,004 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=856268.0, ans=0.1 2023-10-11 22:44:19,852 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.38 vs. limit=6.0 2023-10-11 22:44:21,458 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 22:44:28,841 INFO [train.py:1031] (0/4) Epoch 14, batch 6000, loss[loss=0.1968, simple_loss=0.2885, pruned_loss=0.05256, over 16924.00 frames. ], tot_loss[loss=0.1981, simple_loss=0.2874, pruned_loss=0.05442, over 31202399.86 frames. ], batch size: 165, lr: 2.52e-03, grad_scale: 32.0 2023-10-11 22:44:29,492 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.00 vs. limit=6.0 2023-10-11 22:44:33,532 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=856408.0, ans=0.0 2023-10-11 22:44:35,241 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=856408.0, ans=0.125 2023-10-11 22:44:39,259 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.28 vs. limit=15.0 2023-10-11 22:45:12,859 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=856594.6666666666, ans=0.0 2023-10-11 22:45:23,617 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.93 vs. 
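
grad_scale: 32.0 in the train.py:1031 entries is the dynamic loss-scaling factor of mixed-precision (fp16) training; it holds steady at 32.0 across this whole section, i.e. no gradient overflow has forced the scaler to back off. The standard PyTorch pattern behind such a value, as a generic sketch (assuming a CUDA device, as in this run) rather than icefall's training loop:

import torch

model = torch.nn.Linear(80, 500).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=2.5e-3)
scaler = torch.cuda.amp.GradScaler()          # dynamic loss scaling

x = torch.randn(8, 80, device="cuda")
with torch.cuda.amp.autocast():               # forward in reduced precision
    loss = model(x).square().mean()
scaler.scale(loss).backward()                 # scale up so fp16 grads stay finite
scaler.step(optimizer)                        # unscales; skips the step on inf/nan
scaler.update()                               # halve on overflow, grow when stable
print(scaler.get_scale())                     # the number logged as grad_scale
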
limit=15.0 2023-10-11 22:45:38,233 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=856688.0, ans=0.0 2023-10-11 22:45:41,183 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.770e+02 1.974e+02 2.183e+02 2.812e+02, threshold=3.948e+02, percent-clipped=0.0 2023-10-11 22:45:43,079 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=856688.0, ans=0.2 2023-10-11 22:46:00,269 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=856781.3333333334, ans=0.1 2023-10-11 22:46:00,288 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=856781.3333333334, ans=0.0 2023-10-11 22:46:29,033 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=856874.6666666666, ans=0.125 2023-10-11 22:46:48,303 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=856968.0, ans=0.125 2023-10-11 22:46:53,904 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=856968.0, ans=0.1 2023-10-11 22:46:54,892 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=856968.0, ans=0.0 2023-10-11 22:46:55,933 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=856968.0, ans=0.0 2023-10-11 22:46:56,010 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=856968.0, ans=0.125 2023-10-11 22:46:58,853 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=857014.6666666666, ans=0.125 2023-10-11 22:47:01,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=857014.6666666666, ans=0.1 2023-10-11 22:47:12,242 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=857061.3333333334, ans=10.0 2023-10-11 22:47:12,270 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=857061.3333333334, ans=0.0 2023-10-11 22:47:15,043 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=857061.3333333334, ans=0.125 2023-10-11 22:47:24,903 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=857108.0, ans=0.2 2023-10-11 22:47:35,894 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.434e+02 1.775e+02 1.930e+02 2.246e+02 3.746e+02, threshold=3.859e+02, percent-clipped=0.0 2023-10-11 22:47:55,654 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=857248.0, ans=0.2 2023-10-11 22:48:02,914 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=857248.0, ans=0.125 2023-10-11 22:48:16,993 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=857341.3333333334, ans=0.125 2023-10-11 22:48:19,768 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=857341.3333333334, ans=0.0 2023-10-11 22:48:22,815 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=857341.3333333334, ans=0.125 2023-10-11 22:48:43,856 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=857434.6666666666, ans=0.0 2023-10-11 22:49:08,094 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=857528.0, ans=0.0 2023-10-11 22:49:09,208 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.60 vs. limit=22.5 2023-10-11 22:49:25,201 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.674e+02 1.965e+02 2.226e+02 3.549e+02, threshold=3.930e+02, percent-clipped=0.0 2023-10-11 22:49:31,167 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=857668.0, ans=0.125 2023-10-11 22:49:36,620 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.06 vs. limit=15.0 2023-10-11 22:50:22,780 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.61 vs. limit=15.0 2023-10-11 22:50:24,369 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 22:50:29,097 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=857901.3333333334, ans=0.125 2023-10-11 22:50:38,879 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.33 vs. limit=15.0 2023-10-11 22:50:49,548 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=857948.0, ans=0.035 2023-10-11 22:50:53,832 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=857948.0, ans=0.125 2023-10-11 22:50:59,281 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.59 vs. limit=22.5 2023-10-11 22:51:01,238 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.04 vs. 
limit=15.0 2023-10-11 22:51:06,946 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=858041.3333333334, ans=0.0 2023-10-11 22:51:25,171 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.441e+02 1.687e+02 1.887e+02 2.192e+02 4.312e+02, threshold=3.773e+02, percent-clipped=1.0 2023-10-11 22:51:26,351 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=858088.0, ans=0.125 2023-10-11 22:51:49,575 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=858181.3333333334, ans=0.0 2023-10-11 22:51:54,889 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=858181.3333333334, ans=0.0 2023-10-11 22:52:51,279 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=858414.6666666666, ans=0.07 2023-10-11 22:53:19,242 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=858554.6666666666, ans=0.09899494936611666 2023-10-11 22:53:19,282 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=858554.6666666666, ans=0.0 2023-10-11 22:53:21,866 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.351e+02 1.654e+02 1.823e+02 2.129e+02 3.051e+02, threshold=3.646e+02, percent-clipped=0.0 2023-10-11 22:53:45,331 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-184000.pt 2023-10-11 22:53:51,114 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=858648.0, ans=0.125 2023-10-11 22:54:04,778 INFO [train.py:1031] (0/4) Epoch 14, batch 6500, loss[loss=0.237, simple_loss=0.3189, pruned_loss=0.07751, over 16634.00 frames. ], tot_loss[loss=0.1986, simple_loss=0.288, pruned_loss=0.05457, over 31566173.26 frames. ], batch size: 219, lr: 2.51e-03, grad_scale: 32.0 2023-10-11 22:54:07,226 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=858741.3333333334, ans=0.04949747468305833 2023-10-11 22:54:12,199 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.28 vs. limit=15.0 2023-10-11 22:54:26,331 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.64 vs. limit=15.0 2023-10-11 22:54:29,822 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=858788.0, ans=0.125 2023-10-11 22:54:30,782 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=858834.6666666666, ans=0.0 2023-10-11 22:54:30,909 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.31 vs. 
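
The checkpoint.py:75 entry above saves checkpoint-184000.pt into the experiment directory, keyed by the global batch index rather than the epoch, which is what allows a run to resume mid-epoch; the cadence is evidently a fixed number of batches that divides 184000 evenly. A sketch of such a batch-indexed save, with the function and argument names assumed:

from pathlib import Path
from typing import Optional

import torch

def maybe_save_checkpoint(model: torch.nn.Module, optimizer,
                          batch_idx_train: int, save_every_n: int,
                          exp_dir: Path) -> Optional[Path]:
    """Save a batch-indexed checkpoint every `save_every_n` training batches."""
    if batch_idx_train == 0 or batch_idx_train % save_every_n != 0:
        return None
    path = exp_dir / f"checkpoint-{batch_idx_train}.pt"
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "batch_idx_train": batch_idx_train}, path)
    return path
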
limit=15.0 2023-10-11 22:54:59,749 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=858928.0, ans=0.0 2023-10-11 22:55:27,419 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.480e+02 1.773e+02 1.948e+02 2.113e+02 3.177e+02, threshold=3.896e+02, percent-clipped=0.0 2023-10-11 22:55:31,638 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten.whitening_limit, batch_count=859021.3333333334, ans=15.0 2023-10-11 22:55:34,443 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=859068.0, ans=0.05 2023-10-11 22:56:01,535 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=859161.3333333334, ans=0.0 2023-10-11 22:56:01,873 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.50 vs. limit=6.0 2023-10-11 22:56:02,509 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=859161.3333333334, ans=0.125 2023-10-11 22:56:08,606 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=859208.0, ans=0.0 2023-10-11 22:56:15,502 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=859208.0, ans=0.05 2023-10-11 22:56:50,198 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.78 vs. limit=12.0 2023-10-11 22:57:02,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=859441.3333333334, ans=0.2 2023-10-11 22:57:14,798 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.436e+02 1.748e+02 1.938e+02 2.197e+02 3.048e+02, threshold=3.875e+02, percent-clipped=0.0 2023-10-11 22:57:28,471 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=859534.6666666666, ans=0.125 2023-10-11 22:57:50,438 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=859674.6666666666, ans=0.125 2023-10-11 22:58:08,604 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 22:58:18,845 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=859768.0, ans=0.125 2023-10-11 22:58:33,653 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=859814.6666666666, ans=0.1 2023-10-11 22:59:06,427 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.739e+02 1.893e+02 2.190e+02 2.880e+02, threshold=3.785e+02, percent-clipped=0.0 2023-10-11 22:59:09,756 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=859954.6666666666, ans=0.0 2023-10-11 22:59:39,156 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=860094.6666666666, ans=0.1 2023-10-11 22:59:54,178 INFO 
[scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.13 vs. limit=15.0 2023-10-11 22:59:55,258 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=860141.3333333334, ans=0.0 2023-10-11 23:00:00,336 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=860141.3333333334, ans=0.1 2023-10-11 23:00:10,135 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=860188.0, ans=0.0 2023-10-11 23:00:33,833 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=860281.3333333334, ans=0.125 2023-10-11 23:00:47,487 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=860328.0, ans=0.05 2023-10-11 23:00:47,513 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=860328.0, ans=0.0 2023-10-11 23:00:47,559 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=860328.0, ans=0.125 2023-10-11 23:00:50,490 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=860328.0, ans=0.125 2023-10-11 23:00:51,373 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=860328.0, ans=0.125 2023-10-11 23:01:10,481 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=860374.6666666666, ans=0.0 2023-10-11 23:01:14,214 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-11 23:01:16,212 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.00 vs. 
limit=15.0 2023-10-11 23:01:18,795 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.318e+02 1.650e+02 1.822e+02 2.121e+02 2.809e+02, threshold=3.644e+02, percent-clipped=0.0 2023-10-11 23:01:27,968 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 23:01:36,488 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn2.whiten.whitening_limit, batch_count=860514.6666666666, ans=22.5 2023-10-11 23:01:57,395 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=860561.3333333334, ans=0.0 2023-10-11 23:02:09,539 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=860608.0, ans=0.0 2023-10-11 23:02:10,446 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=860654.6666666666, ans=0.0 2023-10-11 23:02:21,295 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=860701.3333333334, ans=0.125 2023-10-11 23:02:35,899 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=860748.0, ans=0.125 2023-10-11 23:02:51,390 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=860794.6666666666, ans=0.0 2023-10-11 23:02:58,697 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=860841.3333333334, ans=0.125 2023-10-11 23:03:01,714 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=860841.3333333334, ans=0.0 2023-10-11 23:03:12,181 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.370e+02 1.732e+02 1.858e+02 2.109e+02 3.102e+02, threshold=3.716e+02, percent-clipped=0.0 2023-10-11 23:03:23,975 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.52 vs. limit=22.5 2023-10-11 23:03:32,520 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.94 vs. limit=22.5 2023-10-11 23:03:42,962 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=861028.0, ans=0.1 2023-10-11 23:03:48,009 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.59 vs. limit=15.0 2023-10-11 23:03:48,411 INFO [train.py:1031] (0/4) Epoch 14, batch 7000, loss[loss=0.2136, simple_loss=0.2713, pruned_loss=0.07797, over 12372.00 frames. ], tot_loss[loss=0.1984, simple_loss=0.2884, pruned_loss=0.05426, over 31870865.42 frames. 
], batch size: 440, lr: 2.51e-03, grad_scale: 16.0 2023-10-11 23:04:10,059 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=861121.3333333334, ans=0.2 2023-10-11 23:04:18,776 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=861168.0, ans=0.04949747468305833 2023-10-11 23:04:21,532 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=861168.0, ans=0.2 2023-10-11 23:04:43,308 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=861261.3333333334, ans=0.1 2023-10-11 23:04:44,312 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=861261.3333333334, ans=0.0 2023-10-11 23:04:57,770 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=861354.6666666666, ans=0.0 2023-10-11 23:04:59,207 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=5.73 vs. limit=15.0 2023-10-11 23:05:03,620 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.24 vs. limit=15.0 2023-10-11 23:05:05,178 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.515e+02 1.762e+02 1.897e+02 2.141e+02 3.248e+02, threshold=3.795e+02, percent-clipped=0.0 2023-10-11 23:05:43,737 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=861541.3333333334, ans=0.0 2023-10-11 23:05:45,132 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=861541.3333333334, ans=0.0 2023-10-11 23:06:10,924 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.80 vs. limit=22.5 2023-10-11 23:06:22,963 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.47 vs. 
limit=6.0 2023-10-11 23:06:38,779 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=861728.0, ans=0.125 2023-10-11 23:06:39,803 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=861774.6666666666, ans=0.125 2023-10-11 23:06:40,730 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=861774.6666666666, ans=0.125 2023-10-11 23:06:55,440 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.max_positive, batch_count=861821.3333333334, ans=0.95 2023-10-11 23:06:58,472 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.446e+02 1.724e+02 1.849e+02 2.098e+02 3.047e+02, threshold=3.698e+02, percent-clipped=0.0 2023-10-11 23:07:04,564 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=861868.0, ans=0.0 2023-10-11 23:07:06,868 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.91 vs. limit=15.0 2023-10-11 23:07:17,827 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=861914.6666666666, ans=0.0 2023-10-11 23:07:50,361 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=862008.0, ans=0.5 2023-10-11 23:08:15,979 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=862101.3333333334, ans=0.125 2023-10-11 23:08:53,327 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=862241.3333333334, ans=0.2 2023-10-11 23:08:58,287 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.99 vs. 
limit=15.0 2023-10-11 23:09:00,337 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=862288.0, ans=0.125 2023-10-11 23:09:02,787 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=862288.0, ans=0.1 2023-10-11 23:09:06,542 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.684e+02 1.859e+02 2.216e+02 3.139e+02, threshold=3.718e+02, percent-clipped=0.0 2023-10-11 23:09:29,634 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=862381.3333333334, ans=0.2 2023-10-11 23:09:30,540 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=862381.3333333334, ans=0.125 2023-10-11 23:09:37,470 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=862428.0, ans=0.1 2023-10-11 23:09:37,554 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=862428.0, ans=0.125 2023-10-11 23:09:41,897 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=862428.0, ans=0.04949747468305833 2023-10-11 23:09:42,049 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.61 vs. limit=15.0 2023-10-11 23:09:50,777 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=862474.6666666666, ans=0.0 2023-10-11 23:10:00,020 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.42 vs. 
limit=10.0 2023-10-11 23:10:02,447 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=862521.3333333334, ans=0.025 2023-10-11 23:10:34,032 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=862661.3333333334, ans=0.125 2023-10-11 23:10:39,768 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=862661.3333333334, ans=0.125 2023-10-11 23:10:53,763 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=862708.0, ans=0.125 2023-10-11 23:10:53,875 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=862708.0, ans=0.2 2023-10-11 23:11:06,194 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.737e+02 1.914e+02 2.173e+02 3.091e+02, threshold=3.829e+02, percent-clipped=0.0 2023-10-11 23:11:17,476 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=862801.3333333334, ans=0.025 2023-10-11 23:11:23,186 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=862848.0, ans=0.0 2023-10-11 23:11:34,795 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=862894.6666666666, ans=0.0 2023-10-11 23:11:55,959 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 23:12:05,551 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=863034.6666666666, ans=0.0 2023-10-11 23:12:09,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=863034.6666666666, ans=0.0 2023-10-11 23:12:13,015 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=863034.6666666666, ans=0.125 2023-10-11 23:12:25,738 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.12 vs. limit=22.5 2023-10-11 23:12:37,431 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.89 vs. limit=22.5 2023-10-11 23:12:42,901 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=863174.6666666666, ans=0.125 2023-10-11 23:12:42,966 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=863174.6666666666, ans=0.0 2023-10-11 23:12:50,137 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.56 vs. 
limit=15.0 2023-10-11 23:12:53,317 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.330e+02 1.835e+02 2.087e+02 2.441e+02 3.560e+02, threshold=4.173e+02, percent-clipped=0.0 2023-10-11 23:12:56,376 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=863268.0, ans=0.1 2023-10-11 23:12:57,308 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=863268.0, ans=0.1 2023-10-11 23:13:02,831 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=863268.0, ans=0.0 2023-10-11 23:13:04,957 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.90 vs. limit=15.0 2023-10-11 23:13:05,703 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=863314.6666666666, ans=0.125 2023-10-11 23:13:07,566 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 23:13:14,847 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=863314.6666666666, ans=0.125 2023-10-11 23:13:23,760 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=863361.3333333334, ans=0.125 2023-10-11 23:13:30,031 INFO [train.py:1031] (0/4) Epoch 14, batch 7500, loss[loss=0.1997, simple_loss=0.2885, pruned_loss=0.0554, over 16692.00 frames. ], tot_loss[loss=0.1982, simple_loss=0.288, pruned_loss=0.05422, over 32069778.91 frames. ], batch size: 202, lr: 2.51e-03, grad_scale: 32.0 2023-10-11 23:13:38,004 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=863408.0, ans=0.1 2023-10-11 23:13:54,544 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=863501.3333333334, ans=0.0 2023-10-11 23:13:56,429 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=863501.3333333334, ans=0.1 2023-10-11 23:14:03,893 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.min_positive, batch_count=863548.0, ans=0.025 2023-10-11 23:14:17,876 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.96 vs. 
limit=6.0 2023-10-11 23:14:26,601 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=863641.3333333334, ans=0.1 2023-10-11 23:14:39,448 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=863688.0, ans=0.125 2023-10-11 23:14:46,404 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.716e+02 1.898e+02 2.028e+02 2.979e+02, threshold=3.796e+02, percent-clipped=0.0 2023-10-11 23:14:57,642 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=863734.6666666666, ans=0.125 2023-10-11 23:15:00,235 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=863734.6666666666, ans=0.1 2023-10-11 23:15:00,367 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.16 vs. limit=22.5 2023-10-11 23:15:14,781 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.03 vs. limit=15.0 2023-10-11 23:15:27,620 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=863874.6666666666, ans=0.2 2023-10-11 23:15:30,415 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_positive, batch_count=863874.6666666666, ans=0.05 2023-10-11 23:15:30,466 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=863874.6666666666, ans=0.125 2023-10-11 23:15:31,343 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=863874.6666666666, ans=0.0 2023-10-11 23:15:35,004 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.19 vs. limit=15.0 2023-10-11 23:15:51,745 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=863968.0, ans=0.0 2023-10-11 23:16:15,263 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-11 23:16:31,739 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=864108.0, ans=0.125 2023-10-11 23:16:51,049 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.397e+02 1.672e+02 1.872e+02 2.094e+02 2.810e+02, threshold=3.744e+02, percent-clipped=0.0 2023-10-11 23:16:52,823 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.22 vs. limit=15.0 2023-10-11 23:16:54,538 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=864201.3333333334, ans=0.125 2023-10-11 23:17:06,014 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.21 vs. limit=15.0 2023-10-11 23:17:14,195 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.74 vs. 
limit=15.0 2023-10-11 23:17:30,974 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=864341.3333333334, ans=0.2 2023-10-11 23:17:33,995 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=864341.3333333334, ans=0.125 2023-10-11 23:17:35,939 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=864341.3333333334, ans=0.125 2023-10-11 23:17:50,270 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=864434.6666666666, ans=0.1 2023-10-11 23:17:59,324 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=864434.6666666666, ans=0.015 2023-10-11 23:18:05,946 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=864481.3333333334, ans=0.125 2023-10-11 23:18:14,333 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=864528.0, ans=0.125 2023-10-11 23:18:19,362 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=5.51 vs. limit=12.0 2023-10-11 23:18:21,427 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=864528.0, ans=0.0 2023-10-11 23:18:26,332 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=864574.6666666666, ans=0.1 2023-10-11 23:18:28,218 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=864574.6666666666, ans=0.125 2023-10-11 23:18:33,944 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.39 vs. limit=15.0 2023-10-11 23:18:35,944 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=864621.3333333334, ans=0.125 2023-10-11 23:18:42,972 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.401e+02 1.681e+02 1.808e+02 1.986e+02 2.443e+02, threshold=3.616e+02, percent-clipped=0.0 2023-10-11 23:18:44,811 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=864621.3333333334, ans=0.0 2023-10-11 23:19:12,519 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.32 vs. 
limit=12.0 2023-10-11 23:19:26,911 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=864808.0, ans=0.0 2023-10-11 23:19:29,971 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=864854.6666666666, ans=0.07 2023-10-11 23:19:40,666 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=864854.6666666666, ans=0.0 2023-10-11 23:20:31,621 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=865088.0, ans=0.125 2023-10-11 23:20:34,369 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=865088.0, ans=0.04949747468305833 2023-10-11 23:20:37,197 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.702e+02 1.899e+02 2.123e+02 2.680e+02, threshold=3.797e+02, percent-clipped=0.0 2023-10-11 23:20:53,478 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=865181.3333333334, ans=0.2 2023-10-11 23:20:53,714 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.36 vs. limit=15.0 2023-10-11 23:20:55,708 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=865181.3333333334, ans=0.2 2023-10-11 23:21:09,870 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.91 vs. limit=15.0 2023-10-11 23:21:14,553 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.31 vs. limit=15.0 2023-10-11 23:21:15,423 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=865274.6666666666, ans=0.0 2023-10-11 23:21:17,990 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=865274.6666666666, ans=0.0 2023-10-11 23:21:37,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=865368.0, ans=0.1 2023-10-11 23:22:03,910 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=865461.3333333334, ans=0.1 2023-10-11 23:22:04,192 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.96 vs. limit=15.0 2023-10-11 23:22:24,093 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=865554.6666666666, ans=0.125 2023-10-11 23:22:32,595 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.344e+02 1.647e+02 1.783e+02 2.028e+02 3.925e+02, threshold=3.565e+02, percent-clipped=1.0 2023-10-11 23:22:37,688 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.81 vs. 
limit=22.5 2023-10-11 23:22:51,055 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=865648.0, ans=0.125 2023-10-11 23:22:54,855 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.38 vs. limit=15.0 2023-10-11 23:23:10,376 INFO [train.py:1031] (0/4) Epoch 14, batch 8000, loss[loss=0.1764, simple_loss=0.2744, pruned_loss=0.03918, over 16800.00 frames. ], tot_loss[loss=0.1974, simple_loss=0.2874, pruned_loss=0.05374, over 32207176.67 frames. ], batch size: 175, lr: 2.50e-03, grad_scale: 32.0 2023-10-11 23:24:02,323 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.32 vs. limit=15.0 2023-10-11 23:24:05,251 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=865974.6666666666, ans=0.125 2023-10-11 23:24:21,499 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.363e+02 1.599e+02 1.762e+02 1.959e+02 2.488e+02, threshold=3.525e+02, percent-clipped=0.0 2023-10-11 23:24:51,276 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.86 vs. limit=15.0 2023-10-11 23:25:01,377 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=866208.0, ans=0.1 2023-10-11 23:25:03,025 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=866208.0, ans=0.125 2023-10-11 23:25:19,753 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.54 vs. limit=15.0 2023-10-11 23:25:33,183 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=866348.0, ans=15.0 2023-10-11 23:26:18,135 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=866488.0, ans=0.0 2023-10-11 23:26:23,526 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.684e+02 1.869e+02 2.126e+02 2.927e+02, threshold=3.738e+02, percent-clipped=0.0 2023-10-11 23:27:09,523 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=866674.6666666666, ans=0.0 2023-10-11 23:27:30,155 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.51 vs. limit=22.5 2023-10-11 23:27:36,393 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=866768.0, ans=0.1 2023-10-11 23:27:57,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=866861.3333333334, ans=0.125 2023-10-11 23:27:58,209 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.61 vs. 
limit=6.0 2023-10-11 23:28:20,538 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=866954.6666666666, ans=0.0 2023-10-11 23:28:24,114 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.464e+02 1.711e+02 1.920e+02 2.196e+02 3.511e+02, threshold=3.839e+02, percent-clipped=0.0 2023-10-11 23:28:33,463 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.46 vs. limit=15.0 2023-10-11 23:28:58,567 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=867094.6666666666, ans=0.125 2023-10-11 23:29:04,383 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=867141.3333333334, ans=0.125 2023-10-11 23:29:20,891 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=867188.0, ans=0.125 2023-10-11 23:29:23,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=867188.0, ans=0.0 2023-10-11 23:29:39,618 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=867281.3333333334, ans=0.0 2023-10-11 23:29:41,326 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=867281.3333333334, ans=0.0 2023-10-11 23:29:44,376 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=867281.3333333334, ans=0.0 2023-10-11 23:30:00,313 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.45 vs. limit=22.5 2023-10-11 23:30:06,778 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=867374.6666666666, ans=0.0 2023-10-11 23:30:21,512 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.770e+02 1.865e+02 2.101e+02 2.625e+02, threshold=3.730e+02, percent-clipped=0.0 2023-10-11 23:30:23,603 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=867468.0, ans=0.125 2023-10-11 23:30:28,446 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.35 vs. 
limit=15.0 2023-10-11 23:30:36,655 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=867514.6666666666, ans=0.0 2023-10-11 23:31:08,223 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=867608.0, ans=0.04949747468305833 2023-10-11 23:32:13,465 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=867888.0, ans=0.125 2023-10-11 23:32:13,468 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=867888.0, ans=0.125 2023-10-11 23:32:17,818 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.295e+02 1.731e+02 1.895e+02 2.142e+02 2.955e+02, threshold=3.791e+02, percent-clipped=0.0 2023-10-11 23:32:52,573 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=868028.0, ans=0.04949747468305833 2023-10-11 23:33:00,230 INFO [train.py:1031] (0/4) Epoch 14, batch 8500, loss[loss=0.2237, simple_loss=0.3111, pruned_loss=0.06818, over 16704.00 frames. ], tot_loss[loss=0.1976, simple_loss=0.2878, pruned_loss=0.05372, over 32343091.99 frames. ], batch size: 202, lr: 2.50e-03, grad_scale: 32.0 2023-10-11 23:33:33,818 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=868214.6666666666, ans=0.125 2023-10-11 23:33:39,729 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=868214.6666666666, ans=0.125 2023-10-11 23:33:41,781 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=868214.6666666666, ans=0.125 2023-10-11 23:34:18,037 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=868354.6666666666, ans=15.0 2023-10-11 23:34:18,276 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.452e+02 1.875e+02 2.114e+02 2.375e+02 3.607e+02, threshold=4.227e+02, percent-clipped=0.0 2023-10-11 23:34:34,895 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=868448.0, ans=0.125 2023-10-11 23:35:05,917 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=868541.3333333334, ans=0.0 2023-10-11 23:35:30,090 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.00 vs. 
limit=6.0 2023-10-11 23:35:48,580 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=868728.0, ans=0.125 2023-10-11 23:36:13,137 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=868821.3333333334, ans=0.0 2023-10-11 23:36:21,447 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.269e+02 1.688e+02 1.873e+02 2.106e+02 2.921e+02, threshold=3.745e+02, percent-clipped=0.0 2023-10-11 23:36:45,837 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=868914.6666666666, ans=0.04949747468305833 2023-10-11 23:36:57,386 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=868961.3333333334, ans=0.125 2023-10-11 23:36:58,667 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=868961.3333333334, ans=0.2 2023-10-11 23:37:00,781 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=868961.3333333334, ans=0.125 2023-10-11 23:37:01,901 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 23:37:20,312 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=869054.6666666666, ans=0.04949747468305833 2023-10-11 23:37:46,404 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=869148.0, ans=0.125 2023-10-11 23:38:06,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=869241.3333333334, ans=0.125 2023-10-11 23:38:08,601 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=869241.3333333334, ans=0.0 2023-10-11 23:38:10,522 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=869241.3333333334, ans=0.0 2023-10-11 23:38:18,220 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=869288.0, ans=0.125 2023-10-11 23:38:18,265 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=869288.0, ans=0.125 2023-10-11 23:38:22,563 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=869288.0, ans=0.125 2023-10-11 23:38:30,643 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.396e+02 1.652e+02 1.746e+02 1.972e+02 3.700e+02, threshold=3.493e+02, percent-clipped=0.0 2023-10-11 23:38:39,128 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=869334.6666666666, ans=0.0 2023-10-11 23:38:39,936 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=869334.6666666666, ans=0.07 2023-10-11 23:38:42,696 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=869334.6666666666, ans=0.0 2023-10-11 23:38:58,938 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=869428.0, ans=0.125 
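The [optim.py:471] records above summarize the gradient-clipping state every few hundred batches: a Clipping_scale, five grad-norm summary statistics (presumably min / 25% / median / 75% / max over recent batches), the resulting clipping threshold, and the fraction of batches clipped. A minimal sketch, assuming the log is saved to a file named train.log (hypothetical name) and that the field layout matches the lines above, for extracting these records so the threshold can be plotted against the quartiles:

import re

# Matches records like:
#   ... INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles
#   1.480e+02 1.773e+02 1.948e+02 2.113e+02 3.177e+02,
#   threshold=3.896e+02, percent-clipped=0.0
RECORD = re.compile(
    r"Clipping_scale=(?P<scale>[\d.]+), grad-norm quartiles "
    r"(?P<q>(?:[\d.]+e[+-]\d+ ?){5}), "
    r"threshold=(?P<thr>[\d.]+e[+-]\d+), "
    r"percent-clipped=(?P<pct>[\d.]+)"
)

def parse_clipping(path="train.log"):  # "train.log" is an assumed file name
    """Yield (quartiles, threshold, percent_clipped) per optim.py record."""
    with open(path) as f:
        for line in f:
            m = RECORD.search(line)
            if m:
                q = [float(x) for x in m.group("q").split()]
                yield q, float(m.group("thr")), float(m.group("pct"))

if __name__ == "__main__":
    for q, thr, pct in parse_clipping():
        # q[0], q[2], q[4] are read here as min, median, max (an inference
        # from the five-value layout, not documented in the log itself).
        print(f"min={q[0]:.0f} median={q[2]:.0f} max={q[4]:.0f} "
              f"threshold={thr:.0f} clipped={pct}%")

Tracking the median against the threshold over time is a quick way to spot grad-norm spikes; in this stretch of the log percent-clipped is almost always 0.0 and occasionally 1.0, so clipping fires only rarely.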
2023-10-11 23:39:01,679 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=869428.0, ans=0.2 2023-10-11 23:39:03,460 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=869428.0, ans=0.125 2023-10-11 23:39:06,259 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=869474.6666666666, ans=0.125 2023-10-11 23:39:07,953 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=869474.6666666666, ans=0.125 2023-10-11 23:39:11,880 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=869474.6666666666, ans=0.2 2023-10-11 23:39:33,930 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.97 vs. limit=15.0 2023-10-11 23:39:37,663 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=869614.6666666666, ans=0.0 2023-10-11 23:39:52,185 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=869661.3333333334, ans=0.125 2023-10-11 23:39:58,528 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=869661.3333333334, ans=0.0 2023-10-11 23:40:00,656 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=869708.0, ans=0.125 2023-10-11 23:40:20,492 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.355e+02 1.683e+02 1.845e+02 2.150e+02 3.471e+02, threshold=3.690e+02, percent-clipped=0.0 2023-10-11 23:40:42,113 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=869848.0, ans=0.1 2023-10-11 23:41:23,256 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=870034.6666666666, ans=0.125 2023-10-11 23:41:27,430 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=870034.6666666666, ans=0.0 2023-10-11 23:41:40,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=870128.0, ans=0.0 2023-10-11 23:41:42,796 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=870128.0, ans=0.125 2023-10-11 23:41:43,676 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=870128.0, ans=0.125 2023-10-11 23:42:03,911 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=870221.3333333334, ans=0.0 2023-10-11 23:42:12,391 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.734e+02 1.885e+02 2.189e+02 3.141e+02, threshold=3.770e+02, percent-clipped=0.0 2023-10-11 23:42:33,423 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=870314.6666666666, ans=0.125 2023-10-11 23:42:34,359 INFO [scaling.py:199] 
(0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=870314.6666666666, ans=0.125 2023-10-11 23:42:39,611 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=870361.3333333334, ans=0.2 2023-10-11 23:42:40,473 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=870361.3333333334, ans=0.035 2023-10-11 23:42:48,022 INFO [train.py:1031] (0/4) Epoch 14, batch 9000, loss[loss=0.201, simple_loss=0.3046, pruned_loss=0.0487, over 16899.00 frames. ], tot_loss[loss=0.1971, simple_loss=0.2872, pruned_loss=0.05347, over 32454321.50 frames. ], batch size: 130, lr: 2.50e-03, grad_scale: 32.0 2023-10-11 23:42:54,141 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=870408.0, ans=0.125 2023-10-11 23:43:07,715 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=870454.6666666666, ans=0.1 2023-10-11 23:43:22,319 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.96 vs. limit=15.0 2023-10-11 23:43:28,040 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=870548.0, ans=0.125 2023-10-11 23:43:54,063 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=870688.0, ans=0.035 2023-10-11 23:43:59,809 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.312e+02 1.687e+02 1.863e+02 2.077e+02 3.074e+02, threshold=3.726e+02, percent-clipped=0.0 2023-10-11 23:44:06,895 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=870734.6666666666, ans=0.05 2023-10-11 23:44:10,396 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.39 vs. limit=22.5 2023-10-11 23:44:16,452 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=870781.3333333334, ans=0.1 2023-10-11 23:44:34,390 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=870874.6666666666, ans=0.0 2023-10-11 23:44:35,072 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=870874.6666666666, ans=0.1 2023-10-11 23:44:50,659 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.35 vs. limit=22.5 2023-10-11 23:44:56,766 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=870968.0, ans=0.0 2023-10-11 23:45:04,116 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.96 vs. 
limit=15.0 2023-10-11 23:45:12,737 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=871014.6666666666, ans=0.0 2023-10-11 23:45:14,819 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=871061.3333333334, ans=0.0 2023-10-11 23:45:15,896 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=871061.3333333334, ans=0.0 2023-10-11 23:45:16,402 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.55 vs. limit=12.0 2023-10-11 23:45:20,884 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=871061.3333333334, ans=0.125 2023-10-11 23:45:38,855 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=871154.6666666666, ans=0.125 2023-10-11 23:45:42,241 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=871154.6666666666, ans=0.025 2023-10-11 23:45:44,977 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.417e+02 1.756e+02 2.016e+02 2.238e+02 3.481e+02, threshold=4.032e+02, percent-clipped=0.0 2023-10-11 23:46:17,465 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=871294.6666666666, ans=0.125 2023-10-11 23:46:25,276 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=871341.3333333334, ans=0.5 2023-10-11 23:46:47,246 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=871434.6666666666, ans=0.125 2023-10-11 23:47:00,335 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.32 vs. limit=22.5 2023-10-11 23:47:18,227 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=871574.6666666666, ans=0.2 2023-10-11 23:47:29,877 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=871621.3333333334, ans=0.0 2023-10-11 23:47:31,387 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.369e+02 1.722e+02 1.893e+02 2.157e+02 3.696e+02, threshold=3.785e+02, percent-clipped=0.0 2023-10-11 23:47:42,140 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-11 23:47:58,366 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.27 vs. limit=15.0 2023-10-11 23:48:07,586 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.70 vs. limit=22.5 2023-10-11 23:48:09,536 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.08 vs. 
limit=15.0 2023-10-11 23:48:10,001 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=871808.0, ans=0.0 2023-10-11 23:48:13,886 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=871808.0, ans=0.125 2023-10-11 23:48:15,485 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.86 vs. limit=15.0 2023-10-11 23:48:19,730 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=871854.6666666666, ans=0.125 2023-10-11 23:49:30,156 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=872088.0, ans=0.125 2023-10-11 23:49:32,345 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.476e+02 1.768e+02 1.901e+02 2.294e+02 3.356e+02, threshold=3.802e+02, percent-clipped=0.0 2023-10-11 23:49:38,586 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=872134.6666666666, ans=0.1 2023-10-11 23:50:13,893 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=872274.6666666666, ans=0.125 2023-10-11 23:50:17,870 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=872274.6666666666, ans=0.125 2023-10-11 23:50:24,945 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.97 vs. limit=22.5 2023-10-11 23:50:28,384 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=872321.3333333334, ans=0.95 2023-10-11 23:50:36,196 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=872368.0, ans=0.125 2023-10-11 23:50:43,798 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=872368.0, ans=0.2 2023-10-11 23:50:44,279 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.03 vs. limit=22.5 2023-10-11 23:50:48,907 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=872414.6666666666, ans=0.125 2023-10-11 23:50:53,439 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=872414.6666666666, ans=0.0 2023-10-11 23:51:30,839 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.574e+02 1.779e+02 1.994e+02 2.183e+02 2.813e+02, threshold=3.988e+02, percent-clipped=0.0 2023-10-11 23:51:35,616 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.42 vs. limit=15.0 2023-10-11 23:51:36,276 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=872601.3333333334, ans=0.0 2023-10-11 23:51:37,755 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.44 vs. 
limit=10.0 2023-10-11 23:51:39,024 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=872601.3333333334, ans=0.025 2023-10-11 23:52:06,088 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=872694.6666666666, ans=0.125 2023-10-11 23:52:10,329 INFO [train.py:1031] (0/4) Epoch 14, batch 9500, loss[loss=0.2056, simple_loss=0.2713, pruned_loss=0.06993, over 12582.00 frames. ], tot_loss[loss=0.1973, simple_loss=0.2875, pruned_loss=0.05356, over 32513378.52 frames. ], batch size: 440, lr: 2.49e-03, grad_scale: 32.0 2023-10-11 23:52:15,236 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=872741.3333333334, ans=0.5 2023-10-11 23:52:42,492 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=872834.6666666666, ans=10.0 2023-10-11 23:52:46,980 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=872881.3333333334, ans=0.0 2023-10-11 23:52:54,608 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=872928.0, ans=0.1 2023-10-11 23:53:23,144 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=873021.3333333334, ans=0.0 2023-10-11 23:53:27,683 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.737e+02 1.883e+02 2.252e+02 3.994e+02, threshold=3.766e+02, percent-clipped=1.0 2023-10-11 23:53:30,207 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=873068.0, ans=0.0 2023-10-11 23:53:35,464 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.75 vs. limit=15.0 2023-10-11 23:53:39,140 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.72 vs. 
limit=12.0 2023-10-11 23:53:52,068 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=873161.3333333334, ans=0.1 2023-10-11 23:53:53,589 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=873161.3333333334, ans=0.0 2023-10-11 23:54:16,821 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=873254.6666666666, ans=0.125 2023-10-11 23:54:18,457 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=873254.6666666666, ans=0.2 2023-10-11 23:55:03,528 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=873441.3333333334, ans=10.0 2023-10-11 23:55:05,727 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=873441.3333333334, ans=0.125 2023-10-11 23:55:10,854 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=873488.0, ans=0.125 2023-10-11 23:55:20,108 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.384e+02 1.754e+02 1.923e+02 2.256e+02 2.725e+02, threshold=3.845e+02, percent-clipped=0.0 2023-10-11 23:56:03,346 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=873674.6666666666, ans=0.0 2023-10-11 23:56:03,686 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.55 vs. limit=15.0 2023-10-11 23:56:05,179 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=873674.6666666666, ans=0.125 2023-10-11 23:56:11,525 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=873721.3333333334, ans=0.0 2023-10-11 23:56:11,530 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=873721.3333333334, ans=0.0 2023-10-11 23:56:36,891 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.05 vs. 
limit=22.5 2023-10-11 23:56:44,899 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=873861.3333333334, ans=0.1 2023-10-11 23:56:46,783 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=873861.3333333334, ans=0.125 2023-10-11 23:56:58,055 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=873908.0, ans=0.0 2023-10-11 23:57:02,572 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=873908.0, ans=0.0 2023-10-11 23:57:13,266 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.397e+02 1.642e+02 1.759e+02 1.905e+02 2.849e+02, threshold=3.517e+02, percent-clipped=0.0 2023-10-11 23:57:23,326 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=874001.3333333334, ans=0.05 2023-10-11 23:57:27,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=874048.0, ans=0.125 2023-10-11 23:57:56,366 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=874141.3333333334, ans=0.5 2023-10-11 23:57:56,483 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.58 vs. limit=22.5 2023-10-11 23:58:01,229 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=874141.3333333334, ans=0.125 2023-10-11 23:58:03,218 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=874141.3333333334, ans=0.1 2023-10-11 23:58:13,648 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=874188.0, ans=10.0 2023-10-11 23:58:41,452 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=874328.0, ans=0.125 2023-10-11 23:59:07,083 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=874421.3333333334, ans=0.0 2023-10-11 23:59:09,440 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.375e+02 1.630e+02 1.756e+02 1.959e+02 2.658e+02, threshold=3.511e+02, percent-clipped=0.0 2023-10-11 23:59:18,802 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=874468.0, ans=0.1 2023-10-11 23:59:26,300 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=874514.6666666666, ans=0.125 2023-10-11 23:59:26,511 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.86 vs. 
limit=15.0 2023-10-11 23:59:29,019 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=874514.6666666666, ans=0.1 2023-10-11 23:59:48,655 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=874608.0, ans=0.125 2023-10-11 23:59:55,733 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=874654.6666666666, ans=0.025 2023-10-11 23:59:58,516 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=874654.6666666666, ans=0.125 2023-10-11 23:59:59,427 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=874654.6666666666, ans=0.1 2023-10-12 00:00:05,065 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=874701.3333333334, ans=0.1 2023-10-12 00:00:12,570 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.96 vs. limit=15.0 2023-10-12 00:00:20,543 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=874748.0, ans=0.0 2023-10-12 00:00:22,623 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=874748.0, ans=0.0 2023-10-12 00:00:42,603 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=874841.3333333334, ans=0.125 2023-10-12 00:00:51,908 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=874888.0, ans=0.1 2023-10-12 00:01:00,820 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.441e+02 1.658e+02 1.799e+02 1.944e+02 3.303e+02, threshold=3.597e+02, percent-clipped=0.0 2023-10-12 00:01:04,881 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.38 vs. limit=15.0 2023-10-12 00:01:06,535 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=874934.6666666666, ans=0.125 2023-10-12 00:01:19,580 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=874981.3333333334, ans=0.0 2023-10-12 00:01:30,118 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.34 vs. limit=22.5 2023-10-12 00:01:30,626 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 00:01:32,352 INFO [train.py:1031] (0/4) Epoch 14, batch 10000, loss[loss=0.2024, simple_loss=0.2904, pruned_loss=0.05719, over 16585.00 frames. ], tot_loss[loss=0.1969, simple_loss=0.2869, pruned_loss=0.05351, over 32563403.54 frames. 
], batch size: 61, lr: 2.49e-03, grad_scale: 32.0 2023-10-12 00:01:32,705 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=875074.6666666666, ans=0.2 2023-10-12 00:01:35,208 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=875074.6666666666, ans=0.07 2023-10-12 00:01:58,673 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=875168.0, ans=0.1 2023-10-12 00:02:09,766 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=875214.6666666666, ans=0.0 2023-10-12 00:02:13,929 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=875261.3333333334, ans=0.1 2023-10-12 00:02:17,557 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=875261.3333333334, ans=0.1 2023-10-12 00:02:19,718 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.whiten.whitening_limit, batch_count=875261.3333333334, ans=12.0 2023-10-12 00:02:20,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=875261.3333333334, ans=0.0 2023-10-12 00:02:20,850 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.15 vs. limit=15.0 2023-10-12 00:02:43,963 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.81 vs. limit=12.0 2023-10-12 00:02:45,229 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.438e+02 1.714e+02 1.868e+02 2.127e+02 2.853e+02, threshold=3.736e+02, percent-clipped=0.0 2023-10-12 00:02:46,420 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=875401.3333333334, ans=0.125 2023-10-12 00:02:49,714 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=875401.3333333334, ans=0.125 2023-10-12 00:02:54,114 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=875401.3333333334, ans=0.125 2023-10-12 00:03:41,610 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=875588.0, ans=0.2 2023-10-12 00:04:07,938 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=875728.0, ans=0.1 2023-10-12 00:04:08,224 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.18 vs. 
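Two things can be read off the train.py progress line above. First, the per-batch loss is consistent with the pruned-transducer combination loss = 0.5·simple_loss + pruned_loss (0.5 × 0.2904 + 0.05719 ≈ 0.2024 for batch 10000); the 0.5 weight is inferred by fitting the logged numbers, not read from code. Second, tot_loss is a decayed, frame-weighted running average: its frame count hovers near 32.6M ≈ 2000 × the ~16k frames of a typical batch, the equilibrium of a sum decayed by (1 − 1/2000) per batch. A quick check of both readings:

```python
def combined_loss(simple_loss: float, pruned_loss: float,
                  simple_loss_scale: float = 0.5) -> float:
    # The 0.5 weight is inferred from the logged numbers; hedged, not
    # authoritative for this run's configuration.
    return simple_loss_scale * simple_loss + pruned_loss

assert abs(combined_loss(0.2904, 0.05719) - 0.2024) < 5e-4  # loss[...] above
assert abs(combined_loss(0.2869, 0.05351) - 0.1969) < 5e-4  # tot_loss[...]

def decayed_update(tot_frames: float, batch_frames: float,
                   decay: float = 1.0 - 1.0 / 2000.0) -> float:
    # Equilibrium is batch_frames / (1 - decay) = 2000 * batch_frames,
    # matching the ~32.6e6 frame counts in the tot_loss[...] entries.
    return decay * tot_frames + batch_frames
```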
limit=15.0 2023-10-12 00:04:10,690 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=875728.0, ans=0.125 2023-10-12 00:04:30,441 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=875821.3333333334, ans=0.125 2023-10-12 00:04:38,143 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.312e+02 1.739e+02 1.911e+02 2.085e+02 2.980e+02, threshold=3.822e+02, percent-clipped=0.0 2023-10-12 00:04:52,180 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=875914.6666666666, ans=0.0 2023-10-12 00:05:08,027 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=876008.0, ans=0.125 2023-10-12 00:05:40,644 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=876101.3333333334, ans=0.035 2023-10-12 00:05:49,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=876148.0, ans=0.0 2023-10-12 00:05:58,209 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=876194.6666666666, ans=0.125 2023-10-12 00:05:59,196 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=876194.6666666666, ans=0.125 2023-10-12 00:06:01,853 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=876194.6666666666, ans=0.125 2023-10-12 00:06:02,882 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=876194.6666666666, ans=0.125 2023-10-12 00:06:03,188 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.50 vs. limit=15.0 2023-10-12 00:06:33,396 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.343e+02 1.735e+02 1.874e+02 1.979e+02 3.129e+02, threshold=3.749e+02, percent-clipped=0.0 2023-10-12 00:06:46,418 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.00 vs. 
limit=15.0 2023-10-12 00:07:01,441 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=876428.0, ans=0.0 2023-10-12 00:07:09,242 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=876474.6666666666, ans=0.2 2023-10-12 00:07:23,944 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=876521.3333333334, ans=0.0 2023-10-12 00:07:25,846 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=876521.3333333334, ans=0.0 2023-10-12 00:07:29,996 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=876521.3333333334, ans=0.125 2023-10-12 00:07:41,693 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=876568.0, ans=0.0 2023-10-12 00:07:52,660 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=876614.6666666666, ans=0.025 2023-10-12 00:07:52,748 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=876614.6666666666, ans=0.125 2023-10-12 00:08:14,886 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=876708.0, ans=0.125 2023-10-12 00:08:18,833 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=876754.6666666666, ans=0.05 2023-10-12 00:08:33,678 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.448e+02 1.702e+02 1.860e+02 2.093e+02 2.658e+02, threshold=3.720e+02, percent-clipped=0.0 2023-10-12 00:08:34,207 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=876801.3333333334, ans=0.2 2023-10-12 00:09:23,384 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=876988.0, ans=0.125 2023-10-12 00:09:26,399 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.94 vs. limit=15.0 2023-10-12 00:09:48,163 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=877081.3333333334, ans=0.1 2023-10-12 00:10:32,610 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.76 vs. limit=6.0 2023-10-12 00:10:36,238 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.375e+02 1.688e+02 1.868e+02 2.213e+02 3.138e+02, threshold=3.736e+02, percent-clipped=0.0 2023-10-12 00:10:41,667 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=877268.0, ans=0.0 2023-10-12 00:10:53,589 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=877314.6666666666, ans=0.1 2023-10-12 00:11:06,917 INFO [train.py:1031] (0/4) Epoch 14, batch 10500, loss[loss=0.1893, simple_loss=0.2826, pruned_loss=0.04799, over 16953.00 frames. ], tot_loss[loss=0.1971, simple_loss=0.2872, pruned_loss=0.05347, over 32626492.22 frames. 
], batch size: 77, lr: 2.49e-03, grad_scale: 16.0 2023-10-12 00:11:09,325 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=877408.0, ans=0.125 2023-10-12 00:11:22,498 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=877454.6666666666, ans=0.09899494936611666 2023-10-12 00:11:23,710 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=877454.6666666666, ans=0.09899494936611666 2023-10-12 00:11:46,064 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=877548.0, ans=0.125 2023-10-12 00:11:52,524 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 00:11:55,935 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.25 vs. limit=10.0 2023-10-12 00:11:57,740 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.78 vs. limit=15.0 2023-10-12 00:12:02,841 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=877641.3333333334, ans=0.0 2023-10-12 00:12:20,262 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=877688.0, ans=0.09899494936611666 2023-10-12 00:12:29,624 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.673e+02 1.835e+02 2.012e+02 2.804e+02, threshold=3.671e+02, percent-clipped=0.0 2023-10-12 00:12:31,143 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=877734.6666666666, ans=0.125 2023-10-12 00:12:39,414 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=877734.6666666666, ans=0.125 2023-10-12 00:12:44,393 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=877781.3333333334, ans=0.125 2023-10-12 00:12:47,278 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.07 vs. limit=15.0 2023-10-12 00:12:57,074 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=877828.0, ans=0.125 2023-10-12 00:13:00,477 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.10 vs. limit=6.0 2023-10-12 00:13:08,790 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=877828.0, ans=0.1 2023-10-12 00:13:16,644 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=877874.6666666666, ans=0.05 2023-10-12 00:13:39,978 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.97 vs. 
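The grad_scale field in these progress lines is the dynamic loss scale of fp16 training: it is halved when a scaled gradient overflows (32.0 at batch 10000 → 16.0 at batch 10500 above) and periodically grown again while updates stay finite (back to 32.0 by batch 11000 below). icefall wraps this in its own optimizer loop, but the mechanism is the standard PyTorch pattern; the sketch below assumes a CUDA device and a toy model:

```python
import torch

model = torch.nn.Linear(80, 500).cuda()           # toy stand-in model
opt = torch.optim.AdamW(model.parameters(), lr=2.49e-3)
scaler = torch.cuda.amp.GradScaler()              # maintains the grad_scale

for _ in range(10):
    x = torch.randn(8, 80, device="cuda")
    with torch.cuda.amp.autocast():
        loss = model(x).float().pow(2).mean()
    opt.zero_grad()
    scaler.scale(loss).backward()   # backward pass on the scaled loss
    scaler.step(opt)                # unscales grads; skips step on inf/nan
    scaler.update()                 # halves scale on overflow, else grows it
    print(scaler.get_scale())       # the value logged here as grad_scale
```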
limit=22.5 2023-10-12 00:14:10,172 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.05 vs. limit=15.0 2023-10-12 00:14:13,848 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=878108.0, ans=0.05 2023-10-12 00:14:20,568 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=878154.6666666666, ans=0.1 2023-10-12 00:14:24,574 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=878154.6666666666, ans=0.125 2023-10-12 00:14:35,833 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.727e+02 1.906e+02 2.221e+02 3.186e+02, threshold=3.813e+02, percent-clipped=0.0 2023-10-12 00:15:16,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=878341.3333333334, ans=0.125 2023-10-12 00:16:03,211 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=878528.0, ans=0.2 2023-10-12 00:16:04,235 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=878528.0, ans=0.025 2023-10-12 00:16:20,663 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=878574.6666666666, ans=0.125 2023-10-12 00:16:29,400 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=878621.3333333334, ans=0.125 2023-10-12 00:16:32,713 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=878621.3333333334, ans=0.2 2023-10-12 00:16:36,073 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.78 vs. limit=6.0 2023-10-12 00:16:38,075 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.709e+02 1.892e+02 2.177e+02 3.803e+02, threshold=3.785e+02, percent-clipped=0.0 2023-10-12 00:16:44,462 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 00:16:46,582 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=878714.6666666666, ans=0.2 2023-10-12 00:17:03,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=878761.3333333334, ans=0.125 2023-10-12 00:17:19,695 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=11.24 vs. limit=22.5 2023-10-12 00:17:22,472 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.84 vs. 
limit=15.0 2023-10-12 00:18:01,451 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=878994.6666666666, ans=0.0 2023-10-12 00:18:05,469 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=879041.3333333334, ans=0.04949747468305833 2023-10-12 00:18:09,330 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=879041.3333333334, ans=0.0 2023-10-12 00:18:32,647 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=879134.6666666666, ans=0.07 2023-10-12 00:18:33,284 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.446e+02 1.755e+02 1.929e+02 2.238e+02 3.251e+02, threshold=3.858e+02, percent-clipped=0.0 2023-10-12 00:18:42,977 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.71 vs. limit=22.5 2023-10-12 00:18:51,055 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=879181.3333333334, ans=0.0 2023-10-12 00:19:04,301 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=879274.6666666666, ans=0.0 2023-10-12 00:19:14,997 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 00:19:16,242 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.57 vs. limit=15.0 2023-10-12 00:19:37,883 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=879414.6666666666, ans=0.0 2023-10-12 00:20:02,957 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=879508.0, ans=0.0 2023-10-12 00:20:06,226 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.17 vs. limit=10.0 2023-10-12 00:20:10,239 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 00:20:14,907 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=879554.6666666666, ans=0.125 2023-10-12 00:20:22,544 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=879601.3333333334, ans=0.125 2023-10-12 00:20:24,731 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.693e+02 1.822e+02 2.046e+02 2.928e+02, threshold=3.644e+02, percent-clipped=0.0 2023-10-12 00:20:36,209 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=879648.0, ans=0.05 2023-10-12 00:20:57,714 INFO [train.py:1031] (0/4) Epoch 14, batch 11000, loss[loss=0.187, simple_loss=0.2496, pruned_loss=0.06227, over 12550.00 frames. ], tot_loss[loss=0.1973, simple_loss=0.2873, pruned_loss=0.0536, over 32676153.17 frames. 
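The Whitening lines report how far a module's activations are from white: scaling.py compares a metric derived from the channel covariance against a (sometimes scheduled) limit, and the whitening penalty pushes the metric back down when it exceeds the limit. A plausible reading of the metric, assuming it measures the dispersion of covariance eigenvalues (exactly 1.0 for perfectly whitened features); the exact formula in scaling.py may differ in detail:

```python
import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    """Sketch: mean of squared covariance eigenvalues divided by the squared
    mean eigenvalue, per channel group.  Equals 1.0 iff the covariance is a
    multiple of the identity; grows as channels become correlated or
    unevenly scaled."""
    n, c = x.shape
    g = c // num_groups
    x = x.reshape(n, num_groups, g).permute(1, 0, 2)    # (groups, n, g)
    cov = x.transpose(1, 2) @ x / n                     # (groups, g, g)
    mean_eig = cov.diagonal(dim1=1, dim2=2).mean(dim=1)  # trace(cov)/g
    mean_eig_sq = (cov ** 2).sum(dim=(1, 2)) / g         # trace(cov^2)/g
    return (mean_eig_sq / mean_eig.pow(2).clamp(min=1e-20)).mean().item()

x = torch.randn(1000, 384)
print(whitening_metric(x))                           # ~1.0: near-white
print(whitening_metric(x @ torch.randn(384, 384)))   # larger: correlated
```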
], batch size: 440, lr: 2.48e-03, grad_scale: 32.0 2023-10-12 00:21:04,521 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=879741.3333333334, ans=0.0 2023-10-12 00:21:15,616 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=879788.0, ans=0.0 2023-10-12 00:21:27,309 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 00:21:29,452 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=2.98 vs. limit=12.0 2023-10-12 00:21:42,064 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=879928.0, ans=0.125 2023-10-12 00:21:49,618 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=879928.0, ans=0.125 2023-10-12 00:22:04,205 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=879974.6666666666, ans=0.0 2023-10-12 00:22:23,298 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=880068.0, ans=0.125 2023-10-12 00:22:23,778 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.872e+02 2.084e+02 2.378e+02 3.452e+02, threshold=4.168e+02, percent-clipped=0.0 2023-10-12 00:22:33,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=880114.6666666666, ans=0.125 2023-10-12 00:22:44,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=880161.3333333334, ans=0.125 2023-10-12 00:22:45,155 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=880161.3333333334, ans=0.0 2023-10-12 00:22:46,146 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=880161.3333333334, ans=0.1 2023-10-12 00:23:50,747 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=880348.0, ans=0.0 2023-10-12 00:23:51,036 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.48 vs. 
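The WithLoss lines attach an auxiliary penalty to attention-weight modules and log its accumulated value; loss-sum=0.000e+00 throughout this stretch indicates the penalty is currently inactive. The shape of the mechanism, heavily simplified (the real scaling.py version differs; names and the quadratic penalty here are illustrative only):

```python
import torch

class WithLossSketch(torch.nn.Module):
    """Illustrative wrapper: pass activations through unchanged while
    accumulating an auxiliary penalty that can be logged as `loss-sum`."""

    def __init__(self, penalty_scale: float = 0.0):
        super().__init__()
        self.penalty_scale = penalty_scale
        self.loss_sum = 0.0

    def forward(self, attn_weights: torch.Tensor) -> torch.Tensor:
        if self.training and self.penalty_scale > 0.0:
            penalty = self.penalty_scale * attn_weights.pow(2).mean()
            self.loss_sum += float(penalty.detach())
            # A real implementation would also attach `penalty` to the
            # autograd graph; with penalty_scale == 0.0 the logged sum
            # stays at 0.000e+00, as in the entries above.
        return attn_weights
```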
limit=12.0 2023-10-12 00:23:58,136 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=880394.6666666666, ans=0.1 2023-10-12 00:24:10,026 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=880441.3333333334, ans=0.0 2023-10-12 00:24:17,489 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=880488.0, ans=0.1 2023-10-12 00:24:25,606 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=880534.6666666666, ans=0.0 2023-10-12 00:24:29,006 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.341e+02 1.680e+02 1.847e+02 2.090e+02 3.430e+02, threshold=3.695e+02, percent-clipped=0.0 2023-10-12 00:24:37,425 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=880581.3333333334, ans=0.0 2023-10-12 00:24:51,498 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=880628.0, ans=0.0 2023-10-12 00:25:14,838 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=880721.3333333334, ans=0.125 2023-10-12 00:25:20,144 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.69 vs. limit=15.0 2023-10-12 00:25:35,893 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=880814.6666666666, ans=0.125 2023-10-12 00:25:42,310 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=880814.6666666666, ans=0.0 2023-10-12 00:25:50,760 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=880861.3333333334, ans=0.0 2023-10-12 00:25:50,894 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=880861.3333333334, ans=0.1 2023-10-12 00:25:51,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=880861.3333333334, ans=0.2 2023-10-12 00:26:11,070 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=880908.0, ans=0.2 2023-10-12 00:26:12,533 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=880954.6666666666, ans=0.2 2023-10-12 00:26:13,848 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=880954.6666666666, ans=0.125 2023-10-12 00:26:30,758 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.451e+02 1.724e+02 1.831e+02 2.113e+02 3.338e+02, threshold=3.662e+02, percent-clipped=0.0 2023-10-12 00:26:41,393 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=881048.0, ans=0.95 2023-10-12 00:26:49,929 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.59 vs. 
limit=5.0 2023-10-12 00:26:54,946 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.50 vs. limit=12.0 2023-10-12 00:27:06,737 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=881141.3333333334, ans=0.125 2023-10-12 00:27:08,580 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=881141.3333333334, ans=0.125 2023-10-12 00:27:24,771 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=881188.0, ans=0.1 2023-10-12 00:27:25,775 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=881188.0, ans=0.1 2023-10-12 00:27:26,878 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=881188.0, ans=0.2 2023-10-12 00:27:29,101 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.49 vs. limit=15.0 2023-10-12 00:27:32,343 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=881234.6666666666, ans=0.05 2023-10-12 00:27:50,032 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=881281.3333333334, ans=0.125 2023-10-12 00:27:53,448 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=881328.0, ans=0.0 2023-10-12 00:28:03,639 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.71 vs. limit=10.0 2023-10-12 00:28:31,870 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.677e+02 1.867e+02 2.089e+02 3.084e+02, threshold=3.734e+02, percent-clipped=0.0 2023-10-12 00:28:44,287 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=881514.6666666666, ans=0.125 2023-10-12 00:28:52,454 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.79 vs. limit=15.0 2023-10-12 00:28:52,584 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.72 vs. limit=15.0 2023-10-12 00:29:02,547 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=881608.0, ans=0.07 2023-10-12 00:29:03,586 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=881608.0, ans=0.125 2023-10-12 00:29:08,254 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.77 vs. 
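The balancer entries expose the knobs of icefall's activation balancers: min_positive/max_positive bound the per-channel fraction of positive activations (e.g. the ans=0.05 entries above), min_abs/max_abs bound the mean absolute value, and prob is the probability that the training-only correction fires on a given batch (the ubiquitous ans=0.125). A toy version of the positivity statistic those bounds constrain:

```python
import torch

def positive_fraction(x: torch.Tensor, channel_dim: int = -1) -> torch.Tensor:
    """Fraction of positive values per channel -- the statistic that
    min_positive/max_positive constrain in the balancer entries."""
    dims = [d for d in range(x.dim()) if d != channel_dim % x.dim()]
    return (x > 0).float().mean(dim=dims)

x = torch.randn(32, 100, 256)        # (batch, time, channels)
frac = positive_fraction(x)          # per-channel, ~0.5 for zero-mean noise
print(frac.min().item(), frac.max().item())
```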
limit=15.0 2023-10-12 00:29:12,455 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=881654.6666666666, ans=0.125 2023-10-12 00:29:27,527 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=881701.3333333334, ans=0.2 2023-10-12 00:29:30,097 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.25 vs. limit=12.0 2023-10-12 00:29:50,493 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=881794.6666666666, ans=0.125 2023-10-12 00:30:20,565 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.28 vs. limit=6.0 2023-10-12 00:30:21,384 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.42 vs. limit=15.0 2023-10-12 00:30:24,451 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.75 vs. limit=6.0 2023-10-12 00:30:27,164 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.797e+02 1.997e+02 2.220e+02 3.073e+02, threshold=3.994e+02, percent-clipped=0.0 2023-10-12 00:30:28,343 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 00:30:50,214 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.94 vs. limit=15.0 2023-10-12 00:30:52,030 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=882028.0, ans=0.125 2023-10-12 00:30:56,697 INFO [train.py:1031] (0/4) Epoch 14, batch 11500, loss[loss=0.2185, simple_loss=0.3102, pruned_loss=0.06344, over 16591.00 frames. ], tot_loss[loss=0.1968, simple_loss=0.2869, pruned_loss=0.05334, over 32712205.42 frames. ], batch size: 266, lr: 2.48e-03, grad_scale: 32.0 2023-10-12 00:30:58,931 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=882074.6666666666, ans=0.125 2023-10-12 00:31:13,080 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.81 vs. 
limit=15.0 2023-10-12 00:31:18,884 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=882168.0, ans=0.0 2023-10-12 00:31:29,373 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=882214.6666666666, ans=0.2 2023-10-12 00:31:41,850 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=882261.3333333334, ans=0.1 2023-10-12 00:31:59,457 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=882308.0, ans=0.125 2023-10-12 00:32:13,051 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=882354.6666666666, ans=0.0 2023-10-12 00:32:13,185 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=882354.6666666666, ans=0.125 2023-10-12 00:32:23,985 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.432e+02 1.746e+02 1.940e+02 2.216e+02 3.510e+02, threshold=3.881e+02, percent-clipped=0.0 2023-10-12 00:32:26,196 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=882401.3333333334, ans=0.95 2023-10-12 00:32:37,724 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=882448.0, ans=0.035 2023-10-12 00:32:37,810 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=882448.0, ans=0.0 2023-10-12 00:32:46,573 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.11 vs. limit=15.0 2023-10-12 00:32:48,356 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=882494.6666666666, ans=0.0 2023-10-12 00:32:56,687 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=882494.6666666666, ans=0.125 2023-10-12 00:33:27,567 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=882634.6666666666, ans=0.125 2023-10-12 00:33:33,519 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.20 vs. limit=15.0 2023-10-12 00:33:40,636 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=882681.3333333334, ans=0.2 2023-10-12 00:33:52,511 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.12 vs. limit=22.5 2023-10-12 00:34:16,243 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.73 vs. 
limit=6.0 2023-10-12 00:34:24,343 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.363e+02 1.612e+02 1.779e+02 1.978e+02 2.703e+02, threshold=3.557e+02, percent-clipped=0.0 2023-10-12 00:34:29,148 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=882868.0, ans=0.0 2023-10-12 00:34:29,234 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=882868.0, ans=0.2 2023-10-12 00:34:29,480 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.73 vs. limit=15.0 2023-10-12 00:34:44,574 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=882961.3333333334, ans=0.2 2023-10-12 00:35:05,350 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=883054.6666666666, ans=0.1 2023-10-12 00:35:54,354 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=883241.3333333334, ans=0.125 2023-10-12 00:36:00,181 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=883241.3333333334, ans=0.125 2023-10-12 00:36:13,253 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=883288.0, ans=0.0 2023-10-12 00:36:16,664 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=883288.0, ans=0.125 2023-10-12 00:36:28,779 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.684e+02 1.864e+02 2.084e+02 2.967e+02, threshold=3.727e+02, percent-clipped=0.0 2023-10-12 00:36:43,698 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=883381.3333333334, ans=0.0 2023-10-12 00:36:48,432 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.05 vs. 
limit=15.0 2023-10-12 00:36:52,273 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=883428.0, ans=0.2 2023-10-12 00:37:19,344 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=883521.3333333334, ans=0.07 2023-10-12 00:37:47,236 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=883614.6666666666, ans=0.07 2023-10-12 00:37:54,354 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=883661.3333333334, ans=0.125 2023-10-12 00:37:56,829 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 00:38:13,407 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=883754.6666666666, ans=0.1 2023-10-12 00:38:21,017 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=883754.6666666666, ans=0.125 2023-10-12 00:38:25,133 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.86 vs. limit=12.0 2023-10-12 00:38:28,348 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.726e+02 1.869e+02 2.062e+02 2.981e+02, threshold=3.738e+02, percent-clipped=0.0 2023-10-12 00:38:49,406 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=883894.6666666666, ans=0.125 2023-10-12 00:38:54,556 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=883894.6666666666, ans=0.1 2023-10-12 00:39:04,706 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=883941.3333333334, ans=0.125 2023-10-12 00:39:08,604 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=883941.3333333334, ans=0.02 2023-10-12 00:39:11,681 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=883941.3333333334, ans=10.0 2023-10-12 00:39:18,650 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.12 vs. limit=10.0 2023-10-12 00:39:19,301 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=883988.0, ans=0.0 2023-10-12 00:39:25,038 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.46 vs. limit=6.0 2023-10-12 00:39:35,634 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=884081.3333333334, ans=0.04949747468305833 2023-10-12 00:39:59,478 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.21 vs. 
limit=15.0 2023-10-12 00:40:00,661 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=884128.0, ans=0.125 2023-10-12 00:40:08,043 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=884174.6666666666, ans=0.125 2023-10-12 00:40:29,349 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.236e+02 1.657e+02 1.815e+02 2.009e+02 2.623e+02, threshold=3.629e+02, percent-clipped=0.0 2023-10-12 00:40:51,598 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=884361.3333333334, ans=0.0 2023-10-12 00:40:56,930 INFO [train.py:1031] (0/4) Epoch 14, batch 12000, loss[loss=0.18, simple_loss=0.2706, pruned_loss=0.04467, over 16041.00 frames. ], tot_loss[loss=0.1964, simple_loss=0.2869, pruned_loss=0.05297, over 32759641.42 frames. ], batch size: 43, lr: 2.48e-03, grad_scale: 16.0 2023-10-12 00:41:06,565 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=884408.0, ans=0.125 2023-10-12 00:41:07,085 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.57 vs. limit=15.0 2023-10-12 00:41:21,031 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=884501.3333333334, ans=0.1 2023-10-12 00:41:25,106 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=884501.3333333334, ans=0.1 2023-10-12 00:42:01,623 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.74 vs. limit=6.0 2023-10-12 00:42:12,884 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=884688.0, ans=0.0 2023-10-12 00:42:14,140 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=884688.0, ans=0.1 2023-10-12 00:42:28,508 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.322e+02 1.634e+02 1.824e+02 2.015e+02 2.957e+02, threshold=3.649e+02, percent-clipped=0.0 2023-10-12 00:42:41,422 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=884781.3333333334, ans=0.125 2023-10-12 00:42:47,689 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=884828.0, ans=0.0 2023-10-12 00:42:53,471 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=884828.0, ans=0.125 2023-10-12 00:43:09,149 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=884921.3333333334, ans=0.1 2023-10-12 00:43:13,488 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=884921.3333333334, ans=0.125 2023-10-12 00:43:34,946 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.54 vs. 
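The slow learning-rate drift in these progress lines (2.49e-03 at batch 10000, 2.48e-03 here, 2.47e-03 by batch 12500 below) is consistent with a polynomially decaying schedule like icefall's Eden, where the rate decays in both the global batch index and the epoch. A sketch of the published Eden formula; the lr_batches/lr_epochs/base_lr values below are illustrative, not taken from this run, so the printed value only lands in the right ballpark:

```python
def eden_lr(base_lr: float, batch: int, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 3.5) -> float:
    # Eden schedule as described in the icefall/zipformer recipes.
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor

# Decays very slowly once batch >> lr_batches, giving the tiny
# per-1000-batch steps seen in the log (~2e-03 for these inputs; exact
# agreement would need this run's actual hyperparameters):
print(eden_lr(0.045, 880_000, 14.0))
```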
limit=15.0 2023-10-12 00:43:43,426 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=885061.3333333334, ans=0.05 2023-10-12 00:44:00,157 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=885108.0, ans=0.0 2023-10-12 00:44:19,775 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=885201.3333333334, ans=10.0 2023-10-12 00:44:20,516 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.344e+02 1.634e+02 1.794e+02 2.044e+02 3.142e+02, threshold=3.587e+02, percent-clipped=0.0 2023-10-12 00:44:50,812 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 00:44:59,270 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.87 vs. limit=15.0 2023-10-12 00:45:03,881 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 00:45:11,936 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=885434.6666666666, ans=0.125 2023-10-12 00:45:21,211 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.74 vs. limit=15.0 2023-10-12 00:45:25,021 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=885481.3333333334, ans=0.1 2023-10-12 00:45:35,698 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.69 vs. 
limit=12.0 2023-10-12 00:45:44,238 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=885574.6666666666, ans=0.0 2023-10-12 00:45:48,638 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=885574.6666666666, ans=0.125 2023-10-12 00:45:54,859 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=885621.3333333334, ans=0.0 2023-10-12 00:45:55,757 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=885621.3333333334, ans=0.125 2023-10-12 00:46:08,088 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 00:46:10,014 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=885668.0, ans=0.2 2023-10-12 00:46:10,661 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.833e+02 2.055e+02 2.426e+02 4.382e+02, threshold=4.110e+02, percent-clipped=3.0 2023-10-12 00:46:14,225 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=885668.0, ans=0.2 2023-10-12 00:46:19,219 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=885714.6666666666, ans=0.0 2023-10-12 00:46:28,446 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=885714.6666666666, ans=0.2 2023-10-12 00:46:34,799 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.78 vs. limit=6.0 2023-10-12 00:46:55,997 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.73 vs. limit=22.5 2023-10-12 00:46:57,558 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=885854.6666666666, ans=0.1 2023-10-12 00:47:05,562 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=885901.3333333334, ans=0.1 2023-10-12 00:47:22,705 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=885948.0, ans=0.07 2023-10-12 00:47:26,765 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.27 vs. 
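The percent-clipped=3.0 entry above is the one window in this section where gradient clipping actually engaged: the max recent grad norm (4.382e+02) exceeded the threshold (4.110e+02 = 2.0 × the 2.055e+02 median), so 3% of recent batches were clipped. Every other Clipping_scale entry here reports percent-clipped=0.0, i.e. clipping is armed but idle.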
limit=6.0 2023-10-12 00:47:35,418 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=886041.3333333334, ans=0.1 2023-10-12 00:47:45,063 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=886041.3333333334, ans=10.0 2023-10-12 00:47:55,761 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=886088.0, ans=0.1 2023-10-12 00:48:07,458 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.390e+02 1.772e+02 1.986e+02 2.211e+02 3.059e+02, threshold=3.972e+02, percent-clipped=0.0 2023-10-12 00:48:15,477 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=886181.3333333334, ans=0.1 2023-10-12 00:48:25,806 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 00:48:31,222 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.63 vs. limit=15.0 2023-10-12 00:48:33,625 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=886228.0, ans=0.125 2023-10-12 00:48:50,463 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=886321.3333333334, ans=0.1 2023-10-12 00:48:51,505 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=886321.3333333334, ans=0.2 2023-10-12 00:49:02,319 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=886368.0, ans=0.0 2023-10-12 00:49:05,734 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=886368.0, ans=0.0 2023-10-12 00:49:06,854 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=886368.0, ans=0.125 2023-10-12 00:49:15,120 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=886414.6666666666, ans=0.0 2023-10-12 00:49:27,182 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=886461.3333333334, ans=0.2 2023-10-12 00:49:59,143 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=886601.3333333334, ans=0.0 2023-10-12 00:50:02,441 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.432e+02 1.792e+02 1.959e+02 2.367e+02 3.427e+02, threshold=3.918e+02, percent-clipped=0.0 2023-10-12 00:50:03,691 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=886601.3333333334, ans=0.0 2023-10-12 00:50:19,747 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.53 vs. limit=6.0 2023-10-12 00:50:23,408 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.64 vs. 
limit=15.0 2023-10-12 00:50:28,913 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=886694.6666666666, ans=0.125 2023-10-12 00:50:32,063 INFO [train.py:1031] (0/4) Epoch 14, batch 12500, loss[loss=0.1981, simple_loss=0.2862, pruned_loss=0.05506, over 15743.00 frames. ], tot_loss[loss=0.1962, simple_loss=0.2866, pruned_loss=0.05295, over 32778613.48 frames. ], batch size: 35, lr: 2.47e-03, grad_scale: 32.0 2023-10-12 00:50:32,741 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.60 vs. limit=15.0 2023-10-12 00:50:47,872 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=886788.0, ans=0.0 2023-10-12 00:50:47,958 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=886788.0, ans=0.07 2023-10-12 00:50:48,039 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=886788.0, ans=0.125 2023-10-12 00:50:59,411 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=886834.6666666666, ans=0.125 2023-10-12 00:51:07,269 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=886881.3333333334, ans=0.125 2023-10-12 00:51:09,107 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=886881.3333333334, ans=0.0 2023-10-12 00:51:11,096 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=5.19 vs. limit=15.0 2023-10-12 00:51:26,205 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=15.0 2023-10-12 00:51:47,918 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.22 vs. limit=15.0 2023-10-12 00:51:50,627 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.48 vs. limit=15.0 2023-10-12 00:51:51,375 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=887068.0, ans=0.0 2023-10-12 00:51:53,923 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.700e+02 1.887e+02 2.071e+02 3.228e+02, threshold=3.775e+02, percent-clipped=0.0 2023-10-12 00:51:56,112 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=887068.0, ans=0.2 2023-10-12 00:51:58,271 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.07 vs. 
limit=10.0 2023-10-12 00:51:59,837 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=887114.6666666666, ans=0.125 2023-10-12 00:52:04,896 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=887114.6666666666, ans=0.0 2023-10-12 00:52:18,557 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=887161.3333333334, ans=0.05 2023-10-12 00:52:32,814 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.11 vs. limit=22.5 2023-10-12 00:52:36,361 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=887254.6666666666, ans=0.2 2023-10-12 00:52:38,347 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=887254.6666666666, ans=0.125 2023-10-12 00:52:44,336 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=887301.3333333334, ans=0.0 2023-10-12 00:52:59,620 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=887348.0, ans=0.125 2023-10-12 00:53:03,936 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.86 vs. limit=15.0 2023-10-12 00:53:18,190 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=887441.3333333334, ans=0.0 2023-10-12 00:53:31,163 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=887441.3333333334, ans=0.125 2023-10-12 00:53:34,189 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=887488.0, ans=0.0 2023-10-12 00:53:50,574 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.22 vs. 
limit=6.0 2023-10-12 00:53:50,831 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.415e+02 1.677e+02 1.838e+02 2.019e+02 2.930e+02, threshold=3.675e+02, percent-clipped=0.0 2023-10-12 00:54:01,270 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=887581.3333333334, ans=0.125 2023-10-12 00:54:03,249 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=887581.3333333334, ans=10.0 2023-10-12 00:54:12,031 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=887628.0, ans=0.125 2023-10-12 00:54:16,466 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=887674.6666666666, ans=0.0 2023-10-12 00:54:16,510 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=887674.6666666666, ans=0.125 2023-10-12 00:54:25,998 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=887721.3333333334, ans=0.05 2023-10-12 00:54:36,902 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=887721.3333333334, ans=0.0 2023-10-12 00:55:28,212 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=887954.6666666666, ans=22.5 2023-10-12 00:55:41,211 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.756e+02 1.960e+02 2.272e+02 3.286e+02, threshold=3.919e+02, percent-clipped=0.0 2023-10-12 00:55:54,572 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 00:56:01,142 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=888094.6666666666, ans=0.0 2023-10-12 00:56:01,200 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=888094.6666666666, ans=0.125 2023-10-12 00:56:01,936 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=888094.6666666666, ans=0.2 2023-10-12 00:56:07,160 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=888141.3333333334, ans=0.125 2023-10-12 00:56:31,320 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=888234.6666666666, ans=0.125 2023-10-12 00:56:52,334 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=888328.0, ans=0.125 2023-10-12 00:57:23,243 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=888421.3333333334, ans=0.1 2023-10-12 00:57:33,246 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.719e+02 1.917e+02 2.094e+02 3.080e+02, threshold=3.835e+02, percent-clipped=0.0 2023-10-12 00:57:43,046 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=888514.6666666666, ans=0.125 2023-10-12 00:58:01,065 INFO [scaling.py:199] 
(0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=888608.0, ans=0.0 2023-10-12 00:58:08,235 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=888608.0, ans=0.0 2023-10-12 00:58:11,361 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=888654.6666666666, ans=0.0 2023-10-12 00:58:13,477 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=888654.6666666666, ans=0.2 2023-10-12 00:58:14,248 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=888654.6666666666, ans=0.0 2023-10-12 00:58:24,296 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=888701.3333333334, ans=0.125 2023-10-12 00:58:54,196 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=888841.3333333334, ans=0.04949747468305833 2023-10-12 00:58:56,276 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.43 vs. limit=15.0 2023-10-12 00:59:02,786 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=888841.3333333334, ans=0.125 2023-10-12 00:59:22,237 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.661e+02 1.820e+02 2.036e+02 2.890e+02, threshold=3.639e+02, percent-clipped=0.0 2023-10-12 00:59:33,273 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=888981.3333333334, ans=0.1 2023-10-12 00:59:33,351 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=888981.3333333334, ans=0.0 2023-10-12 00:59:40,749 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=889028.0, ans=0.0 2023-10-12 00:59:47,263 INFO [train.py:1031] (0/4) Epoch 14, batch 13000, loss[loss=0.2042, simple_loss=0.2934, pruned_loss=0.0575, over 16376.00 frames. ], tot_loss[loss=0.1967, simple_loss=0.2872, pruned_loss=0.05304, over 32803392.04 frames. ], batch size: 50, lr: 2.47e-03, grad_scale: 32.0 2023-10-12 00:59:50,000 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=889074.6666666666, ans=0.0 2023-10-12 01:00:10,074 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=889168.0, ans=0.2 2023-10-12 01:00:25,218 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=889214.6666666666, ans=0.0 2023-10-12 01:00:28,568 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.63 vs. 
limit=15.0 2023-10-12 01:00:32,200 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=889214.6666666666, ans=0.125 2023-10-12 01:00:42,209 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=889261.3333333334, ans=0.1 2023-10-12 01:00:51,009 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=889308.0, ans=0.0 2023-10-12 01:01:15,593 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=889401.3333333334, ans=0.2 2023-10-12 01:01:21,892 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.368e+02 1.683e+02 1.846e+02 2.093e+02 2.698e+02, threshold=3.693e+02, percent-clipped=0.0 2023-10-12 01:01:23,941 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.81 vs. limit=15.0 2023-10-12 01:01:41,878 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.38 vs. limit=6.0 2023-10-12 01:01:55,199 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=889541.3333333334, ans=0.125 2023-10-12 01:02:06,951 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=889588.0, ans=0.125 2023-10-12 01:02:18,200 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=889634.6666666666, ans=0.125 2023-10-12 01:02:20,296 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=889634.6666666666, ans=0.2 2023-10-12 01:02:29,366 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=889681.3333333334, ans=0.1 2023-10-12 01:02:29,439 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=889681.3333333334, ans=0.0 2023-10-12 01:02:37,428 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=889728.0, ans=0.2 2023-10-12 01:02:52,164 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=889774.6666666666, ans=0.125 2023-10-12 01:02:54,592 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=889774.6666666666, ans=0.125 2023-10-12 01:03:04,137 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=889821.3333333334, ans=0.125 2023-10-12 01:03:14,545 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.472e+02 1.688e+02 1.906e+02 2.081e+02 2.924e+02, threshold=3.811e+02, percent-clipped=0.0 2023-10-12 01:03:28,356 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.21 vs. 
limit=15.0 2023-10-12 01:03:38,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=889961.3333333334, ans=0.125 2023-10-12 01:03:43,517 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=889961.3333333334, ans=0.07 2023-10-12 01:03:44,570 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=889961.3333333334, ans=0.1 2023-10-12 01:03:46,341 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.02 vs. limit=15.0 2023-10-12 01:03:52,494 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=890008.0, ans=0.125 2023-10-12 01:04:02,504 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=890054.6666666666, ans=0.1 2023-10-12 01:04:10,083 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=890101.3333333334, ans=0.125 2023-10-12 01:04:34,412 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=890194.6666666666, ans=0.5 2023-10-12 01:04:39,942 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=890194.6666666666, ans=0.125 2023-10-12 01:05:13,170 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.346e+02 1.749e+02 1.973e+02 2.249e+02 3.003e+02, threshold=3.947e+02, percent-clipped=0.0 2023-10-12 01:05:15,411 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=890334.6666666666, ans=0.0 2023-10-12 01:05:27,967 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.43 vs. limit=15.0 2023-10-12 01:05:34,477 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=890428.0, ans=0.1 2023-10-12 01:05:49,354 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=890474.6666666666, ans=0.0 2023-10-12 01:05:55,810 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=890521.3333333334, ans=0.1 2023-10-12 01:06:07,964 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=890568.0, ans=0.0 2023-10-12 01:06:20,683 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.40 vs. 
limit=15.0 2023-10-12 01:06:30,493 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=890661.3333333334, ans=0.2 2023-10-12 01:06:31,856 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=890661.3333333334, ans=0.1 2023-10-12 01:06:31,971 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=890661.3333333334, ans=0.1 2023-10-12 01:06:35,497 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=890708.0, ans=0.125 2023-10-12 01:06:36,463 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=890708.0, ans=0.125 2023-10-12 01:06:39,202 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=890708.0, ans=0.125 2023-10-12 01:06:39,208 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=890708.0, ans=0.1 2023-10-12 01:06:41,888 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=890708.0, ans=0.125 2023-10-12 01:06:53,132 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=890754.6666666666, ans=0.0 2023-10-12 01:07:04,786 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=890801.3333333334, ans=0.1 2023-10-12 01:07:05,430 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.725e+02 1.866e+02 2.068e+02 2.711e+02, threshold=3.733e+02, percent-clipped=0.0 2023-10-12 01:07:09,955 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=890848.0, ans=0.125 2023-10-12 01:07:16,285 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=890848.0, ans=0.2 2023-10-12 01:07:18,242 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=890848.0, ans=0.07 2023-10-12 01:07:30,375 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=890941.3333333334, ans=0.0 2023-10-12 01:07:34,170 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=890941.3333333334, ans=0.125 2023-10-12 01:08:05,917 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=891081.3333333334, ans=0.1 2023-10-12 01:08:13,370 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=891081.3333333334, ans=0.0 2023-10-12 01:08:18,269 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=891128.0, ans=0.0 2023-10-12 01:08:19,472 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=891128.0, ans=0.2 2023-10-12 01:08:24,483 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, 
num_channels=256, metric=12.24 vs. limit=15.0 2023-10-12 01:08:29,908 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=891174.6666666666, ans=0.0 2023-10-12 01:08:31,062 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=891174.6666666666, ans=0.125 2023-10-12 01:08:33,159 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=891174.6666666666, ans=0.125 2023-10-12 01:08:56,976 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.404e+02 1.678e+02 1.829e+02 2.107e+02 2.945e+02, threshold=3.659e+02, percent-clipped=0.0 2023-10-12 01:09:17,474 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=891361.3333333334, ans=0.5 2023-10-12 01:09:19,521 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=891361.3333333334, ans=0.125 2023-10-12 01:09:19,630 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=891361.3333333334, ans=0.125 2023-10-12 01:09:20,620 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=891361.3333333334, ans=0.0 2023-10-12 01:09:23,071 INFO [train.py:1031] (0/4) Epoch 14, batch 13500, loss[loss=0.1694, simple_loss=0.2672, pruned_loss=0.03577, over 16925.00 frames. ], tot_loss[loss=0.1962, simple_loss=0.2867, pruned_loss=0.05285, over 32816872.04 frames. ], batch size: 104, lr: 2.47e-03, grad_scale: 16.0 2023-10-12 01:09:23,504 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.70 vs. limit=15.0 2023-10-12 01:09:34,935 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=891454.6666666666, ans=0.125 2023-10-12 01:09:36,186 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.40 vs. limit=22.5 2023-10-12 01:09:57,316 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.02 vs. limit=15.0 2023-10-12 01:09:58,083 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=891548.0, ans=0.125 2023-10-12 01:10:06,407 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=891548.0, ans=0.07 2023-10-12 01:10:26,279 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=891641.3333333334, ans=0.1 2023-10-12 01:10:52,124 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.396e+02 1.702e+02 1.952e+02 2.238e+02 3.226e+02, threshold=3.905e+02, percent-clipped=0.0 2023-10-12 01:11:03,877 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.14 vs. 
limit=22.5 2023-10-12 01:11:33,412 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=891921.3333333334, ans=0.125 2023-10-12 01:11:41,383 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten.whitening_limit, batch_count=891968.0, ans=15.0 2023-10-12 01:11:57,773 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 01:12:11,780 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/epoch-14.pt 2023-10-12 01:12:41,970 INFO [train.py:1031] (0/4) Epoch 15, batch 0, loss[loss=0.1685, simple_loss=0.259, pruned_loss=0.03904, over 16493.00 frames. ], tot_loss[loss=0.1685, simple_loss=0.259, pruned_loss=0.03904, over 16493.00 frames. ], batch size: 266, lr: 2.38e-03, grad_scale: 32.0 2023-10-12 01:12:41,971 INFO [train.py:1054] (0/4) Computing validation loss 2023-10-12 01:12:50,461 INFO [train.py:1063] (0/4) Epoch 15, validation: loss=0.2176, simple_loss=0.3045, pruned_loss=0.06534, over 1020973.00 frames. 2023-10-12 01:12:50,462 INFO [train.py:1064] (0/4) Maximum memory allocated so far is 17165MB 2023-10-12 01:12:53,169 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=892131.3333333334, ans=0.09899494936611666 2023-10-12 01:12:58,114 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=892131.3333333334, ans=0.125 2023-10-12 01:13:14,715 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.539e+02 1.767e+02 2.003e+02 2.262e+02 3.139e+02, threshold=4.007e+02, percent-clipped=0.0 2023-10-12 01:13:15,908 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=892224.6666666666, ans=0.2 2023-10-12 01:13:36,731 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=892318.0, ans=0.125 2023-10-12 01:13:42,536 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=892318.0, ans=0.125 2023-10-12 01:13:57,184 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.82 vs. 
limit=15.0 2023-10-12 01:14:32,230 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=892551.3333333334, ans=0.125 2023-10-12 01:14:43,766 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=892598.0, ans=0.125 2023-10-12 01:14:51,389 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=892598.0, ans=0.1 2023-10-12 01:14:52,958 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=892598.0, ans=0.125 2023-10-12 01:15:06,857 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.659e+02 1.896e+02 2.141e+02 3.016e+02, threshold=3.793e+02, percent-clipped=0.0 2023-10-12 01:15:17,454 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=892738.0, ans=0.125 2023-10-12 01:15:33,783 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.35 vs. limit=12.0 2023-10-12 01:15:47,975 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=892831.3333333334, ans=0.125 2023-10-12 01:16:03,111 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=892924.6666666666, ans=0.0 2023-10-12 01:16:17,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=892971.3333333334, ans=0.0 2023-10-12 01:16:22,895 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.51 vs. limit=15.0 2023-10-12 01:16:47,110 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=893111.3333333334, ans=0.0 2023-10-12 01:16:47,351 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.23 vs. 
limit=15.0 2023-10-12 01:17:00,066 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.761e+02 1.935e+02 2.147e+02 4.385e+02, threshold=3.870e+02, percent-clipped=1.0 2023-10-12 01:17:03,265 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=893158.0, ans=0.1 2023-10-12 01:17:11,480 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 01:18:02,434 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=893391.3333333334, ans=0.125 2023-10-12 01:18:19,915 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=893438.0, ans=0.1 2023-10-12 01:18:40,342 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=893531.3333333334, ans=0.05 2023-10-12 01:18:42,037 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=893531.3333333334, ans=0.125 2023-10-12 01:18:49,100 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=893578.0, ans=0.125 2023-10-12 01:18:49,902 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=893578.0, ans=0.1 2023-10-12 01:18:57,941 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=893624.6666666666, ans=0.125 2023-10-12 01:18:59,162 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.406e+02 1.692e+02 1.850e+02 2.090e+02 3.456e+02, threshold=3.699e+02, percent-clipped=0.0 2023-10-12 01:19:44,853 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=893811.3333333334, ans=0.2 2023-10-12 01:19:59,909 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=13.15 vs. limit=15.0 2023-10-12 01:20:02,086 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=893904.6666666666, ans=0.125 2023-10-12 01:20:03,066 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=893904.6666666666, ans=0.0 2023-10-12 01:20:06,071 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=893904.6666666666, ans=0.0 2023-10-12 01:20:08,315 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=6.47 vs. 
limit=15.0 2023-10-12 01:20:10,934 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=893951.3333333334, ans=0.125 2023-10-12 01:20:30,400 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=893998.0, ans=0.125 2023-10-12 01:20:35,532 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=894044.6666666666, ans=0.0 2023-10-12 01:20:45,957 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=894091.3333333334, ans=0.1 2023-10-12 01:20:46,479 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.436e+02 1.727e+02 1.933e+02 2.131e+02 3.001e+02, threshold=3.867e+02, percent-clipped=0.0 2023-10-12 01:21:00,704 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.79 vs. limit=10.0 2023-10-12 01:21:05,630 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=894184.6666666666, ans=0.125 2023-10-12 01:21:15,519 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=894184.6666666666, ans=0.0 2023-10-12 01:21:30,019 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.80 vs. limit=12.0 2023-10-12 01:21:31,586 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=894278.0, ans=0.125 2023-10-12 01:21:40,463 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=894324.6666666666, ans=0.05 2023-10-12 01:22:06,812 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=894418.0, ans=0.125 2023-10-12 01:22:14,577 INFO [train.py:1031] (0/4) Epoch 15, batch 500, loss[loss=0.2059, simple_loss=0.2886, pruned_loss=0.06159, over 16114.00 frames. ], tot_loss[loss=0.1977, simple_loss=0.2874, pruned_loss=0.054, over 7267698.50 frames. ], batch size: 43, lr: 2.38e-03, grad_scale: 32.0 2023-10-12 01:22:35,822 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=894558.0, ans=0.0 2023-10-12 01:22:40,707 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.515e+02 1.749e+02 1.950e+02 2.198e+02 3.048e+02, threshold=3.900e+02, percent-clipped=0.0 2023-10-12 01:22:52,954 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=894604.6666666666, ans=0.07 2023-10-12 01:22:53,833 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.max_abs, batch_count=894604.6666666666, ans=10.0 2023-10-12 01:23:01,005 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.46 vs. 
limit=15.0 2023-10-12 01:23:16,164 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=894698.0, ans=0.0 2023-10-12 01:23:30,695 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=894791.3333333334, ans=0.125 2023-10-12 01:23:32,047 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=894791.3333333334, ans=0.0 2023-10-12 01:23:50,164 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=894838.0, ans=0.125 2023-10-12 01:23:54,771 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=894884.6666666666, ans=0.1 2023-10-12 01:24:00,310 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=894884.6666666666, ans=0.125 2023-10-12 01:24:16,460 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=894978.0, ans=0.05 2023-10-12 01:24:20,941 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=894978.0, ans=0.1 2023-10-12 01:24:27,609 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.451e+02 1.771e+02 1.956e+02 2.138e+02 2.981e+02, threshold=3.912e+02, percent-clipped=0.0 2023-10-12 01:24:40,947 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=895071.3333333334, ans=0.1 2023-10-12 01:24:42,643 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=895071.3333333334, ans=0.125 2023-10-12 01:24:46,276 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=895118.0, ans=0.2 2023-10-12 01:25:02,571 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.51 vs. limit=15.0 2023-10-12 01:25:09,663 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.53 vs. 
limit=6.0 2023-10-12 01:25:11,259 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=895211.3333333334, ans=0.125 2023-10-12 01:25:28,159 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=895258.0, ans=0.2 2023-10-12 01:25:31,804 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=895258.0, ans=0.125 2023-10-12 01:25:31,864 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=895258.0, ans=0.125 2023-10-12 01:25:35,311 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=895304.6666666666, ans=0.1 2023-10-12 01:25:38,210 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=895304.6666666666, ans=0.09899494936611666 2023-10-12 01:25:43,450 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=895351.3333333334, ans=0.2 2023-10-12 01:25:49,212 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=895351.3333333334, ans=0.1 2023-10-12 01:25:59,806 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=895398.0, ans=0.125 2023-10-12 01:26:18,306 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 1.850e+02 2.033e+02 2.192e+02 3.175e+02, threshold=4.065e+02, percent-clipped=0.0 2023-10-12 01:26:20,961 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=895491.3333333334, ans=0.1 2023-10-12 01:26:23,257 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.34 vs. limit=15.0 2023-10-12 01:26:36,723 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.37 vs. limit=22.5 2023-10-12 01:26:38,438 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.37 vs. 
limit=6.0 2023-10-12 01:26:39,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=895538.0, ans=0.5 2023-10-12 01:26:42,293 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=895584.6666666666, ans=0.125 2023-10-12 01:26:46,976 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=895584.6666666666, ans=0.125 2023-10-12 01:26:47,048 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=895584.6666666666, ans=0.125 2023-10-12 01:26:48,005 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=895584.6666666666, ans=0.125 2023-10-12 01:26:58,218 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=895631.3333333334, ans=0.05 2023-10-12 01:27:07,794 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=895678.0, ans=10.0 2023-10-12 01:27:11,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=895678.0, ans=0.125 2023-10-12 01:27:27,068 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=895771.3333333334, ans=0.125 2023-10-12 01:27:27,994 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=895771.3333333334, ans=0.0 2023-10-12 01:27:30,773 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=895771.3333333334, ans=10.0 2023-10-12 01:27:58,701 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 01:28:08,416 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=895958.0, ans=0.125 2023-10-12 01:28:09,950 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.690e+02 1.842e+02 2.084e+02 2.627e+02, threshold=3.685e+02, percent-clipped=0.0 2023-10-12 01:28:16,812 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=895958.0, ans=0.125 2023-10-12 01:28:18,067 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-192000.pt 2023-10-12 01:29:15,908 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=896191.3333333334, ans=0.125 2023-10-12 01:29:57,951 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=896378.0, ans=0.125 2023-10-12 01:30:01,270 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.22 vs. 
limit=22.5 2023-10-12 01:30:10,979 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.809e+02 2.060e+02 2.485e+02 3.626e+02, threshold=4.120e+02, percent-clipped=0.0 2023-10-12 01:30:14,238 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.63 vs. limit=6.0 2023-10-12 01:30:14,809 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=896424.6666666666, ans=0.1 2023-10-12 01:30:15,645 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=896424.6666666666, ans=0.125 2023-10-12 01:30:31,825 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.59 vs. limit=6.0 2023-10-12 01:30:46,135 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 01:30:46,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=896564.6666666666, ans=0.2 2023-10-12 01:31:08,762 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=896658.0, ans=0.125 2023-10-12 01:31:09,880 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=896658.0, ans=0.05 2023-10-12 01:31:17,085 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=896704.6666666666, ans=0.1 2023-10-12 01:31:25,349 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=896704.6666666666, ans=0.125 2023-10-12 01:31:34,296 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=15.0 2023-10-12 01:31:35,003 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.52 vs. limit=15.0 2023-10-12 01:31:37,385 INFO [train.py:1031] (0/4) Epoch 15, batch 1000, loss[loss=0.1804, simple_loss=0.2749, pruned_loss=0.04295, over 16588.00 frames. ], tot_loss[loss=0.1985, simple_loss=0.2884, pruned_loss=0.05423, over 12922524.47 frames. 
], batch size: 66, lr: 2.37e-03, grad_scale: 32.0 2023-10-12 01:31:38,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=896798.0, ans=0.125 2023-10-12 01:31:38,602 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=896798.0, ans=0.1 2023-10-12 01:31:45,124 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=896798.0, ans=0.125 2023-10-12 01:31:56,772 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=896844.6666666666, ans=0.1 2023-10-12 01:31:57,511 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=896844.6666666666, ans=0.0 2023-10-12 01:32:01,788 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.287e+02 1.658e+02 1.830e+02 2.117e+02 2.891e+02, threshold=3.659e+02, percent-clipped=0.0 2023-10-12 01:32:07,971 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=896891.3333333334, ans=0.1 2023-10-12 01:32:39,956 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=897031.3333333334, ans=0.125 2023-10-12 01:32:43,840 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=897078.0, ans=0.0 2023-10-12 01:33:03,065 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.36 vs. limit=15.0 2023-10-12 01:33:30,757 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=897264.6666666666, ans=0.015 2023-10-12 01:33:30,866 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=897264.6666666666, ans=0.125 2023-10-12 01:33:31,784 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=897264.6666666666, ans=0.0 2023-10-12 01:33:33,633 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=897264.6666666666, ans=0.125 2023-10-12 01:33:43,615 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=897311.3333333334, ans=0.0 2023-10-12 01:33:56,633 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.350e+02 1.722e+02 1.853e+02 2.064e+02 3.079e+02, threshold=3.707e+02, percent-clipped=0.0 2023-10-12 01:34:00,306 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=897358.0, ans=0.1 2023-10-12 01:34:17,205 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.69 vs. 
limit=15.0 2023-10-12 01:34:20,445 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=897451.3333333334, ans=0.0 2023-10-12 01:34:23,196 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=897451.3333333334, ans=0.0 2023-10-12 01:34:35,211 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=897498.0, ans=0.0 2023-10-12 01:35:06,503 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=897591.3333333334, ans=0.0 2023-10-12 01:35:07,283 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=897591.3333333334, ans=0.1 2023-10-12 01:35:12,048 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=897638.0, ans=0.2 2023-10-12 01:35:15,487 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=897638.0, ans=0.1 2023-10-12 01:35:17,021 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=897638.0, ans=0.125 2023-10-12 01:35:18,958 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=897684.6666666666, ans=0.0 2023-10-12 01:35:56,788 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.11 vs. limit=15.0 2023-10-12 01:35:58,512 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.404e+02 1.678e+02 1.876e+02 2.275e+02 3.331e+02, threshold=3.751e+02, percent-clipped=0.0 2023-10-12 01:36:14,082 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=897871.3333333334, ans=0.125 2023-10-12 01:36:18,388 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=897918.0, ans=0.125 2023-10-12 01:36:23,609 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=897918.0, ans=0.125 2023-10-12 01:36:40,148 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=898011.3333333334, ans=0.125 2023-10-12 01:36:43,075 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=898011.3333333334, ans=0.125 2023-10-12 01:36:46,803 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=898011.3333333334, ans=0.0 2023-10-12 01:36:54,677 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=898058.0, ans=0.125 2023-10-12 01:37:02,986 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=898104.6666666666, ans=0.0 2023-10-12 01:37:04,111 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=898104.6666666666, ans=0.09899494936611666 2023-10-12 01:37:07,019 INFO [scaling.py:1069] (0/4) WithLoss: 
name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 01:37:18,593 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=6.68 vs. limit=12.0 2023-10-12 01:37:21,996 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=898198.0, ans=0.0 2023-10-12 01:37:24,295 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=898198.0, ans=0.125 2023-10-12 01:37:46,025 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.738e+02 2.035e+02 2.186e+02 3.207e+02, threshold=4.070e+02, percent-clipped=0.0 2023-10-12 01:37:59,085 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.87 vs. limit=15.0 2023-10-12 01:38:06,578 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=898384.6666666666, ans=0.1 2023-10-12 01:38:21,932 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=898431.3333333334, ans=0.0 2023-10-12 01:38:30,247 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.94 vs. limit=10.0 2023-10-12 01:38:36,204 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.49 vs. limit=15.0 2023-10-12 01:38:47,110 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=898571.3333333334, ans=0.125 2023-10-12 01:38:57,572 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.01 vs. limit=15.0 2023-10-12 01:39:06,486 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=898618.0, ans=0.0 2023-10-12 01:39:18,095 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=898664.6666666666, ans=0.2 2023-10-12 01:39:30,488 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.19 vs. 
limit=15.0 2023-10-12 01:39:37,014 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.707e+02 1.935e+02 2.195e+02 3.279e+02, threshold=3.869e+02, percent-clipped=0.0 2023-10-12 01:39:40,090 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=898758.0, ans=0.2 2023-10-12 01:39:55,313 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=898851.3333333334, ans=0.1 2023-10-12 01:40:10,902 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=898898.0, ans=0.0 2023-10-12 01:40:24,985 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=898944.6666666666, ans=22.5 2023-10-12 01:40:36,286 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=898991.3333333334, ans=0.0 2023-10-12 01:40:40,082 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=899038.0, ans=0.0 2023-10-12 01:41:04,648 INFO [train.py:1031] (0/4) Epoch 15, batch 1500, loss[loss=0.2113, simple_loss=0.2906, pruned_loss=0.06598, over 16388.00 frames. ], tot_loss[loss=0.1963, simple_loss=0.2863, pruned_loss=0.05315, over 17318156.01 frames. ], batch size: 50, lr: 2.37e-03, grad_scale: 32.0 2023-10-12 01:41:09,653 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=899131.3333333334, ans=0.125 2023-10-12 01:41:09,707 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=899131.3333333334, ans=0.0 2023-10-12 01:41:16,744 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=899178.0, ans=0.0 2023-10-12 01:41:29,954 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.18 vs. 
limit=22.5 2023-10-12 01:41:30,156 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.326e+02 1.757e+02 1.959e+02 2.168e+02 2.857e+02, threshold=3.918e+02, percent-clipped=0.0 2023-10-12 01:41:55,085 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=899318.0, ans=0.1 2023-10-12 01:42:09,321 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=899364.6666666666, ans=0.0 2023-10-12 01:42:40,034 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=899504.6666666666, ans=0.125 2023-10-12 01:43:11,189 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=899644.6666666666, ans=0.125 2023-10-12 01:43:16,220 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=899644.6666666666, ans=0.125 2023-10-12 01:43:26,166 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 01:43:26,927 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.360e+02 1.720e+02 1.876e+02 2.093e+02 3.022e+02, threshold=3.751e+02, percent-clipped=0.0 2023-10-12 01:43:31,603 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=899691.3333333334, ans=0.035 2023-10-12 01:43:31,735 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=899691.3333333334, ans=0.125 2023-10-12 01:43:59,003 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=899784.6666666666, ans=0.0 2023-10-12 01:44:23,666 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=899878.0, ans=0.125 2023-10-12 01:44:46,361 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=899971.3333333334, ans=0.0 2023-10-12 01:44:55,241 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=900018.0, ans=0.2 2023-10-12 01:45:14,977 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=900111.3333333334, ans=0.09899494936611666 2023-10-12 01:45:21,391 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=900158.0, ans=0.125 2023-10-12 01:45:25,590 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.342e+02 1.691e+02 1.868e+02 2.093e+02 3.176e+02, threshold=3.736e+02, percent-clipped=0.0 2023-10-12 01:45:25,872 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=900158.0, ans=0.125 2023-10-12 01:45:25,984 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=900158.0, ans=0.0 2023-10-12 01:45:27,033 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.94 vs. 
limit=6.0 2023-10-12 01:45:32,189 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=900204.6666666666, ans=0.1 2023-10-12 01:45:44,515 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.62 vs. limit=15.0 2023-10-12 01:45:50,157 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.49 vs. limit=6.0 2023-10-12 01:46:18,782 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 01:46:18,798 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=900391.3333333334, ans=0.025 2023-10-12 01:46:26,189 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=900391.3333333334, ans=0.125 2023-10-12 01:46:33,620 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=900438.0, ans=0.125 2023-10-12 01:46:36,318 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=900438.0, ans=0.125 2023-10-12 01:46:41,723 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=900438.0, ans=0.0 2023-10-12 01:46:45,850 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.09 vs. limit=15.0 2023-10-12 01:46:53,181 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=900484.6666666666, ans=0.2 2023-10-12 01:46:57,785 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=900531.3333333334, ans=0.1 2023-10-12 01:47:03,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=900531.3333333334, ans=0.125 2023-10-12 01:47:04,414 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=900531.3333333334, ans=0.125 2023-10-12 01:47:04,527 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=900531.3333333334, ans=0.125 2023-10-12 01:47:16,638 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=900624.6666666666, ans=0.125 2023-10-12 01:47:22,314 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.436e+02 1.742e+02 1.876e+02 1.985e+02 2.725e+02, threshold=3.753e+02, percent-clipped=0.0 2023-10-12 01:47:22,650 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=900624.6666666666, ans=0.125 2023-10-12 01:47:38,060 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=900671.3333333334, ans=0.05 2023-10-12 01:47:39,373 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.74 vs. 
limit=22.5 2023-10-12 01:47:42,918 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=900718.0, ans=0.125 2023-10-12 01:47:42,930 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=900718.0, ans=0.125 2023-10-12 01:47:52,869 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=900764.6666666666, ans=0.0 2023-10-12 01:47:54,794 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=900764.6666666666, ans=0.0 2023-10-12 01:48:02,821 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.47 vs. limit=12.0 2023-10-12 01:48:08,043 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=900811.3333333334, ans=0.125 2023-10-12 01:48:13,874 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.53 vs. limit=15.0 2023-10-12 01:48:29,792 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=900904.6666666666, ans=0.125 2023-10-12 01:48:43,583 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=900951.3333333334, ans=0.0 2023-10-12 01:48:44,722 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=900951.3333333334, ans=0.125 2023-10-12 01:48:55,859 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=900998.0, ans=0.025 2023-10-12 01:49:14,530 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.864e+02 2.103e+02 2.403e+02 3.132e+02, threshold=4.206e+02, percent-clipped=0.0 2023-10-12 01:49:15,874 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=901091.3333333334, ans=0.04949747468305833 2023-10-12 01:49:42,356 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=901184.6666666666, ans=0.125 2023-10-12 01:50:23,963 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=901324.6666666666, ans=0.125 2023-10-12 01:50:31,065 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=901371.3333333334, ans=0.1 2023-10-12 01:50:33,934 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.76 vs. limit=15.0 2023-10-12 01:50:45,179 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=901418.0, ans=0.125 2023-10-12 01:50:52,240 INFO [train.py:1031] (0/4) Epoch 15, batch 2000, loss[loss=0.1911, simple_loss=0.2896, pruned_loss=0.04631, over 16928.00 frames. ], tot_loss[loss=0.1966, simple_loss=0.2869, pruned_loss=0.05317, over 20750645.23 frames. 
], batch size: 93, lr: 2.37e-03, grad_scale: 32.0 2023-10-12 01:51:21,769 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.339e+02 1.714e+02 1.842e+02 2.115e+02 2.780e+02, threshold=3.684e+02, percent-clipped=0.0 2023-10-12 01:51:34,384 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=901604.6666666666, ans=0.0 2023-10-12 01:51:38,156 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=901604.6666666666, ans=0.0 2023-10-12 01:51:58,913 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.24 vs. limit=22.5 2023-10-12 01:52:24,578 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=901791.3333333334, ans=0.0 2023-10-12 01:52:26,181 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=901791.3333333334, ans=0.0 2023-10-12 01:53:44,755 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.339e+02 1.644e+02 1.803e+02 2.040e+02 3.306e+02, threshold=3.606e+02, percent-clipped=0.0 2023-10-12 01:54:09,597 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=902118.0, ans=0.035 2023-10-12 01:54:20,664 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=902164.6666666666, ans=0.125 2023-10-12 01:54:20,806 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=902164.6666666666, ans=0.0 2023-10-12 01:54:24,745 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 01:54:30,568 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.37 vs. 
limit=22.5 2023-10-12 01:54:41,477 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=902258.0, ans=0.125 2023-10-12 01:54:42,196 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=902258.0, ans=0.04949747468305833 2023-10-12 01:55:00,373 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=902304.6666666666, ans=0.125 2023-10-12 01:55:03,606 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=902351.3333333334, ans=0.0 2023-10-12 01:55:18,084 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=902398.0, ans=0.1 2023-10-12 01:55:33,752 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=902444.6666666666, ans=0.0 2023-10-12 01:55:40,998 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.502e+02 1.780e+02 1.980e+02 2.282e+02 3.090e+02, threshold=3.960e+02, percent-clipped=0.0 2023-10-12 01:55:57,268 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=902538.0, ans=0.125 2023-10-12 01:55:59,897 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=902584.6666666666, ans=0.125 2023-10-12 01:56:01,126 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.79 vs. limit=15.0 2023-10-12 01:56:12,572 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=902631.3333333334, ans=0.125 2023-10-12 01:56:15,657 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=902631.3333333334, ans=0.95 2023-10-12 01:56:31,340 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=902724.6666666666, ans=0.2 2023-10-12 01:57:01,265 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=902818.0, ans=0.0 2023-10-12 01:57:02,278 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=902864.6666666666, ans=0.0 2023-10-12 01:57:04,951 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.07 vs. 
limit=15.0 2023-10-12 01:57:30,341 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.510e+02 1.751e+02 1.962e+02 2.180e+02 3.937e+02, threshold=3.925e+02, percent-clipped=0.0 2023-10-12 01:57:42,351 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=903004.6666666666, ans=0.2 2023-10-12 01:58:00,977 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=903098.0, ans=0.1 2023-10-12 01:58:25,745 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=903191.3333333334, ans=0.125 2023-10-12 01:58:26,744 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=903191.3333333334, ans=0.125 2023-10-12 01:58:40,615 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=903238.0, ans=0.1 2023-10-12 01:58:46,175 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.95 vs. limit=15.0 2023-10-12 01:58:47,609 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=11.14 vs. limit=15.0 2023-10-12 01:58:52,591 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=903284.6666666666, ans=0.125 2023-10-12 01:59:08,411 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=903378.0, ans=0.2 2023-10-12 01:59:22,227 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=903424.6666666666, ans=0.07 2023-10-12 01:59:23,828 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 1.747e+02 1.925e+02 2.113e+02 2.789e+02, threshold=3.850e+02, percent-clipped=0.0 2023-10-12 01:59:31,787 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=903471.3333333334, ans=0.125 2023-10-12 01:59:36,756 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=903471.3333333334, ans=0.125 2023-10-12 01:59:38,636 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=903471.3333333334, ans=0.125 2023-10-12 01:59:42,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=903518.0, ans=0.1 2023-10-12 01:59:42,609 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.17 vs. 
limit=12.0 2023-10-12 01:59:45,749 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=903518.0, ans=0.0 2023-10-12 01:59:46,712 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 02:00:03,213 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=903611.3333333334, ans=0.5 2023-10-12 02:00:14,905 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=903658.0, ans=0.0 2023-10-12 02:00:23,465 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=903704.6666666666, ans=0.0 2023-10-12 02:00:23,895 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.14 vs. limit=12.0 2023-10-12 02:00:24,590 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=903704.6666666666, ans=0.1 2023-10-12 02:00:31,632 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.57 vs. limit=12.0 2023-10-12 02:00:34,071 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=903751.3333333334, ans=0.125 2023-10-12 02:00:34,103 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=903751.3333333334, ans=0.125 2023-10-12 02:00:45,832 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=903751.3333333334, ans=0.1 2023-10-12 02:00:48,965 INFO [train.py:1031] (0/4) Epoch 15, batch 2500, loss[loss=0.2049, simple_loss=0.2986, pruned_loss=0.05554, over 16875.00 frames. ], tot_loss[loss=0.1969, simple_loss=0.287, pruned_loss=0.05334, over 23437042.46 frames. ], batch size: 110, lr: 2.36e-03, grad_scale: 32.0 2023-10-12 02:00:49,394 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=903798.0, ans=0.125 2023-10-12 02:00:56,277 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=903798.0, ans=0.125 2023-10-12 02:01:14,022 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.720e+02 1.887e+02 2.148e+02 2.728e+02, threshold=3.774e+02, percent-clipped=0.0 2023-10-12 02:01:25,286 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.40 vs. limit=22.5 2023-10-12 02:01:29,974 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.49 vs. 
limit=10.0 2023-10-12 02:01:46,513 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=904031.3333333334, ans=0.5 2023-10-12 02:01:59,308 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=904078.0, ans=0.125 2023-10-12 02:01:59,361 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=904078.0, ans=0.125 2023-10-12 02:02:19,177 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=904171.3333333334, ans=0.125 2023-10-12 02:02:21,381 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=904171.3333333334, ans=0.0 2023-10-12 02:02:37,606 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=904264.6666666666, ans=0.125 2023-10-12 02:02:38,466 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=904264.6666666666, ans=0.2 2023-10-12 02:02:41,506 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=904264.6666666666, ans=0.125 2023-10-12 02:02:51,155 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=904311.3333333334, ans=0.0 2023-10-12 02:02:53,971 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=904311.3333333334, ans=0.0 2023-10-12 02:03:00,058 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=904358.0, ans=0.2 2023-10-12 02:03:01,331 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.09 vs. limit=15.0 2023-10-12 02:03:06,600 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.344e+02 1.753e+02 1.966e+02 2.340e+02 3.159e+02, threshold=3.932e+02, percent-clipped=0.0 2023-10-12 02:03:25,477 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.52 vs. limit=22.5 2023-10-12 02:03:33,089 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=904498.0, ans=0.0 2023-10-12 02:03:54,844 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=904591.3333333334, ans=0.1 2023-10-12 02:03:59,080 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=904591.3333333334, ans=0.09899494936611666 2023-10-12 02:04:08,395 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.92 vs. 
limit=15.0 2023-10-12 02:04:10,363 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-12 02:04:26,207 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=904731.3333333334, ans=0.1 2023-10-12 02:04:26,333 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=904731.3333333334, ans=0.1 2023-10-12 02:04:31,922 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=904731.3333333334, ans=0.125 2023-10-12 02:04:59,080 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=904824.6666666666, ans=0.125 2023-10-12 02:04:59,655 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.440e+02 1.675e+02 1.855e+02 2.029e+02 2.893e+02, threshold=3.710e+02, percent-clipped=0.0 2023-10-12 02:05:03,064 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.69 vs. limit=6.0 2023-10-12 02:05:11,921 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=904871.3333333334, ans=0.0 2023-10-12 02:05:16,890 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=904918.0, ans=0.125 2023-10-12 02:05:38,348 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=904964.6666666666, ans=0.125 2023-10-12 02:05:49,837 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=905011.3333333334, ans=0.0 2023-10-12 02:05:54,590 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=5.30 vs. limit=15.0 2023-10-12 02:06:19,992 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.26 vs. limit=15.0 2023-10-12 02:06:23,804 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.36 vs. 
limit=6.0 2023-10-12 02:06:33,893 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=905198.0, ans=0.125 2023-10-12 02:06:42,410 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=905244.6666666666, ans=0.125 2023-10-12 02:07:00,269 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.349e+02 1.628e+02 1.792e+02 1.996e+02 3.622e+02, threshold=3.584e+02, percent-clipped=0.0 2023-10-12 02:07:21,528 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=905384.6666666666, ans=0.1 2023-10-12 02:07:27,500 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=905384.6666666666, ans=0.125 2023-10-12 02:07:29,958 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=905384.6666666666, ans=0.0 2023-10-12 02:08:27,337 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=905618.0, ans=0.125 2023-10-12 02:08:43,262 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=905664.6666666666, ans=0.035 2023-10-12 02:08:57,289 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=905758.0, ans=0.0 2023-10-12 02:09:03,543 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=905758.0, ans=0.0 2023-10-12 02:09:05,937 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.406e+02 1.748e+02 1.913e+02 2.184e+02 3.105e+02, threshold=3.826e+02, percent-clipped=0.0 2023-10-12 02:09:24,954 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.48 vs. limit=15.0 2023-10-12 02:09:34,282 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.09 vs. limit=10.0 2023-10-12 02:09:39,417 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.15 vs. limit=10.0 2023-10-12 02:10:07,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=906038.0, ans=0.0 2023-10-12 02:10:08,901 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.42 vs. limit=22.5 2023-10-12 02:10:10,628 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=906038.0, ans=0.1 2023-10-12 02:10:24,510 INFO [train.py:1031] (0/4) Epoch 15, batch 3000, loss[loss=0.2325, simple_loss=0.2987, pruned_loss=0.08319, over 15745.00 frames. ], tot_loss[loss=0.1963, simple_loss=0.2862, pruned_loss=0.05327, over 25512153.14 frames. 
], batch size: 350, lr: 2.36e-03, grad_scale: 16.0 2023-10-12 02:10:35,853 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 02:10:42,512 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.55 vs. limit=10.0 2023-10-12 02:10:51,263 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=906224.6666666666, ans=0.2 2023-10-12 02:10:53,867 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.740e+02 1.984e+02 2.216e+02 3.623e+02, threshold=3.968e+02, percent-clipped=0.0 2023-10-12 02:10:58,796 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=906271.3333333334, ans=0.125 2023-10-12 02:11:01,695 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=906271.3333333334, ans=0.1 2023-10-12 02:11:02,974 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.04 vs. limit=6.0 2023-10-12 02:11:03,123 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.14 vs. limit=15.0 2023-10-12 02:11:32,103 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=906411.3333333334, ans=0.125 2023-10-12 02:11:43,562 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=906411.3333333334, ans=0.125 2023-10-12 02:11:51,500 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.00 vs. limit=22.5 2023-10-12 02:12:37,040 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=906644.6666666666, ans=0.1 2023-10-12 02:12:48,165 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=906691.3333333334, ans=0.0 2023-10-12 02:12:54,496 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.564e+02 1.816e+02 1.994e+02 2.299e+02 3.375e+02, threshold=3.988e+02, percent-clipped=0.0 2023-10-12 02:12:55,818 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=906691.3333333334, ans=0.125 2023-10-12 02:13:06,682 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=906738.0, ans=0.125 2023-10-12 02:13:26,242 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=906831.3333333334, ans=0.0 2023-10-12 02:13:30,402 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.31 vs. 
limit=10.0 2023-10-12 02:13:46,839 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=906924.6666666666, ans=0.025 2023-10-12 02:13:48,430 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=906924.6666666666, ans=0.125 2023-10-12 02:13:52,550 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.25 vs. limit=15.0 2023-10-12 02:13:58,304 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=906971.3333333334, ans=0.125 2023-10-12 02:14:09,778 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=907018.0, ans=0.125 2023-10-12 02:14:23,839 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=907111.3333333334, ans=0.125 2023-10-12 02:14:25,742 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=907111.3333333334, ans=0.125 2023-10-12 02:14:50,247 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.373e+02 1.660e+02 1.817e+02 2.016e+02 3.077e+02, threshold=3.634e+02, percent-clipped=0.0 2023-10-12 02:15:06,434 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.19 vs. limit=22.5 2023-10-12 02:15:08,818 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.21 vs. 
limit=15.0 2023-10-12 02:15:09,332 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=907251.3333333334, ans=0.1 2023-10-12 02:15:19,456 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=907298.0, ans=0.125 2023-10-12 02:15:20,765 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=907298.0, ans=0.0 2023-10-12 02:15:40,107 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=907344.6666666666, ans=0.07 2023-10-12 02:16:12,400 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=907484.6666666666, ans=0.035 2023-10-12 02:16:13,981 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=907484.6666666666, ans=0.09899494936611666 2023-10-12 02:16:26,601 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=907531.3333333334, ans=0.125 2023-10-12 02:16:26,704 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 02:16:28,268 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=907578.0, ans=0.125 2023-10-12 02:16:46,957 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=907624.6666666666, ans=0.0 2023-10-12 02:16:47,553 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.402e+02 1.697e+02 1.826e+02 2.010e+02 2.962e+02, threshold=3.652e+02, percent-clipped=0.0 2023-10-12 02:17:08,627 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=907718.0, ans=0.125 2023-10-12 02:17:38,900 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=907858.0, ans=0.125 2023-10-12 02:17:44,816 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=907858.0, ans=0.125 2023-10-12 02:17:48,800 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.15 vs. limit=22.5 2023-10-12 02:18:03,456 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=907951.3333333334, ans=0.125 2023-10-12 02:18:08,472 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.06 vs. limit=15.0 2023-10-12 02:18:11,908 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=907951.3333333334, ans=0.5 2023-10-12 02:18:20,942 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=907998.0, ans=0.125 2023-10-12 02:18:21,144 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.88 vs. 
limit=15.0 2023-10-12 02:18:29,962 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=908044.6666666666, ans=0.125 2023-10-12 02:18:34,665 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=908044.6666666666, ans=0.0 2023-10-12 02:18:45,526 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.766e+02 1.959e+02 2.298e+02 3.074e+02, threshold=3.919e+02, percent-clipped=0.0 2023-10-12 02:18:46,507 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.24 vs. limit=22.5 2023-10-12 02:18:50,914 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=908138.0, ans=0.1 2023-10-12 02:18:57,361 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=908184.6666666666, ans=0.0 2023-10-12 02:19:01,895 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.58 vs. limit=22.5 2023-10-12 02:19:39,211 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=908324.6666666666, ans=0.125 2023-10-12 02:19:41,996 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.83 vs. limit=6.0 2023-10-12 02:19:49,747 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=908371.3333333334, ans=0.1 2023-10-12 02:19:55,486 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=908418.0, ans=0.125 2023-10-12 02:20:02,714 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=908418.0, ans=0.1 2023-10-12 02:20:05,107 INFO [train.py:1031] (0/4) Epoch 15, batch 3500, loss[loss=0.199, simple_loss=0.2909, pruned_loss=0.05352, over 16911.00 frames. ], tot_loss[loss=0.196, simple_loss=0.2858, pruned_loss=0.05313, over 27125021.20 frames. ], batch size: 110, lr: 2.36e-03, grad_scale: 16.0 2023-10-12 02:20:13,697 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=908464.6666666666, ans=0.125 2023-10-12 02:20:18,855 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=908511.3333333334, ans=0.0 2023-10-12 02:20:34,406 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 02:20:34,993 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.447e+02 1.775e+02 1.931e+02 2.165e+02 3.104e+02, threshold=3.862e+02, percent-clipped=0.0 2023-10-12 02:20:41,952 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.93 vs. 
limit=15.0 2023-10-12 02:20:48,244 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=908651.3333333334, ans=0.04949747468305833 2023-10-12 02:21:03,007 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=908698.0, ans=0.125 2023-10-12 02:21:10,834 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 02:21:13,499 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=908744.6666666666, ans=0.125 2023-10-12 02:21:43,807 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.86 vs. limit=15.0 2023-10-12 02:21:46,139 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=908838.0, ans=0.125 2023-10-12 02:22:00,796 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=908884.6666666666, ans=0.2 2023-10-12 02:22:02,024 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.80 vs. limit=22.5 2023-10-12 02:22:19,199 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=908978.0, ans=0.0 2023-10-12 02:22:33,951 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.704e+02 1.840e+02 2.033e+02 2.783e+02, threshold=3.679e+02, percent-clipped=0.0 2023-10-12 02:22:37,014 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=909071.3333333334, ans=0.95 2023-10-12 02:22:39,012 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=909071.3333333334, ans=0.125 2023-10-12 02:22:52,300 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.04 vs. 
limit=12.0 2023-10-12 02:22:58,317 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=909118.0, ans=0.2 2023-10-12 02:23:07,609 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=909164.6666666666, ans=0.125 2023-10-12 02:23:14,540 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 02:23:35,478 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=909304.6666666666, ans=0.125 2023-10-12 02:24:14,858 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=909444.6666666666, ans=0.125 2023-10-12 02:24:34,329 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.428e+02 1.619e+02 1.761e+02 2.000e+02 3.477e+02, threshold=3.522e+02, percent-clipped=0.0 2023-10-12 02:24:45,480 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=909538.0, ans=0.07 2023-10-12 02:24:57,308 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=909584.6666666666, ans=0.0 2023-10-12 02:25:13,050 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=909678.0, ans=0.125 2023-10-12 02:25:16,821 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=909678.0, ans=0.0 2023-10-12 02:25:32,062 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.52 vs. limit=15.0 2023-10-12 02:25:33,092 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=909724.6666666666, ans=0.0 2023-10-12 02:25:46,866 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=909771.3333333334, ans=0.125 2023-10-12 02:25:46,923 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=909771.3333333334, ans=0.1 2023-10-12 02:26:18,244 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.39 vs. limit=10.0 2023-10-12 02:26:20,735 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=909911.3333333334, ans=0.125 2023-10-12 02:26:22,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=909958.0, ans=0.0 2023-10-12 02:26:33,540 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 1.867e+02 2.074e+02 2.353e+02 3.207e+02, threshold=4.148e+02, percent-clipped=0.0 2023-10-12 02:26:49,078 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.40 vs. 
limit=6.0 2023-10-12 02:27:01,967 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=910098.0, ans=0.2 2023-10-12 02:27:09,974 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=910144.6666666666, ans=0.125 2023-10-12 02:27:28,682 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.09 vs. limit=22.5 2023-10-12 02:27:39,796 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=910238.0, ans=0.125 2023-10-12 02:27:56,002 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=910331.3333333334, ans=0.0 2023-10-12 02:28:06,071 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=910378.0, ans=0.125 2023-10-12 02:28:09,320 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=910378.0, ans=0.07 2023-10-12 02:28:23,958 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.328e+02 1.713e+02 1.882e+02 2.049e+02 3.259e+02, threshold=3.765e+02, percent-clipped=0.0 2023-10-12 02:28:37,663 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=910518.0, ans=0.0 2023-10-12 02:28:49,140 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.73 vs. limit=22.5 2023-10-12 02:28:50,469 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=910564.6666666666, ans=0.125 2023-10-12 02:28:51,404 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=910564.6666666666, ans=0.1 2023-10-12 02:29:00,150 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.48 vs. limit=15.0 2023-10-12 02:29:02,237 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=910611.3333333334, ans=0.5 2023-10-12 02:29:12,333 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=910658.0, ans=0.0 2023-10-12 02:29:13,187 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=910658.0, ans=0.125 2023-10-12 02:29:40,548 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=910751.3333333334, ans=0.0 2023-10-12 02:29:43,029 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=910751.3333333334, ans=0.0 2023-10-12 02:29:44,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=910751.3333333334, ans=0.125 2023-10-12 02:29:45,315 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=910798.0, ans=0.0 2023-10-12 02:29:46,032 INFO [train.py:1031] (0/4) Epoch 15, batch 4000, loss[loss=0.2213, simple_loss=0.3049, pruned_loss=0.06879, over 15983.00 frames. 
], tot_loss[loss=0.1962, simple_loss=0.2857, pruned_loss=0.05333, over 28374837.38 frames. ], batch size: 296, lr: 2.35e-03, grad_scale: 32.0 2023-10-12 02:30:17,950 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=910891.3333333334, ans=0.0 2023-10-12 02:30:20,100 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.467e+02 1.688e+02 1.861e+02 2.085e+02 3.110e+02, threshold=3.721e+02, percent-clipped=0.0 2023-10-12 02:30:21,728 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=910938.0, ans=0.125 2023-10-12 02:30:28,484 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=910938.0, ans=0.0 2023-10-12 02:30:48,942 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=911031.3333333334, ans=0.125 2023-10-12 02:31:18,406 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=911171.3333333334, ans=0.2 2023-10-12 02:31:38,656 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.43 vs. limit=15.0 2023-10-12 02:31:43,322 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=911264.6666666666, ans=0.2 2023-10-12 02:31:48,911 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=911264.6666666666, ans=0.125 2023-10-12 02:32:02,693 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=911358.0, ans=0.5 2023-10-12 02:32:03,502 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=911358.0, ans=0.0 2023-10-12 02:32:14,917 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.364e+02 1.722e+02 1.851e+02 2.066e+02 3.254e+02, threshold=3.702e+02, percent-clipped=0.0 2023-10-12 02:32:24,392 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=911404.6666666666, ans=0.125 2023-10-12 02:32:37,442 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=911451.3333333334, ans=0.0 2023-10-12 02:32:39,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=911498.0, ans=0.125 2023-10-12 02:33:00,020 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 02:33:24,906 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=911638.0, ans=0.125 2023-10-12 02:33:27,059 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=911638.0, ans=0.5 2023-10-12 02:33:35,825 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=911638.0, ans=0.02 2023-10-12 02:33:46,714 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.19 vs. 
limit=15.0 2023-10-12 02:33:57,866 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=911731.3333333334, ans=0.1 2023-10-12 02:34:16,333 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=911824.6666666666, ans=0.1 2023-10-12 02:34:23,284 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.350e+02 1.663e+02 1.788e+02 2.049e+02 3.231e+02, threshold=3.577e+02, percent-clipped=0.0 2023-10-12 02:34:30,927 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=911871.3333333334, ans=0.0 2023-10-12 02:34:33,405 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=911871.3333333334, ans=0.2 2023-10-12 02:34:37,524 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.78 vs. limit=15.0 2023-10-12 02:34:50,884 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=911964.6666666666, ans=0.0 2023-10-12 02:34:54,423 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 02:35:41,742 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=912198.0, ans=0.125 2023-10-12 02:35:49,589 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=912198.0, ans=0.125 2023-10-12 02:36:13,694 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.365e+02 1.709e+02 1.935e+02 2.177e+02 3.886e+02, threshold=3.870e+02, percent-clipped=2.0 2023-10-12 02:36:29,163 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=912384.6666666666, ans=0.125 2023-10-12 02:36:42,086 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.91 vs. 
limit=15.0 2023-10-12 02:36:44,656 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=912431.3333333334, ans=0.0 2023-10-12 02:37:03,703 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=912524.6666666666, ans=15.0 2023-10-12 02:37:05,402 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=912524.6666666666, ans=0.0 2023-10-12 02:37:10,404 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=912524.6666666666, ans=0.2 2023-10-12 02:37:27,091 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=912618.0, ans=0.1 2023-10-12 02:38:08,091 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.434e+02 1.815e+02 1.963e+02 2.194e+02 3.165e+02, threshold=3.926e+02, percent-clipped=0.0 2023-10-12 02:38:13,429 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=912804.6666666666, ans=0.125 2023-10-12 02:38:23,483 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.19 vs. limit=6.0 2023-10-12 02:38:49,254 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.18 vs. limit=15.0 2023-10-12 02:39:09,850 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=912991.3333333334, ans=0.0 2023-10-12 02:39:15,661 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=913038.0, ans=0.0 2023-10-12 02:39:39,442 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.04 vs. limit=22.5 2023-10-12 02:39:40,378 INFO [train.py:1031] (0/4) Epoch 15, batch 4500, loss[loss=0.1919, simple_loss=0.2855, pruned_loss=0.0492, over 16851.00 frames. ], tot_loss[loss=0.1962, simple_loss=0.2861, pruned_loss=0.05317, over 29354531.96 frames. ], batch size: 87, lr: 2.35e-03, grad_scale: 32.0 2023-10-12 02:40:12,021 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.456e+02 1.653e+02 1.778e+02 1.977e+02 2.656e+02, threshold=3.556e+02, percent-clipped=0.0 2023-10-12 02:40:18,738 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=913271.3333333334, ans=0.2 2023-10-12 02:40:22,996 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=913318.0, ans=0.125 2023-10-12 02:40:23,907 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=913318.0, ans=0.1 2023-10-12 02:40:24,113 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.61 vs. 
limit=15.0 2023-10-12 02:40:26,395 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=913318.0, ans=0.125 2023-10-12 02:41:09,088 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=913504.6666666666, ans=0.125 2023-10-12 02:41:27,240 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.99 vs. limit=22.5 2023-10-12 02:41:28,582 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=913551.3333333334, ans=0.015 2023-10-12 02:41:28,663 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 02:41:30,101 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.36 vs. limit=12.0 2023-10-12 02:41:57,515 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=913691.3333333334, ans=0.0 2023-10-12 02:41:58,439 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=913691.3333333334, ans=0.1 2023-10-12 02:42:01,271 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.762e+02 1.937e+02 2.212e+02 2.712e+02, threshold=3.874e+02, percent-clipped=0.0 2023-10-12 02:42:23,489 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=913831.3333333334, ans=0.125 2023-10-12 02:42:32,530 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=913831.3333333334, ans=0.2 2023-10-12 02:42:42,789 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.84 vs. limit=22.5 2023-10-12 02:42:46,518 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.24 vs. limit=15.0 2023-10-12 02:42:58,604 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=913971.3333333334, ans=0.125 2023-10-12 02:43:01,862 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=913971.3333333334, ans=0.125 2023-10-12 02:43:11,854 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=914018.0, ans=0.0 2023-10-12 02:43:44,804 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=914158.0, ans=0.0 2023-10-12 02:43:50,066 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.425e+02 1.663e+02 1.861e+02 2.060e+02 4.129e+02, threshold=3.722e+02, percent-clipped=1.0 2023-10-12 02:43:54,187 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=914204.6666666666, ans=0.0 2023-10-12 02:43:54,674 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.76 vs. 
limit=22.5 2023-10-12 02:43:58,157 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=914204.6666666666, ans=0.125 2023-10-12 02:44:06,118 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=914251.3333333334, ans=0.04949747468305833 2023-10-12 02:44:14,390 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=914298.0, ans=0.125 2023-10-12 02:45:01,754 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=914484.6666666666, ans=0.2 2023-10-12 02:45:13,831 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=914531.3333333334, ans=0.1 2023-10-12 02:45:30,029 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=914578.0, ans=0.125 2023-10-12 02:45:36,545 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.14 vs. limit=22.5 2023-10-12 02:45:39,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=914624.6666666666, ans=0.125 2023-10-12 02:45:45,781 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.471e+02 1.704e+02 1.897e+02 2.105e+02 2.995e+02, threshold=3.795e+02, percent-clipped=0.0 2023-10-12 02:46:03,339 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=914718.0, ans=0.125 2023-10-12 02:46:19,690 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.31 vs. limit=15.0 2023-10-12 02:46:25,064 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=914811.3333333334, ans=0.0 2023-10-12 02:46:59,070 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=914951.3333333334, ans=0.0 2023-10-12 02:47:20,614 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.06 vs. 
limit=15.0 2023-10-12 02:47:35,739 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=915091.3333333334, ans=0.125 2023-10-12 02:47:38,639 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=915091.3333333334, ans=0.0 2023-10-12 02:47:44,023 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.682e+02 1.859e+02 2.153e+02 2.870e+02, threshold=3.718e+02, percent-clipped=0.0 2023-10-12 02:47:54,295 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=915138.0, ans=0.0 2023-10-12 02:48:07,200 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=915184.6666666666, ans=0.015 2023-10-12 02:48:12,053 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=915231.3333333334, ans=0.125 2023-10-12 02:48:15,207 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=915231.3333333334, ans=0.0 2023-10-12 02:48:17,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=915231.3333333334, ans=0.125 2023-10-12 02:48:17,701 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=915231.3333333334, ans=0.2 2023-10-12 02:48:32,328 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=915324.6666666666, ans=0.1 2023-10-12 02:48:40,780 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=915324.6666666666, ans=22.5 2023-10-12 02:48:50,432 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=915371.3333333334, ans=0.07 2023-10-12 02:48:56,269 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=915418.0, ans=0.0 2023-10-12 02:49:06,553 INFO [train.py:1031] (0/4) Epoch 15, batch 5000, loss[loss=0.2061, simple_loss=0.2964, pruned_loss=0.05794, over 16739.00 frames. ], tot_loss[loss=0.1964, simple_loss=0.286, pruned_loss=0.05334, over 30140178.60 frames. 
], batch size: 202, lr: 2.35e-03, grad_scale: 32.0 2023-10-12 02:49:06,801 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=915464.6666666666, ans=0.125 2023-10-12 02:49:06,802 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=915464.6666666666, ans=0.0 2023-10-12 02:49:06,819 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=915464.6666666666, ans=0.0 2023-10-12 02:49:11,248 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=915464.6666666666, ans=0.0 2023-10-12 02:49:18,147 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=915511.3333333334, ans=0.0 2023-10-12 02:49:19,318 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=915511.3333333334, ans=0.0 2023-10-12 02:49:40,159 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=915558.0, ans=0.09899494936611666 2023-10-12 02:49:41,660 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.370e+02 1.707e+02 1.868e+02 2.029e+02 3.211e+02, threshold=3.737e+02, percent-clipped=0.0 2023-10-12 02:49:44,029 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=915604.6666666666, ans=0.125 2023-10-12 02:49:44,894 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=915604.6666666666, ans=0.125 2023-10-12 02:49:48,660 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=915604.6666666666, ans=0.125 2023-10-12 02:49:52,641 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.45 vs. limit=15.0 2023-10-12 02:50:03,248 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=915698.0, ans=0.0 2023-10-12 02:50:32,471 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=915791.3333333334, ans=15.0 2023-10-12 02:50:43,554 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.80 vs. 
limit=10.0 2023-10-12 02:51:07,340 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=915931.3333333334, ans=0.125 2023-10-12 02:51:16,842 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=915978.0, ans=0.0 2023-10-12 02:51:20,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=915978.0, ans=0.0 2023-10-12 02:51:31,595 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=916024.6666666666, ans=0.0 2023-10-12 02:51:36,064 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.380e+02 1.669e+02 1.787e+02 1.999e+02 3.159e+02, threshold=3.573e+02, percent-clipped=0.0 2023-10-12 02:51:37,469 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.34 vs. limit=22.5 2023-10-12 02:51:55,986 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=916118.0, ans=0.0 2023-10-12 02:52:02,136 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.25 vs. limit=15.0 2023-10-12 02:52:16,268 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=916211.3333333334, ans=0.1 2023-10-12 02:52:21,901 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=916258.0, ans=0.125 2023-10-12 02:52:28,473 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=916258.0, ans=0.125 2023-10-12 02:52:29,384 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=916258.0, ans=0.125 2023-10-12 02:52:33,358 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.49 vs. limit=15.0 2023-10-12 02:52:33,513 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=11.55 vs. limit=15.0 2023-10-12 02:52:40,792 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=15.71 vs. 
limit=22.5 2023-10-12 02:52:42,295 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=916351.3333333334, ans=0.04949747468305833 2023-10-12 02:52:59,368 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=916398.0, ans=0.2 2023-10-12 02:53:00,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=916398.0, ans=0.125 2023-10-12 02:53:01,999 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=916398.0, ans=0.125 2023-10-12 02:53:06,605 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=916444.6666666666, ans=0.1 2023-10-12 02:53:08,494 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=916444.6666666666, ans=0.125 2023-10-12 02:53:11,564 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=916444.6666666666, ans=0.0 2023-10-12 02:53:17,374 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=916491.3333333334, ans=0.2 2023-10-12 02:53:25,132 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.721e+02 1.949e+02 2.277e+02 3.151e+02, threshold=3.899e+02, percent-clipped=0.0 2023-10-12 02:53:43,512 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.54 vs. limit=15.0 2023-10-12 02:53:49,371 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=916631.3333333334, ans=0.125 2023-10-12 02:54:09,124 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=916678.0, ans=0.1 2023-10-12 02:54:09,232 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.73 vs. limit=15.0 2023-10-12 02:54:25,673 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=916771.3333333334, ans=0.0 2023-10-12 02:54:44,515 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.05 vs. limit=15.0 2023-10-12 02:54:49,581 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.80 vs. 
limit=22.5 2023-10-12 02:54:55,785 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=916864.6666666666, ans=0.0 2023-10-12 02:55:20,976 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=916958.0, ans=0.0 2023-10-12 02:55:24,479 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.662e+02 1.821e+02 1.998e+02 3.002e+02, threshold=3.641e+02, percent-clipped=0.0 2023-10-12 02:55:34,463 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=917004.6666666666, ans=0.1 2023-10-12 02:55:51,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=917098.0, ans=0.125 2023-10-12 02:56:24,766 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=917238.0, ans=0.125 2023-10-12 02:56:36,025 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=917284.6666666666, ans=0.2 2023-10-12 02:56:46,730 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=917331.3333333334, ans=0.125 2023-10-12 02:56:50,606 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=917331.3333333334, ans=0.1 2023-10-12 02:57:05,723 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=917424.6666666666, ans=0.125 2023-10-12 02:57:15,842 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=917424.6666666666, ans=0.04949747468305833 2023-10-12 02:57:17,593 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.661e+02 1.823e+02 1.960e+02 3.194e+02, threshold=3.646e+02, percent-clipped=0.0 2023-10-12 02:57:21,874 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=917471.3333333334, ans=0.125 2023-10-12 02:57:30,176 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=917518.0, ans=0.1 2023-10-12 02:58:12,704 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=917704.6666666666, ans=0.125 2023-10-12 02:58:35,959 INFO [train.py:1031] (0/4) Epoch 15, batch 5500, loss[loss=0.182, simple_loss=0.2812, pruned_loss=0.04141, over 16555.00 frames. ], tot_loss[loss=0.196, simple_loss=0.2858, pruned_loss=0.05305, over 30756130.09 frames. 
], batch size: 219, lr: 2.34e-03, grad_scale: 32.0 2023-10-12 02:58:38,015 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=917798.0, ans=0.1 2023-10-12 02:58:56,313 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=917891.3333333334, ans=0.0 2023-10-12 02:59:06,924 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.346e+02 1.660e+02 1.785e+02 1.989e+02 2.915e+02, threshold=3.570e+02, percent-clipped=0.0 2023-10-12 02:59:09,068 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=917938.0, ans=0.0 2023-10-12 02:59:15,713 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=917938.0, ans=0.125 2023-10-12 02:59:19,294 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.86 vs. limit=15.0 2023-10-12 02:59:26,523 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=917984.6666666666, ans=0.0 2023-10-12 02:59:30,775 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.41 vs. limit=5.0 2023-10-12 02:59:47,822 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=918078.0, ans=0.1 2023-10-12 02:59:50,303 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.64 vs. limit=15.0 2023-10-12 03:00:00,697 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=918171.3333333334, ans=0.2 2023-10-12 03:00:01,568 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=918171.3333333334, ans=0.0 2023-10-12 03:00:21,217 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=918218.0, ans=0.0 2023-10-12 03:00:45,461 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=918358.0, ans=0.0 2023-10-12 03:00:49,207 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=918358.0, ans=0.125 2023-10-12 03:00:50,189 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=918358.0, ans=0.125 2023-10-12 03:00:51,029 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=918358.0, ans=0.1 2023-10-12 03:00:56,724 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.766e+02 1.963e+02 2.144e+02 3.137e+02, threshold=3.925e+02, percent-clipped=0.0 2023-10-12 03:01:09,056 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.12 vs. 
limit=22.5 2023-10-12 03:01:09,740 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=918451.3333333334, ans=0.07 2023-10-12 03:01:20,661 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=918498.0, ans=0.125 2023-10-12 03:01:24,278 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.37 vs. limit=6.0 2023-10-12 03:01:56,439 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=918638.0, ans=0.0 2023-10-12 03:01:57,288 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=918638.0, ans=0.125 2023-10-12 03:02:05,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=918684.6666666666, ans=0.125 2023-10-12 03:02:08,424 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 03:02:09,441 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=918684.6666666666, ans=0.125 2023-10-12 03:02:34,535 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=918778.0, ans=0.125 2023-10-12 03:02:51,983 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.290e+02 1.673e+02 1.868e+02 2.178e+02 3.112e+02, threshold=3.736e+02, percent-clipped=0.0 2023-10-12 03:03:15,249 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=918964.6666666666, ans=0.125 2023-10-12 03:03:20,134 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=918964.6666666666, ans=0.0 2023-10-12 03:03:43,452 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=919058.0, ans=0.0 2023-10-12 03:03:48,205 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=919104.6666666666, ans=0.125 2023-10-12 03:03:52,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=919104.6666666666, ans=0.125 2023-10-12 03:04:12,847 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.80 vs. 
limit=15.0 2023-10-12 03:04:16,278 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=919198.0, ans=0.0 2023-10-12 03:04:37,027 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=919291.3333333334, ans=0.125 2023-10-12 03:04:46,191 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.342e+02 1.736e+02 1.896e+02 2.130e+02 3.157e+02, threshold=3.793e+02, percent-clipped=0.0 2023-10-12 03:05:00,391 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=919384.6666666666, ans=0.125 2023-10-12 03:05:10,839 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=919431.3333333334, ans=0.125 2023-10-12 03:05:33,265 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=919524.6666666666, ans=0.0 2023-10-12 03:05:41,052 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=919524.6666666666, ans=0.1 2023-10-12 03:06:06,434 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=919664.6666666666, ans=0.0 2023-10-12 03:06:09,381 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=919664.6666666666, ans=0.125 2023-10-12 03:06:10,273 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=919664.6666666666, ans=0.1 2023-10-12 03:06:14,130 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.19 vs. limit=10.0 2023-10-12 03:06:28,923 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=919758.0, ans=0.0 2023-10-12 03:06:31,678 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=919758.0, ans=0.125 2023-10-12 03:06:31,975 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.21 vs. 
limit=15.0 2023-10-12 03:06:38,007 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=919758.0, ans=0.125 2023-10-12 03:06:38,279 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=919758.0, ans=15.0 2023-10-12 03:06:41,662 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-12 03:06:42,234 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.663e+02 1.806e+02 2.065e+02 2.820e+02, threshold=3.611e+02, percent-clipped=0.0 2023-10-12 03:06:46,457 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.whiten.whitening_limit, batch_count=919804.6666666666, ans=12.0 2023-10-12 03:06:47,200 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=919804.6666666666, ans=0.04949747468305833 2023-10-12 03:07:09,689 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=919898.0, ans=0.125 2023-10-12 03:07:11,442 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=919898.0, ans=0.1 2023-10-12 03:07:13,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=919944.6666666666, ans=0.125 2023-10-12 03:07:16,130 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.75 vs. limit=10.0 2023-10-12 03:07:16,796 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 03:07:23,911 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=919991.3333333334, ans=0.125 2023-10-12 03:07:24,081 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.45 vs. limit=15.0 2023-10-12 03:07:27,824 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.50 vs. limit=15.0 2023-10-12 03:07:36,512 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=920038.0, ans=0.125 2023-10-12 03:07:44,293 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=920084.6666666666, ans=0.125 2023-10-12 03:07:51,851 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=920084.6666666666, ans=0.125 2023-10-12 03:07:53,174 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=920084.6666666666, ans=0.1 2023-10-12 03:07:58,349 INFO [train.py:1031] (0/4) Epoch 15, batch 6000, loss[loss=0.1944, simple_loss=0.2828, pruned_loss=0.05298, over 16252.00 frames. ], tot_loss[loss=0.1965, simple_loss=0.2863, pruned_loss=0.05336, over 31203402.30 frames. 
], batch size: 50, lr: 2.34e-03, grad_scale: 16.0 2023-10-12 03:08:07,836 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=920178.0, ans=0.125 2023-10-12 03:08:32,185 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.695e+02 1.842e+02 2.023e+02 2.720e+02, threshold=3.683e+02, percent-clipped=0.0 2023-10-12 03:08:56,206 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=920364.6666666666, ans=0.2 2023-10-12 03:09:00,688 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.73 vs. limit=22.5 2023-10-12 03:09:14,587 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=920458.0, ans=0.0 2023-10-12 03:09:39,991 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=920551.3333333334, ans=0.125 2023-10-12 03:09:46,391 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.02 vs. limit=15.0 2023-10-12 03:09:53,430 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=920598.0, ans=0.2 2023-10-12 03:10:21,497 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.709e+02 1.870e+02 2.124e+02 2.739e+02, threshold=3.741e+02, percent-clipped=0.0 2023-10-12 03:10:45,153 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=920831.3333333334, ans=0.125 2023-10-12 03:11:07,593 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=920924.6666666666, ans=0.0 2023-10-12 03:11:25,941 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=921018.0, ans=0.1 2023-10-12 03:11:35,060 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=921018.0, ans=0.1 2023-10-12 03:12:14,104 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.312e+02 1.782e+02 1.949e+02 2.266e+02 3.079e+02, threshold=3.898e+02, percent-clipped=0.0 2023-10-12 03:12:14,809 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=921204.6666666666, ans=0.1 2023-10-12 03:12:16,887 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=921204.6666666666, ans=0.07 2023-10-12 03:12:24,271 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.87 vs. 
limit=15.0 2023-10-12 03:12:28,245 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=921251.3333333334, ans=0.125 2023-10-12 03:12:40,434 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=921298.0, ans=0.125 2023-10-12 03:12:42,553 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=921298.0, ans=0.1 2023-10-12 03:12:42,563 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=921298.0, ans=0.07 2023-10-12 03:12:54,401 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 03:13:11,835 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=921438.0, ans=0.125 2023-10-12 03:13:16,139 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.85 vs. limit=15.0 2023-10-12 03:13:27,952 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=921484.6666666666, ans=0.125 2023-10-12 03:13:42,098 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=921578.0, ans=0.125 2023-10-12 03:13:55,037 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=921624.6666666666, ans=0.125 2023-10-12 03:13:56,941 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=921624.6666666666, ans=0.1 2023-10-12 03:14:08,166 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.506e+02 1.862e+02 2.069e+02 2.311e+02 3.162e+02, threshold=4.139e+02, percent-clipped=0.0 2023-10-12 03:14:09,981 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=921671.3333333334, ans=0.125 2023-10-12 03:14:46,541 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=921811.3333333334, ans=0.09899494936611666 2023-10-12 03:14:48,001 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.18 vs. limit=15.0 2023-10-12 03:15:00,252 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=921858.0, ans=0.05 2023-10-12 03:15:08,038 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.86 vs. 
limit=22.5 2023-10-12 03:15:09,861 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=921904.6666666666, ans=0.1 2023-10-12 03:15:12,934 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=921904.6666666666, ans=0.125 2023-10-12 03:15:54,167 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=922091.3333333334, ans=0.125 2023-10-12 03:16:10,460 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.402e+02 1.622e+02 1.795e+02 2.132e+02 3.159e+02, threshold=3.589e+02, percent-clipped=0.0 2023-10-12 03:16:29,827 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=3.89 vs. limit=15.0 2023-10-12 03:16:35,074 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=922231.3333333334, ans=0.125 2023-10-12 03:16:47,633 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=922278.0, ans=0.0 2023-10-12 03:16:56,563 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.92 vs. limit=15.0 2023-10-12 03:17:10,528 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=922371.3333333334, ans=0.1 2023-10-12 03:17:31,677 INFO [train.py:1031] (0/4) Epoch 15, batch 6500, loss[loss=0.189, simple_loss=0.2723, pruned_loss=0.05287, over 16628.00 frames. ], tot_loss[loss=0.1969, simple_loss=0.2868, pruned_loss=0.05348, over 31582380.16 frames. ], batch size: 61, lr: 2.34e-03, grad_scale: 16.0 2023-10-12 03:17:34,495 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=922464.6666666666, ans=0.0 2023-10-12 03:17:57,397 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=922558.0, ans=0.0 2023-10-12 03:18:00,968 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 03:18:17,209 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.436e+02 1.758e+02 1.901e+02 2.104e+02 2.557e+02, threshold=3.802e+02, percent-clipped=0.0 2023-10-12 03:18:39,776 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=922698.0, ans=0.125 2023-10-12 03:18:46,471 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=922698.0, ans=0.1 2023-10-12 03:19:11,868 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.09 vs. limit=15.0 2023-10-12 03:19:20,595 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.24 vs. limit=15.0 2023-10-12 03:19:26,989 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.71 vs. 
limit=22.5 2023-10-12 03:19:50,550 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=922978.0, ans=0.125 2023-10-12 03:19:54,734 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=922978.0, ans=0.1 2023-10-12 03:19:59,026 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=923024.6666666666, ans=0.1 2023-10-12 03:20:11,573 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.710e+02 1.917e+02 2.096e+02 2.986e+02, threshold=3.834e+02, percent-clipped=0.0 2023-10-12 03:20:25,645 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.99 vs. limit=15.0 2023-10-12 03:20:32,758 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=923164.6666666666, ans=0.0 2023-10-12 03:20:46,616 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.75 vs. limit=10.0 2023-10-12 03:20:49,003 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=923258.0, ans=0.1 2023-10-12 03:20:55,106 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=923258.0, ans=0.2 2023-10-12 03:20:55,394 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.82 vs. limit=22.5 2023-10-12 03:21:01,455 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=923304.6666666666, ans=0.09899494936611666 2023-10-12 03:21:06,759 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=923304.6666666666, ans=0.125 2023-10-12 03:21:28,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=923398.0, ans=0.125 2023-10-12 03:21:32,432 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=923444.6666666666, ans=0.125 2023-10-12 03:21:44,189 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.49 vs. 
limit=22.5 2023-10-12 03:21:44,764 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=923491.3333333334, ans=0.1 2023-10-12 03:21:51,946 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=923491.3333333334, ans=0.125 2023-10-12 03:21:57,305 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=923538.0, ans=0.1 2023-10-12 03:21:58,779 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.657e+02 1.816e+02 2.074e+02 2.899e+02, threshold=3.633e+02, percent-clipped=0.0 2023-10-12 03:22:00,850 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=923538.0, ans=0.125 2023-10-12 03:22:02,107 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.25 vs. limit=15.0 2023-10-12 03:22:07,734 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.63 vs. limit=15.0 2023-10-12 03:22:13,071 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.11 vs. limit=6.0 2023-10-12 03:22:16,403 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=923584.6666666666, ans=0.125 2023-10-12 03:22:18,129 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=923631.3333333334, ans=0.125 2023-10-12 03:22:28,511 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=923631.3333333334, ans=0.125 2023-10-12 03:22:52,132 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.48 vs. limit=6.0 2023-10-12 03:23:24,229 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=923864.6666666666, ans=0.1 2023-10-12 03:23:31,059 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=923864.6666666666, ans=0.125 2023-10-12 03:23:32,691 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.80 vs. limit=15.0 2023-10-12 03:23:53,044 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.87 vs. 
limit=22.5 2023-10-12 03:23:56,848 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=923958.0, ans=0.125 2023-10-12 03:24:07,192 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.642e+02 1.830e+02 2.021e+02 2.847e+02, threshold=3.659e+02, percent-clipped=0.0 2023-10-12 03:24:24,900 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=924098.0, ans=0.0 2023-10-12 03:24:27,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=924098.0, ans=0.5 2023-10-12 03:24:28,869 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.whiten.whitening_limit, batch_count=924098.0, ans=12.0 2023-10-12 03:24:49,399 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=924191.3333333334, ans=0.0 2023-10-12 03:24:51,681 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.max_abs, batch_count=924191.3333333334, ans=10.0 2023-10-12 03:25:03,234 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=924238.0, ans=0.1 2023-10-12 03:25:07,430 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=924238.0, ans=0.125 2023-10-12 03:25:08,621 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=924238.0, ans=0.1 2023-10-12 03:25:23,205 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=924284.6666666666, ans=0.09899494936611666 2023-10-12 03:25:25,799 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=924331.3333333334, ans=0.0 2023-10-12 03:26:01,889 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.432e+02 1.622e+02 1.788e+02 1.977e+02 2.604e+02, threshold=3.577e+02, percent-clipped=0.0 2023-10-12 03:26:02,196 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=924471.3333333334, ans=0.2 2023-10-12 03:26:14,380 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-12 03:26:24,301 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=924564.6666666666, ans=0.1 2023-10-12 03:26:25,174 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=924564.6666666666, ans=0.125 2023-10-12 03:26:35,140 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.91 vs. limit=15.0 2023-10-12 03:26:37,974 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.44 vs. 
limit=22.5 2023-10-12 03:26:59,409 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=924704.6666666666, ans=0.125 2023-10-12 03:26:59,597 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=924704.6666666666, ans=0.1 2023-10-12 03:27:06,946 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=924751.3333333334, ans=0.125 2023-10-12 03:27:07,873 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=924751.3333333334, ans=0.125 2023-10-12 03:27:14,385 INFO [train.py:1031] (0/4) Epoch 15, batch 7000, loss[loss=0.2149, simple_loss=0.2722, pruned_loss=0.07881, over 11990.00 frames. ], tot_loss[loss=0.197, simple_loss=0.2872, pruned_loss=0.05337, over 31877509.76 frames. ], batch size: 440, lr: 2.34e-03, grad_scale: 32.0 2023-10-12 03:27:40,001 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.27 vs. limit=15.0 2023-10-12 03:27:41,673 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=924891.3333333334, ans=0.125 2023-10-12 03:27:51,631 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=924938.0, ans=0.0 2023-10-12 03:27:55,954 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.759e+02 1.899e+02 2.182e+02 3.212e+02, threshold=3.799e+02, percent-clipped=0.0 2023-10-12 03:27:56,345 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=924938.0, ans=0.0 2023-10-12 03:27:56,407 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=924938.0, ans=0.2 2023-10-12 03:27:57,348 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=924938.0, ans=0.2 2023-10-12 03:28:02,954 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=924984.6666666666, ans=0.125 2023-10-12 03:28:06,010 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=924984.6666666666, ans=0.0 2023-10-12 03:28:21,578 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=925031.3333333334, ans=0.125 2023-10-12 03:28:35,526 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.86 vs. limit=6.0 2023-10-12 03:28:55,510 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.43 vs. 
limit=15.0 2023-10-12 03:29:10,679 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=925264.6666666666, ans=0.0 2023-10-12 03:29:18,238 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=925264.6666666666, ans=0.125 2023-10-12 03:29:20,023 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=925264.6666666666, ans=0.05 2023-10-12 03:29:20,043 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=925264.6666666666, ans=0.125 2023-10-12 03:29:20,077 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 03:29:41,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=925358.0, ans=0.0 2023-10-12 03:29:51,826 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.436e+02 1.810e+02 1.962e+02 2.161e+02 3.267e+02, threshold=3.923e+02, percent-clipped=0.0 2023-10-12 03:30:00,529 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=925451.3333333334, ans=0.125 2023-10-12 03:30:12,106 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=925498.0, ans=0.0 2023-10-12 03:30:26,281 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=925544.6666666666, ans=0.2 2023-10-12 03:30:30,383 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=925544.6666666666, ans=0.125 2023-10-12 03:30:50,868 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=925638.0, ans=0.2 2023-10-12 03:31:07,464 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=925731.3333333334, ans=0.125 2023-10-12 03:31:07,468 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=925731.3333333334, ans=0.125 2023-10-12 03:31:13,077 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=925731.3333333334, ans=0.0 2023-10-12 03:31:25,341 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=6.48 vs. 
limit=15.0
2023-10-12 03:31:25,341 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten.whitening_limit, batch_count=925778.0, ans=15.0
2023-10-12 03:31:56,460 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.683e+02 1.852e+02 2.052e+02 2.591e+02, threshold=3.704e+02, percent-clipped=0.0
2023-10-12 03:31:59,089 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=925871.3333333334, ans=0.125
2023-10-12 03:32:17,232 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=925918.0, ans=0.1
2023-10-12 03:32:20,378 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_ff3.min_abs, batch_count=925964.6666666666, ans=0.2
2023-10-12 03:32:22,334 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=925964.6666666666, ans=0.1
2023-10-12 03:32:24,360 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=925964.6666666666, ans=0.125
2023-10-12 03:32:44,804 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=926058.0, ans=0.0
2023-10-12 03:32:57,406 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=926104.6666666666, ans=0.0
2023-10-12 03:33:02,304 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=926104.6666666666, ans=0.09899494936611666
2023-10-12 03:33:03,500 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=926104.6666666666, ans=0.1
2023-10-12 03:33:29,638 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=926244.6666666666, ans=0.125
2023-10-12 03:33:30,694 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=926244.6666666666, ans=0.125
2023-10-12 03:33:33,202 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=926244.6666666666, ans=0.05
2023-10-12 03:33:35,934 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=926244.6666666666, ans=0.1
2023-10-12 03:33:38,563 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=926244.6666666666, ans=0.0
2023-10-12 03:33:39,498 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=926244.6666666666, ans=0.125
2023-10-12 03:33:48,863 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=926291.3333333334, ans=0.0
2023-10-12 03:33:50,184 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=926291.3333333334, ans=15.0
2023-10-12 03:33:56,517 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=926338.0, ans=0.125
2023-10-12 03:33:58,951 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.268e+02 1.690e+02 1.804e+02 1.982e+02 2.772e+02, threshold=3.608e+02, percent-clipped=0.0
2023-10-12 03:34:48,056 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=926524.6666666666, ans=0.0
2023-10-12 03:34:50,099 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=926524.6666666666, ans=0.125
2023-10-12 03:35:18,845 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.96 vs. limit=15.0
2023-10-12 03:35:43,281 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.26 vs. limit=10.0
2023-10-12 03:35:55,727 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.769e+02 2.005e+02 2.243e+02 2.986e+02, threshold=4.010e+02, percent-clipped=0.0
2023-10-12 03:36:07,597 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=926851.3333333334, ans=0.5
2023-10-12 03:36:08,920 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.16 vs. limit=10.0
2023-10-12 03:36:20,010 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=926898.0, ans=0.0
2023-10-12 03:36:24,409 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=926944.6666666666, ans=10.0
2023-10-12 03:36:24,488 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=926944.6666666666, ans=0.125
2023-10-12 03:36:32,023 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-12 03:36:46,465 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=927038.0, ans=0.0
2023-10-12 03:37:05,063 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=927084.6666666666, ans=0.125
2023-10-12 03:37:11,371 INFO [train.py:1031] (0/4) Epoch 15, batch 7500, loss[loss=0.1975, simple_loss=0.2872, pruned_loss=0.05393, over 16911.00 frames. ], tot_loss[loss=0.1972, simple_loss=0.2871, pruned_loss=0.05365, over 32039017.77 frames. ], batch size: 104, lr: 2.33e-03, grad_scale: 16.0
2023-10-12 03:37:29,002 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=927178.0, ans=0.125
2023-10-12 03:37:31,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=927178.0, ans=0.125
2023-10-12 03:37:39,108 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=927224.6666666666, ans=0.0
2023-10-12 03:37:45,860 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=927271.3333333334, ans=0.0
2023-10-12 03:37:48,352 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.697e+02 1.862e+02 2.064e+02 3.666e+02, threshold=3.724e+02, percent-clipped=0.0
2023-10-12 03:37:52,078 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.max_abs, batch_count=927271.3333333334, ans=10.0
2023-10-12 03:38:28,069 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=927411.3333333334, ans=0.1
2023-10-12 03:38:59,829 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=927551.3333333334, ans=0.125
2023-10-12 03:39:12,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=927598.0, ans=0.05
2023-10-12 03:39:19,590 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=927644.6666666666, ans=0.125
2023-10-12 03:39:23,343 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=927644.6666666666, ans=0.1
2023-10-12 03:39:23,349 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=927644.6666666666, ans=0.0
2023-10-12 03:39:24,981 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=927644.6666666666, ans=0.125
2023-10-12 03:39:30,990 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=927691.3333333334, ans=0.0
2023-10-12 03:39:36,031 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=927691.3333333334, ans=0.0
2023-10-12 03:39:49,808 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.422e+02 1.739e+02 1.863e+02 2.084e+02 2.661e+02, threshold=3.727e+02, percent-clipped=0.0
2023-10-12 03:40:13,962 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=927831.3333333334, ans=0.1
2023-10-12 03:40:21,641 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=927831.3333333334, ans=0.125
2023-10-12 03:40:23,936 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.54 vs. limit=15.0
2023-10-12 03:40:46,042 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=927924.6666666666, ans=0.125
2023-10-12 03:41:06,529 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=928018.0, ans=0.125
2023-10-12 03:41:15,040 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.04 vs. limit=15.0
2023-10-12 03:41:22,873 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=928064.6666666666, ans=0.2
2023-10-12 03:41:40,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=928158.0, ans=0.125
2023-10-12 03:41:40,674 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-10-12 03:41:41,768 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=928158.0, ans=0.125
2023-10-12 03:41:53,309 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.317e+02 1.651e+02 1.791e+02 1.930e+02 3.019e+02, threshold=3.583e+02, percent-clipped=0.0
2023-10-12 03:41:55,555 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=928204.6666666666, ans=0.1
2023-10-12 03:42:10,358 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.41 vs. limit=15.0
2023-10-12 03:42:16,033 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=928298.0, ans=0.125
2023-10-12 03:42:54,711 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.49 vs. limit=15.0
2023-10-12 03:42:55,404 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=928484.6666666666, ans=0.125
2023-10-12 03:42:59,894 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=928484.6666666666, ans=0.2
2023-10-12 03:43:10,248 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.40 vs. limit=15.0
2023-10-12 03:43:45,872 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=928671.3333333334, ans=0.0
2023-10-12 03:43:50,662 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.761e+02 1.904e+02 2.172e+02 2.664e+02, threshold=3.809e+02, percent-clipped=0.0
2023-10-12 03:44:00,891 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.90 vs. limit=15.0
2023-10-12 03:44:17,829 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.21 vs. limit=10.0
2023-10-12 03:44:24,435 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=928811.3333333334, ans=0.2
2023-10-12 03:44:25,342 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=928811.3333333334, ans=0.125
2023-10-12 03:44:25,448 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=928811.3333333334, ans=0.0
2023-10-12 03:44:31,846 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=928858.0, ans=0.125
2023-10-12 03:44:35,816 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=928858.0, ans=0.1
2023-10-12 03:44:42,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=928904.6666666666, ans=0.125
2023-10-12 03:44:42,572 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=928904.6666666666, ans=0.0
2023-10-12 03:44:52,894 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.31 vs. limit=15.0
2023-10-12 03:44:56,168 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=928951.3333333334, ans=0.125
2023-10-12 03:45:06,046 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=928998.0, ans=0.0
2023-10-12 03:45:13,171 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=928998.0, ans=0.2
2023-10-12 03:45:33,087 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=929091.3333333334, ans=0.125
2023-10-12 03:45:48,366 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.264e+02 1.669e+02 1.824e+02 2.023e+02 2.861e+02, threshold=3.649e+02, percent-clipped=0.0
2023-10-12 03:45:51,292 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=929138.0, ans=0.125
2023-10-12 03:45:52,539 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=929138.0, ans=0.0
2023-10-12 03:46:13,613 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=929231.3333333334, ans=0.125
2023-10-12 03:46:17,768 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=929278.0, ans=0.125
2023-10-12 03:46:40,716 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=929371.3333333334, ans=0.1
2023-10-12 03:46:46,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=929371.3333333334, ans=0.125
2023-10-12 03:46:57,371 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=929418.0, ans=0.125
2023-10-12 03:47:06,127 INFO [train.py:1031] (0/4) Epoch 15, batch 8000, loss[loss=0.1858, simple_loss=0.2802, pruned_loss=0.04574, over 16894.00 frames. ], tot_loss[loss=0.1964, simple_loss=0.2865, pruned_loss=0.05319, over 32191403.32 frames. ], batch size: 72, lr: 2.33e-03, grad_scale: 32.0
2023-10-12 03:47:15,956 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=929464.6666666666, ans=0.0
2023-10-12 03:47:16,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=929511.3333333334, ans=0.1
2023-10-12 03:47:35,549 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=929558.0, ans=0.125
2023-10-12 03:47:36,020 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.32 vs. limit=15.0
2023-10-12 03:47:36,579 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=929558.0, ans=0.125
2023-10-12 03:47:45,832 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.367e+02 1.585e+02 1.702e+02 1.898e+02 3.170e+02, threshold=3.404e+02, percent-clipped=0.0
2023-10-12 03:47:51,314 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=929651.3333333334, ans=0.125
2023-10-12 03:47:54,293 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=929651.3333333334, ans=0.0
2023-10-12 03:48:01,500 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=929698.0, ans=0.2
2023-10-12 03:48:07,419 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=929698.0, ans=0.025
2023-10-12 03:48:17,723 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=929744.6666666666, ans=0.125
2023-10-12 03:49:12,378 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=929978.0, ans=0.125
2023-10-12 03:49:23,118 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=930024.6666666666, ans=0.0
2023-10-12 03:49:32,777 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.362e+02 1.705e+02 1.792e+02 1.964e+02 2.510e+02, threshold=3.584e+02, percent-clipped=0.0
2023-10-12 03:49:41,031 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.15 vs. limit=22.5
2023-10-12 03:49:59,349 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=930164.6666666666, ans=0.125
2023-10-12 03:49:59,456 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=930164.6666666666, ans=0.125
2023-10-12 03:50:14,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=930211.3333333334, ans=0.125
2023-10-12 03:50:44,947 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=930304.6666666666, ans=0.1
2023-10-12 03:50:46,218 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=930304.6666666666, ans=0.0
2023-10-12 03:50:46,486 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.34 vs. limit=15.0
2023-10-12 03:50:50,365 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-10-12 03:51:02,857 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=930398.0, ans=0.1
2023-10-12 03:51:09,737 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-12 03:51:10,488 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=930398.0, ans=0.125
2023-10-12 03:51:20,508 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=930444.6666666666, ans=0.05
2023-10-12 03:51:41,106 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=930538.0, ans=0.0
2023-10-12 03:51:42,344 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=930538.0, ans=0.0
2023-10-12 03:51:43,935 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.281e+02 1.700e+02 1.811e+02 2.027e+02 2.668e+02, threshold=3.622e+02, percent-clipped=0.0
2023-10-12 03:52:04,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=930631.3333333334, ans=0.125
2023-10-12 03:52:04,248 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.71 vs. limit=22.5
2023-10-12 03:52:15,254 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=930678.0, ans=0.125
2023-10-12 03:52:45,568 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.87 vs. limit=22.5
2023-10-12 03:52:48,707 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=930818.0, ans=0.125
2023-10-12 03:52:50,137 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=930818.0, ans=0.0
2023-10-12 03:52:51,487 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=930818.0, ans=0.035
2023-10-12 03:53:00,347 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=930864.6666666666, ans=10.0
2023-10-12 03:53:04,954 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=930864.6666666666, ans=0.125
2023-10-12 03:53:04,979 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=930864.6666666666, ans=0.125
2023-10-12 03:53:13,919 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=13.15 vs. limit=15.0
2023-10-12 03:53:16,137 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=930911.3333333334, ans=0.05
2023-10-12 03:53:17,575 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=930911.3333333334, ans=0.2
2023-10-12 03:53:34,223 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.72 vs. limit=15.0
2023-10-12 03:53:36,912 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.364e+02 1.703e+02 1.902e+02 2.056e+02 2.906e+02, threshold=3.804e+02, percent-clipped=0.0
2023-10-12 03:53:42,431 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=931051.3333333334, ans=0.04949747468305833
2023-10-12 03:53:52,994 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=931098.0, ans=0.1
2023-10-12 03:54:25,139 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=931238.0, ans=0.125
2023-10-12 03:54:29,167 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=931238.0, ans=0.125
2023-10-12 03:54:49,941 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=931331.3333333334, ans=0.2
2023-10-12 03:55:02,069 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=931378.0, ans=0.125
2023-10-12 03:55:07,099 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=931378.0, ans=0.125
2023-10-12 03:55:16,438 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=931424.6666666666, ans=0.0
2023-10-12 03:55:33,139 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.816e+02 2.076e+02 2.357e+02 3.127e+02, threshold=4.152e+02, percent-clipped=0.0
2023-10-12 03:55:41,819 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=931518.0, ans=0.0
2023-10-12 03:56:22,051 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.93 vs. limit=15.0
2023-10-12 03:56:33,902 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.58 vs. limit=15.0
2023-10-12 03:56:40,188 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=931751.3333333334, ans=0.1
2023-10-12 03:56:48,300 INFO [train.py:1031] (0/4) Epoch 15, batch 8500, loss[loss=0.2126, simple_loss=0.2988, pruned_loss=0.06321, over 16902.00 frames. ], tot_loss[loss=0.1963, simple_loss=0.2867, pruned_loss=0.05302, over 32327081.63 frames. ], batch size: 130, lr: 2.33e-03, grad_scale: 16.0
2023-10-12 03:57:05,169 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=931844.6666666666, ans=0.0
2023-10-12 03:57:24,719 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=931938.0, ans=0.0
2023-10-12 03:57:24,973 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.20 vs. limit=15.0
2023-10-12 03:57:30,429 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.748e+02 1.958e+02 2.292e+02 3.324e+02, threshold=3.916e+02, percent-clipped=0.0
2023-10-12 03:57:41,077 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=931984.6666666666, ans=0.0
2023-10-12 03:57:41,341 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten.whitening_limit, batch_count=931984.6666666666, ans=22.5
2023-10-12 03:57:53,416 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=932031.3333333334, ans=0.125
2023-10-12 03:58:00,361 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=932078.0, ans=0.0
2023-10-12 03:58:07,023 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=932078.0, ans=0.2
2023-10-12 03:58:17,484 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.61 vs. limit=6.0
2023-10-12 03:58:21,811 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=932124.6666666666, ans=0.125
2023-10-12 03:58:22,095 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.51 vs. limit=10.0
2023-10-12 03:58:25,494 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=932171.3333333334, ans=0.0
2023-10-12 03:58:30,535 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.26 vs. limit=12.0
2023-10-12 03:58:58,995 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=932264.6666666666, ans=0.2
2023-10-12 03:59:13,863 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=932311.3333333334, ans=0.0
2023-10-12 03:59:25,832 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=932358.0, ans=10.0
2023-10-12 03:59:27,165 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=932358.0, ans=0.2
2023-10-12 03:59:35,937 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.328e+02 1.689e+02 1.910e+02 2.109e+02 2.909e+02, threshold=3.820e+02, percent-clipped=0.0
2023-10-12 04:00:55,183 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=932684.6666666666, ans=0.125
2023-10-12 04:00:59,550 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.01 vs. limit=15.0
2023-10-12 04:01:00,175 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=932731.3333333334, ans=0.0
2023-10-12 04:01:12,401 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=932778.0, ans=0.125
2023-10-12 04:01:22,332 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=932824.6666666666, ans=0.2
2023-10-12 04:01:35,739 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.19 vs. limit=10.0
2023-10-12 04:01:39,966 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.360e+02 1.690e+02 1.819e+02 2.014e+02 2.879e+02, threshold=3.638e+02, percent-clipped=0.0
2023-10-12 04:01:45,633 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.18 vs. limit=15.0
2023-10-12 04:01:50,814 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=932918.0, ans=0.025
2023-10-12 04:01:52,735 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=932918.0, ans=0.125
2023-10-12 04:01:55,801 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.74 vs. limit=10.0
2023-10-12 04:02:27,974 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=933058.0, ans=0.125
2023-10-12 04:02:54,515 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=933151.3333333334, ans=0.1
2023-10-12 04:02:55,492 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-12 04:03:12,699 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.50 vs. limit=15.0
2023-10-12 04:03:25,692 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=933291.3333333334, ans=0.0
2023-10-12 04:03:28,415 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-200000.pt
2023-10-12 04:03:39,139 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.317e+02 1.696e+02 1.840e+02 2.219e+02 3.082e+02, threshold=3.680e+02, percent-clipped=0.0
2023-10-12 04:03:57,959 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=933431.3333333334, ans=0.125
2023-10-12 04:04:05,429 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.89 vs. limit=22.5
2023-10-12 04:04:21,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=933524.6666666666, ans=0.125
2023-10-12 04:04:21,730 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=933524.6666666666, ans=0.5
2023-10-12 04:04:22,144 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.91 vs. limit=15.0
2023-10-12 04:04:33,852 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.11 vs. limit=15.0
2023-10-12 04:04:35,431 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=933571.3333333334, ans=0.125
2023-10-12 04:04:35,488 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=933571.3333333334, ans=0.0
2023-10-12 04:04:59,647 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=933664.6666666666, ans=0.125
2023-10-12 04:05:06,744 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=933711.3333333334, ans=0.2
2023-10-12 04:05:17,393 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=933758.0, ans=0.2
2023-10-12 04:05:17,593 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.17 vs. limit=22.5
2023-10-12 04:05:22,414 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=933804.6666666666, ans=0.0
2023-10-12 04:05:29,137 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.456e+02 1.760e+02 1.961e+02 2.268e+02 3.397e+02, threshold=3.923e+02, percent-clipped=0.0
2023-10-12 04:05:32,901 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=933851.3333333334, ans=0.0
2023-10-12 04:05:33,988 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=933851.3333333334, ans=0.125
2023-10-12 04:05:37,632 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=933851.3333333334, ans=0.125
2023-10-12 04:06:18,070 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=934038.0, ans=0.125
2023-10-12 04:06:20,102 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-10-12 04:06:32,384 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-12 04:06:32,522 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.28 vs. limit=15.0
2023-10-12 04:06:40,684 INFO [train.py:1031] (0/4) Epoch 15, batch 9000, loss[loss=0.1746, simple_loss=0.2703, pruned_loss=0.03948, over 16362.00 frames. ], tot_loss[loss=0.1959, simple_loss=0.2861, pruned_loss=0.05286, over 32432448.94 frames. ], batch size: 50, lr: 2.32e-03, grad_scale: 32.0
2023-10-12 04:06:40,899 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=934131.3333333334, ans=0.125
2023-10-12 04:06:44,915 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=934131.3333333334, ans=0.125
2023-10-12 04:07:01,776 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=934224.6666666666, ans=0.07
2023-10-12 04:07:08,167 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.25 vs. limit=22.5
2023-10-12 04:07:10,945 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=934224.6666666666, ans=0.1
2023-10-12 04:07:18,294 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.704e+02 1.924e+02 2.199e+02 3.028e+02, threshold=3.847e+02, percent-clipped=0.0
2023-10-12 04:07:26,041 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=934318.0, ans=0.125
2023-10-12 04:07:31,859 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=934318.0, ans=0.125
2023-10-12 04:07:31,972 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=934318.0, ans=0.125
2023-10-12 04:07:38,591 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.95 vs. limit=22.5
2023-10-12 04:08:07,423 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten.whitening_limit, batch_count=934504.6666666666, ans=15.0
2023-10-12 04:08:10,865 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=934504.6666666666, ans=0.1
2023-10-12 04:08:18,031 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=934551.3333333334, ans=0.125
2023-10-12 04:08:36,735 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-12 04:08:43,989 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=934644.6666666666, ans=0.125
2023-10-12 04:09:04,227 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.339e+02 1.675e+02 1.807e+02 2.098e+02 2.989e+02, threshold=3.614e+02, percent-clipped=0.0
2023-10-12 04:09:05,750 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.32 vs. limit=15.0
2023-10-12 04:09:15,639 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=934784.6666666666, ans=0.125
2023-10-12 04:09:15,735 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.72 vs. limit=15.0
2023-10-12 04:09:24,294 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=934831.3333333334, ans=0.1
2023-10-12 04:09:49,193 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.53 vs. limit=15.0
2023-10-12 04:09:54,943 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.91 vs. limit=6.0
2023-10-12 04:09:58,017 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=934971.3333333334, ans=0.125
2023-10-12 04:10:02,045 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=935018.0, ans=0.125
2023-10-12 04:10:02,346 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.96 vs. limit=15.0
2023-10-12 04:10:10,061 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=935018.0, ans=0.125
2023-10-12 04:10:23,522 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=935111.3333333334, ans=0.2
2023-10-12 04:10:34,687 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=935158.0, ans=0.1
2023-10-12 04:10:42,961 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=935158.0, ans=0.1
2023-10-12 04:10:46,984 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=935204.6666666666, ans=0.05
2023-10-12 04:10:51,208 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.526e+02 1.759e+02 1.900e+02 2.075e+02 2.991e+02, threshold=3.801e+02, percent-clipped=0.0
2023-10-12 04:11:00,776 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=935251.3333333334, ans=6.0
2023-10-12 04:11:01,528 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=935251.3333333334, ans=0.0
2023-10-12 04:11:15,091 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=935344.6666666666, ans=0.125
2023-10-12 04:11:36,874 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=935438.0, ans=0.2
2023-10-12 04:11:50,812 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-12 04:12:08,566 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=935531.3333333334, ans=0.1
2023-10-12 04:12:09,974 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.01 vs. limit=15.0
2023-10-12 04:12:14,373 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.15 vs. limit=6.0
2023-10-12 04:12:17,470 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.40 vs. limit=15.0
2023-10-12 04:12:26,544 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=935624.6666666666, ans=0.1
2023-10-12 04:12:36,396 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.84 vs. limit=12.0
2023-10-12 04:12:42,006 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.446e+02 1.750e+02 2.046e+02 2.411e+02 3.580e+02, threshold=4.092e+02, percent-clipped=0.0
2023-10-12 04:13:04,587 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=935764.6666666666, ans=0.1
2023-10-12 04:13:16,369 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=935811.3333333334, ans=0.0
2023-10-12 04:13:18,115 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=935811.3333333334, ans=0.125
2023-10-12 04:13:22,342 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=935811.3333333334, ans=0.2
2023-10-12 04:13:39,539 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.56 vs. limit=22.5
2023-10-12 04:14:21,271 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=936044.6666666666, ans=0.125
2023-10-12 04:14:26,548 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=936091.3333333334, ans=0.0
2023-10-12 04:14:29,415 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=936091.3333333334, ans=0.09899494936611666
2023-10-12 04:14:38,596 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.75 vs. limit=15.0
2023-10-12 04:14:46,389 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.800e+02 1.944e+02 2.285e+02 3.152e+02, threshold=3.887e+02, percent-clipped=0.0
2023-10-12 04:14:47,311 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=936138.0, ans=0.125
2023-10-12 04:14:55,355 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=936184.6666666666, ans=0.1
2023-10-12 04:15:01,546 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.94 vs. limit=15.0
2023-10-12 04:15:32,917 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=936324.6666666666, ans=0.1
2023-10-12 04:15:43,359 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=936371.3333333334, ans=0.1
2023-10-12 04:15:57,704 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=936418.0, ans=0.125
2023-10-12 04:15:59,366 INFO [train.py:1031] (0/4) Epoch 15, batch 9500, loss[loss=0.2046, simple_loss=0.3019, pruned_loss=0.0537, over 16726.00 frames. ], tot_loss[loss=0.1966, simple_loss=0.2868, pruned_loss=0.05322, over 32488193.62 frames. ], batch size: 202, lr: 2.32e-03, grad_scale: 16.0
2023-10-12 04:16:01,509 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.37 vs. limit=12.0
2023-10-12 04:16:12,622 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=936511.3333333334, ans=0.1
2023-10-12 04:16:13,428 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=936511.3333333334, ans=10.0
2023-10-12 04:16:25,378 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.74 vs. limit=15.0
2023-10-12 04:16:35,498 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=936604.6666666666, ans=0.0
2023-10-12 04:16:40,033 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.777e+02 2.002e+02 2.178e+02 2.753e+02, threshold=4.004e+02, percent-clipped=0.0
2023-10-12 04:16:42,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=936651.3333333334, ans=0.0
2023-10-12 04:16:46,388 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=936651.3333333334, ans=0.0
2023-10-12 04:16:46,430 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=936651.3333333334, ans=0.0
2023-10-12 04:16:52,777 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=936651.3333333334, ans=0.0
2023-10-12 04:17:00,168 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0
2023-10-12 04:17:00,760 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=936698.0, ans=0.0
2023-10-12 04:17:18,829 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=936791.3333333334, ans=0.125
2023-10-12 04:17:24,235 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=936791.3333333334, ans=0.1
2023-10-12 04:17:26,762 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.58 vs. limit=6.0
2023-10-12 04:17:40,680 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.33 vs. limit=15.0
2023-10-12 04:17:48,625 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.38 vs. limit=15.0
2023-10-12 04:17:51,799 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=936931.3333333334, ans=0.0
2023-10-12 04:18:09,939 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=936978.0, ans=0.1
2023-10-12 04:18:17,254 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=937024.6666666666, ans=0.0
2023-10-12 04:18:18,174 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=937024.6666666666, ans=0.1
2023-10-12 04:18:28,519 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=937071.3333333334, ans=0.0
2023-10-12 04:18:34,599 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.815e+02 2.001e+02 2.371e+02 3.138e+02, threshold=4.003e+02, percent-clipped=0.0
2023-10-12 04:18:46,150 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=937118.0, ans=0.0
2023-10-12 04:19:11,735 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=937211.3333333334, ans=0.07
2023-10-12 04:19:19,910 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=937258.0, ans=0.1
2023-10-12 04:19:40,956 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-12 04:19:45,343 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=937398.0, ans=0.125
2023-10-12 04:20:19,721 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=937538.0, ans=0.125
2023-10-12 04:20:25,029 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.398e+02 1.688e+02 1.828e+02 1.993e+02 3.347e+02, threshold=3.656e+02, percent-clipped=0.0
2023-10-12 04:20:53,455 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=937678.0, ans=0.0
2023-10-12 04:20:56,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=937678.0, ans=0.0
2023-10-12 04:21:00,366 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=937678.0, ans=0.125
2023-10-12 04:21:26,946 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.70 vs. limit=22.5
2023-10-12 04:21:38,749 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=937864.6666666666, ans=0.1
2023-10-12 04:21:44,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=937864.6666666666, ans=0.2
2023-10-12 04:21:46,599 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=937864.6666666666, ans=0.125
2023-10-12 04:21:49,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=937864.6666666666, ans=0.0
2023-10-12 04:21:49,491 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.89 vs. limit=15.0
2023-10-12 04:21:58,900 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=937911.3333333334, ans=0.125
2023-10-12 04:22:18,458 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-12 04:22:20,202 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.373e+02 1.707e+02 1.887e+02 2.106e+02 2.847e+02, threshold=3.774e+02, percent-clipped=0.0
2023-10-12 04:22:29,389 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.81 vs. limit=22.5
2023-10-12 04:22:34,149 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=938098.0, ans=0.125
2023-10-12 04:23:27,032 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-12 04:23:43,113 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=938378.0, ans=0.125
2023-10-12 04:23:56,012 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=938424.6666666666, ans=0.125
2023-10-12 04:23:59,806 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=938471.3333333334, ans=0.2
2023-10-12 04:24:05,672 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.417e+02 1.748e+02 1.910e+02 2.064e+02 2.716e+02, threshold=3.820e+02, percent-clipped=0.0
2023-10-12 04:24:12,308 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=938518.0, ans=0.0
2023-10-12 04:24:39,823 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=938658.0, ans=0.125
2023-10-12 04:24:44,881 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=938658.0, ans=0.2
2023-10-12 04:24:59,100 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.44 vs. limit=22.5
2023-10-12 04:25:01,338 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_ff2.min_abs, batch_count=938751.3333333334, ans=0.1
2023-10-12 04:25:11,111 INFO [train.py:1031] (0/4) Epoch 15, batch 10000, loss[loss=0.1912, simple_loss=0.2876, pruned_loss=0.04742, over 16240.00 frames. ], tot_loss[loss=0.1958, simple_loss=0.2859, pruned_loss=0.05282, over 32554382.07 frames. ], batch size: 43, lr: 2.32e-03, grad_scale: 32.0
2023-10-12 04:25:52,068 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.702e+02 1.906e+02 2.079e+02 2.703e+02, threshold=3.811e+02, percent-clipped=0.0
2023-10-12 04:25:59,136 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=938984.6666666666, ans=0.125
2023-10-12 04:26:01,324 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=938984.6666666666, ans=0.125
2023-10-12 04:26:16,099 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-12 04:26:43,167 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=939171.3333333334, ans=0.125
2023-10-12 04:26:43,249 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=939171.3333333334, ans=0.0
2023-10-12 04:26:47,213 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=939171.3333333334, ans=0.0
2023-10-12 04:27:03,968 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=939264.6666666666, ans=0.125
2023-10-12 04:27:11,442 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-12 04:27:19,327 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=939311.3333333334, ans=0.1
2023-10-12 04:27:26,068 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=939311.3333333334, ans=0.125
2023-10-12 04:27:29,918 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.64 vs. limit=15.0
2023-10-12 04:27:36,105 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.14 vs. limit=15.0
2023-10-12 04:27:38,090 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.18 vs. limit=22.5
2023-10-12 04:27:39,617 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=939404.6666666666, ans=0.1
2023-10-12 04:27:43,732 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.63 vs. limit=6.0
2023-10-12 04:27:46,026 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 1.813e+02 2.045e+02 2.298e+02 3.543e+02, threshold=4.090e+02, percent-clipped=0.0
2023-10-12 04:28:04,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=939498.0, ans=0.2
2023-10-12 04:28:05,659 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=939498.0, ans=0.0
2023-10-12 04:28:21,566 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=939544.6666666666, ans=0.07
2023-10-12 04:28:26,329 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=939591.3333333334, ans=0.125
2023-10-12 04:28:32,657 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.22 vs. limit=22.5
2023-10-12 04:28:34,506 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=939638.0, ans=0.1
2023-10-12 04:28:38,396 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.06 vs. limit=15.0
2023-10-12 04:28:54,444 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=939684.6666666666, ans=0.1
2023-10-12 04:29:00,055 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=939731.3333333334, ans=0.0
2023-10-12 04:29:24,446 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.10 vs. limit=22.5
2023-10-12 04:29:36,896 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=939871.3333333334, ans=0.2
2023-10-12 04:29:45,297 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.522e+02 1.762e+02 1.938e+02 2.134e+02 3.322e+02, threshold=3.877e+02, percent-clipped=0.0
2023-10-12 04:29:49,333 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.43 vs. limit=12.0
2023-10-12 04:30:04,864 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=939964.6666666666, ans=0.0
2023-10-12 04:30:07,940 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.91 vs. limit=12.0
2023-10-12 04:30:13,688 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.59 vs. limit=15.0
2023-10-12 04:30:50,732 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=940151.3333333334, ans=0.125
2023-10-12 04:31:05,042 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=940198.0, ans=0.2
2023-10-12 04:31:27,010 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=940291.3333333334, ans=0.0
2023-10-12 04:31:39,330 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=940338.0, ans=0.0
2023-10-12 04:31:40,814 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.757e+02 1.932e+02 2.214e+02 2.714e+02, threshold=3.864e+02, percent-clipped=0.0
2023-10-12 04:31:43,115 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=940384.6666666666, ans=0.2
2023-10-12 04:31:56,816 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=940384.6666666666, ans=0.0
2023-10-12 04:31:58,086 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=940431.3333333334, ans=10.0
2023-10-12 04:32:00,443 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.56 vs. limit=22.5
2023-10-12 04:32:00,633 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.71 vs. limit=12.0
2023-10-12 04:32:12,767 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=6.60 vs. limit=15.0
2023-10-12 04:32:18,160 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=940478.0, ans=0.125
2023-10-12 04:32:33,982 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=940571.3333333334, ans=0.125
2023-10-12 04:32:43,571 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.97 vs. limit=6.0
2023-10-12 04:33:10,210 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=940711.3333333334, ans=0.1
2023-10-12 04:33:16,504 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=940711.3333333334, ans=0.0
2023-10-12 04:33:29,473 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.05 vs. limit=15.0
2023-10-12 04:33:30,362 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=940758.0, ans=0.125
2023-10-12 04:33:31,647 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=940758.0, ans=0.125
2023-10-12 04:33:42,788 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.776e+02 2.022e+02 2.467e+02 4.054e+02, threshold=4.045e+02, percent-clipped=2.0
2023-10-12 04:33:51,115 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=940851.3333333334, ans=0.035
2023-10-12 04:34:01,907 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.91 vs. limit=22.5
2023-10-12 04:34:08,868 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=940944.6666666666, ans=0.125
2023-10-12 04:34:12,481 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=940944.6666666666, ans=0.125
2023-10-12 04:34:19,230 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=940991.3333333334, ans=0.125
2023-10-12 04:34:49,047 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=941084.6666666666, ans=0.09899494936611666
2023-10-12 04:34:49,979 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=941084.6666666666, ans=0.09899494936611666
2023-10-12 04:34:51,534 INFO [train.py:1031] (0/4) Epoch 15, batch 10500, loss[loss=0.1873, simple_loss=0.2859, pruned_loss=0.04434, over 16850.00 frames. ], tot_loss[loss=0.1961, simple_loss=0.2864, pruned_loss=0.05284, over 32618868.99 frames. ], batch size: 155, lr: 2.32e-03, grad_scale: 32.0
2023-10-12 04:34:54,786 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=941131.3333333334, ans=0.2
2023-10-12 04:35:02,668 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=941178.0, ans=0.015
2023-10-12 04:35:07,212 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=941178.0, ans=0.2
2023-10-12 04:35:24,601 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=941271.3333333334, ans=0.0
2023-10-12 04:35:31,233 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.645e+02 1.828e+02 2.081e+02 2.557e+02, threshold=3.655e+02, percent-clipped=0.0
2023-10-12 04:35:50,026 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=941364.6666666666, ans=0.125
2023-10-12 04:36:06,451 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.10 vs. limit=12.0
2023-10-12 04:36:12,524 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-12 04:36:18,636 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.95 vs. limit=6.0
2023-10-12 04:36:26,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=941504.6666666666, ans=0.125
2023-10-12 04:36:31,783 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.whiten.whitening_limit, batch_count=941504.6666666666, ans=12.0
2023-10-12 04:36:39,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=941551.3333333334, ans=0.125
2023-10-12 04:36:45,527 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=941551.3333333334, ans=0.05
2023-10-12 04:36:49,795 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=941551.3333333334, ans=0.125
2023-10-12 04:36:52,605 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=941598.0, ans=0.035
2023-10-12 04:36:57,320 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.56 vs. limit=15.0
2023-10-12 04:37:03,909 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-12 04:37:34,415 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.759e+02 1.918e+02 2.153e+02 2.989e+02, threshold=3.836e+02, percent-clipped=0.0
2023-10-12 04:37:34,829 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=941738.0, ans=0.125
2023-10-12 04:37:38,787 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=941784.6666666666, ans=0.125
2023-10-12 04:37:48,638 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=941831.3333333334, ans=0.125
2023-10-12 04:38:22,508 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=941924.6666666666, ans=0.0
2023-10-12 04:38:28,532 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=5.16 vs. limit=15.0
2023-10-12 04:38:30,661 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.70 vs. limit=15.0
2023-10-12 04:38:30,666 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.12 vs. limit=15.0
2023-10-12 04:38:36,428 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.68 vs. limit=15.0
2023-10-12 04:38:45,231 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.39 vs.
limit=15.0 2023-10-12 04:39:12,448 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=942111.3333333334, ans=0.1 2023-10-12 04:39:20,066 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=942158.0, ans=0.125 2023-10-12 04:39:35,722 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.398e+02 1.747e+02 1.897e+02 2.079e+02 3.171e+02, threshold=3.793e+02, percent-clipped=0.0 2023-10-12 04:39:41,058 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=942251.3333333334, ans=0.1 2023-10-12 04:39:55,043 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=942298.0, ans=10.0 2023-10-12 04:39:55,072 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=942298.0, ans=0.125 2023-10-12 04:39:56,152 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=942298.0, ans=0.1 2023-10-12 04:40:01,982 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=942344.6666666666, ans=0.125 2023-10-12 04:40:04,511 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.34 vs. limit=15.0 2023-10-12 04:40:15,216 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=942391.3333333334, ans=0.125 2023-10-12 04:40:24,579 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=942438.0, ans=0.125 2023-10-12 04:41:18,334 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=942624.6666666666, ans=0.2 2023-10-12 04:41:18,738 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.90 vs. limit=10.0 2023-10-12 04:41:20,336 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.03 vs. limit=15.0 2023-10-12 04:41:29,590 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.29 vs. limit=15.0 2023-10-12 04:41:29,738 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.405e+02 1.689e+02 1.891e+02 2.126e+02 2.910e+02, threshold=3.782e+02, percent-clipped=0.0 2023-10-12 04:41:33,133 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.19 vs. 
limit=22.5 2023-10-12 04:41:47,465 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=942764.6666666666, ans=0.125 2023-10-12 04:41:56,063 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=942811.3333333334, ans=0.1 2023-10-12 04:42:04,930 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=942811.3333333334, ans=0.2 2023-10-12 04:42:05,250 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.85 vs. limit=15.0 2023-10-12 04:42:33,572 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=942951.3333333334, ans=0.2 2023-10-12 04:42:38,019 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=942951.3333333334, ans=10.0 2023-10-12 04:42:38,874 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=942951.3333333334, ans=0.1 2023-10-12 04:42:55,314 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=943044.6666666666, ans=0.125 2023-10-12 04:43:13,154 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.02 vs. limit=22.5 2023-10-12 04:43:22,353 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.388e+02 1.614e+02 1.745e+02 1.899e+02 2.509e+02, threshold=3.489e+02, percent-clipped=0.0 2023-10-12 04:43:30,838 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=943184.6666666666, ans=0.1 2023-10-12 04:43:31,936 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=943184.6666666666, ans=0.09899494936611666 2023-10-12 04:43:31,992 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=943184.6666666666, ans=0.0 2023-10-12 04:43:37,484 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=943231.3333333334, ans=0.125 2023-10-12 04:43:42,187 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.95 vs. 
limit=10.0 2023-10-12 04:44:08,933 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=943371.3333333334, ans=15.0 2023-10-12 04:44:17,617 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=943371.3333333334, ans=0.125 2023-10-12 04:44:21,035 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=943418.0, ans=0.0 2023-10-12 04:44:23,031 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=943418.0, ans=0.0 2023-10-12 04:44:23,127 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=943418.0, ans=0.125 2023-10-12 04:44:30,096 INFO [train.py:1031] (0/4) Epoch 15, batch 11000, loss[loss=0.1822, simple_loss=0.2813, pruned_loss=0.04159, over 16906.00 frames. ], tot_loss[loss=0.1963, simple_loss=0.2866, pruned_loss=0.05303, over 32672746.77 frames. ], batch size: 87, lr: 2.31e-03, grad_scale: 16.0 2023-10-12 04:44:33,167 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 04:45:05,250 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=943604.6666666666, ans=0.125 2023-10-12 04:45:05,327 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=943604.6666666666, ans=0.125 2023-10-12 04:45:10,811 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=943604.6666666666, ans=0.125 2023-10-12 04:45:12,400 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.364e+02 1.679e+02 1.854e+02 2.071e+02 3.107e+02, threshold=3.708e+02, percent-clipped=0.0 2023-10-12 04:45:28,261 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.71 vs. limit=15.0 2023-10-12 04:45:28,772 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=943698.0, ans=0.95 2023-10-12 04:46:01,768 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.58 vs. limit=12.0 2023-10-12 04:46:08,902 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.11 vs. limit=22.5 2023-10-12 04:46:17,477 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=943884.6666666666, ans=0.125 2023-10-12 04:46:45,823 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=943978.0, ans=0.0 2023-10-12 04:46:53,179 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.32 vs. 
limit=22.5 2023-10-12 04:47:18,048 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.231e+02 1.677e+02 1.834e+02 2.076e+02 3.870e+02, threshold=3.669e+02, percent-clipped=1.0 2023-10-12 04:47:19,795 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=944118.0, ans=0.0 2023-10-12 04:47:27,566 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=5.00 vs. limit=12.0 2023-10-12 04:48:43,206 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=944444.6666666666, ans=0.1 2023-10-12 04:49:07,364 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 04:49:11,167 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.410e+02 1.686e+02 1.962e+02 2.213e+02 3.457e+02, threshold=3.924e+02, percent-clipped=0.0 2023-10-12 04:49:12,432 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=944584.6666666666, ans=0.0 2023-10-12 04:49:22,968 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.61 vs. limit=15.0 2023-10-12 04:49:25,545 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=944631.3333333334, ans=0.0 2023-10-12 04:49:27,019 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=944631.3333333334, ans=0.0 2023-10-12 04:49:43,869 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=944678.0, ans=10.0 2023-10-12 04:49:49,136 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.96 vs. limit=15.0 2023-10-12 04:49:56,125 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=944724.6666666666, ans=0.0 2023-10-12 04:50:05,321 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=944771.3333333334, ans=0.0 2023-10-12 04:50:28,767 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.87 vs. limit=15.0 2023-10-12 04:50:32,612 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=944864.6666666666, ans=0.125 2023-10-12 04:51:07,647 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.386e+02 1.685e+02 1.820e+02 1.989e+02 2.541e+02, threshold=3.640e+02, percent-clipped=0.0 2023-10-12 04:51:16,495 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=945051.3333333334, ans=0.125 2023-10-12 04:51:20,185 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=945051.3333333334, ans=0.0 2023-10-12 04:51:20,486 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.54 vs. 
limit=15.0 2023-10-12 04:51:22,602 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=945098.0, ans=0.125 2023-10-12 04:51:24,715 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=945098.0, ans=0.2 2023-10-12 04:51:34,655 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=945144.6666666666, ans=0.0 2023-10-12 04:51:35,726 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=945144.6666666666, ans=0.07 2023-10-12 04:51:43,478 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=945144.6666666666, ans=0.0 2023-10-12 04:51:50,560 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=945191.3333333334, ans=0.1 2023-10-12 04:51:53,579 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=945191.3333333334, ans=0.2 2023-10-12 04:51:54,844 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.24 vs. limit=12.0 2023-10-12 04:52:23,188 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=945331.3333333334, ans=0.09899494936611666 2023-10-12 04:52:24,557 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-10-12 04:52:36,798 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.95 vs. limit=22.5 2023-10-12 04:52:45,446 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=945424.6666666666, ans=0.1 2023-10-12 04:52:58,000 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=945471.3333333334, ans=0.1 2023-10-12 04:52:59,062 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=945471.3333333334, ans=0.2 2023-10-12 04:53:06,095 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.755e+02 1.894e+02 2.250e+02 3.690e+02, threshold=3.788e+02, percent-clipped=1.0 2023-10-12 04:53:25,273 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=945564.6666666666, ans=0.1 2023-10-12 04:53:42,908 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=945658.0, ans=0.0 2023-10-12 04:53:50,432 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=945704.6666666666, ans=0.2 2023-10-12 04:53:58,474 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=945704.6666666666, ans=0.0 2023-10-12 04:54:12,409 INFO [train.py:1031] (0/4) Epoch 15, batch 11500, loss[loss=0.1946, simple_loss=0.2893, pruned_loss=0.04997, over 16904.00 frames. ], tot_loss[loss=0.1959, simple_loss=0.2862, pruned_loss=0.0528, over 32712703.45 frames. 
], batch size: 77, lr: 2.31e-03, grad_scale: 32.0 2023-10-12 04:54:18,261 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.34 vs. limit=10.0 2023-10-12 04:54:22,563 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=945844.6666666666, ans=0.0 2023-10-12 04:54:30,590 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=945844.6666666666, ans=0.1 2023-10-12 04:54:43,896 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.12 vs. limit=22.5 2023-10-12 04:54:46,034 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.19 vs. limit=15.0 2023-10-12 04:54:54,703 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.749e+02 1.922e+02 2.149e+02 2.799e+02, threshold=3.845e+02, percent-clipped=0.0 2023-10-12 04:55:09,063 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.47 vs. limit=22.5 2023-10-12 04:55:24,486 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.91 vs. limit=22.5 2023-10-12 04:55:48,537 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=946171.3333333334, ans=0.0 2023-10-12 04:56:05,542 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=946218.0, ans=0.0 2023-10-12 04:56:17,573 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=946264.6666666666, ans=0.2 2023-10-12 04:56:33,767 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.00 vs. limit=15.0 2023-10-12 04:56:46,342 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=946404.6666666666, ans=0.0 2023-10-12 04:56:50,984 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.80 vs. limit=22.5 2023-10-12 04:56:52,101 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.10 vs. 
limit=15.0 2023-10-12 04:56:56,277 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=946404.6666666666, ans=0.0 2023-10-12 04:56:56,855 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.642e+02 1.818e+02 1.985e+02 2.610e+02, threshold=3.636e+02, percent-clipped=0.0 2023-10-12 04:57:01,797 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=946451.3333333334, ans=0.0 2023-10-12 04:57:19,851 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=946544.6666666666, ans=0.125 2023-10-12 04:57:51,901 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=946684.6666666666, ans=0.125 2023-10-12 04:58:08,069 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=946731.3333333334, ans=0.0 2023-10-12 04:58:10,429 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=946731.3333333334, ans=0.0 2023-10-12 04:58:10,515 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=946731.3333333334, ans=0.125 2023-10-12 04:58:28,322 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=946824.6666666666, ans=0.1 2023-10-12 04:58:30,958 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=946824.6666666666, ans=0.0 2023-10-12 04:58:44,298 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.302e+02 1.825e+02 1.986e+02 2.243e+02 3.156e+02, threshold=3.973e+02, percent-clipped=0.0 2023-10-12 04:58:46,430 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=946918.0, ans=0.0 2023-10-12 04:59:02,667 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=946964.6666666666, ans=0.1 2023-10-12 04:59:24,311 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=4.03 vs. limit=12.0 2023-10-12 04:59:29,548 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=947058.0, ans=0.2 2023-10-12 04:59:46,270 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=947104.6666666666, ans=0.125 2023-10-12 05:00:50,486 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=947338.0, ans=0.2 2023-10-12 05:00:56,323 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.759e+02 1.969e+02 2.208e+02 3.067e+02, threshold=3.937e+02, percent-clipped=0.0 2023-10-12 05:01:08,025 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=5.23 vs. 
limit=12.0 2023-10-12 05:01:11,935 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=947431.3333333334, ans=15.0 2023-10-12 05:01:36,751 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.94 vs. limit=10.0 2023-10-12 05:02:16,843 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=947664.6666666666, ans=0.0 2023-10-12 05:02:31,710 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=947758.0, ans=0.09899494936611666 2023-10-12 05:02:45,159 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=947804.6666666666, ans=0.2 2023-10-12 05:02:50,212 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=947804.6666666666, ans=0.125 2023-10-12 05:02:51,626 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.845e+02 2.014e+02 2.346e+02 3.364e+02, threshold=4.028e+02, percent-clipped=0.0 2023-10-12 05:02:57,199 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=947851.3333333334, ans=0.125 2023-10-12 05:03:00,164 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=947851.3333333334, ans=0.125 2023-10-12 05:03:01,533 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=947851.3333333334, ans=0.04949747468305833 2023-10-12 05:03:17,451 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=947944.6666666666, ans=0.125 2023-10-12 05:03:35,301 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=947991.3333333334, ans=0.1 2023-10-12 05:03:59,642 INFO [train.py:1031] (0/4) Epoch 15, batch 12000, loss[loss=0.212, simple_loss=0.2738, pruned_loss=0.0751, over 12767.00 frames. ], tot_loss[loss=0.1957, simple_loss=0.2862, pruned_loss=0.05254, over 32751256.37 frames. ], batch size: 440, lr: 2.31e-03, grad_scale: 32.0 2023-10-12 05:04:42,995 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.764e+02 1.914e+02 2.129e+02 3.192e+02, threshold=3.828e+02, percent-clipped=0.0 2023-10-12 05:04:43,343 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=948318.0, ans=0.2 2023-10-12 05:04:49,174 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.85 vs. 
limit=10.0 2023-10-12 05:05:02,230 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=948364.6666666666, ans=0.1 2023-10-12 05:05:19,408 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=948411.3333333334, ans=0.125 2023-10-12 05:05:28,773 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=948458.0, ans=15.0 2023-10-12 05:05:37,079 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=948504.6666666666, ans=0.125 2023-10-12 05:05:45,549 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.65 vs. limit=15.0 2023-10-12 05:06:07,202 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=948644.6666666666, ans=0.2 2023-10-12 05:06:17,731 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=948691.3333333334, ans=0.125 2023-10-12 05:06:24,195 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.32 vs. limit=15.0 2023-10-12 05:06:26,266 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=948691.3333333334, ans=0.125 2023-10-12 05:06:41,311 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.670e+02 1.838e+02 2.068e+02 3.005e+02, threshold=3.675e+02, percent-clipped=0.0 2023-10-12 05:06:49,522 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=948784.6666666666, ans=0.0 2023-10-12 05:06:52,171 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=948831.3333333334, ans=0.0 2023-10-12 05:06:58,851 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=948831.3333333334, ans=0.125 2023-10-12 05:07:01,425 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=948878.0, ans=0.125 2023-10-12 05:07:08,183 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=948878.0, ans=0.125 2023-10-12 05:07:19,169 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.max_positive, batch_count=948924.6666666666, ans=0.95 2023-10-12 05:07:26,599 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.01 vs. 
limit=22.5 2023-10-12 05:07:28,299 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=948971.3333333334, ans=0.125 2023-10-12 05:07:43,565 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=949064.6666666666, ans=0.0 2023-10-12 05:07:46,284 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=949064.6666666666, ans=0.125 2023-10-12 05:07:53,641 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=949064.6666666666, ans=0.125 2023-10-12 05:08:08,508 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=949158.0, ans=0.125 2023-10-12 05:08:11,036 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.72 vs. limit=10.0 2023-10-12 05:08:30,263 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.718e+02 1.869e+02 2.137e+02 3.010e+02, threshold=3.738e+02, percent-clipped=0.0 2023-10-12 05:08:38,982 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=949251.3333333334, ans=0.125 2023-10-12 05:08:45,263 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=949298.0, ans=0.0 2023-10-12 05:08:48,106 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=949298.0, ans=0.1 2023-10-12 05:08:53,289 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=949344.6666666666, ans=0.0 2023-10-12 05:08:59,048 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=949344.6666666666, ans=0.125 2023-10-12 05:09:09,800 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.54 vs. limit=15.0 2023-10-12 05:09:25,018 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=949484.6666666666, ans=15.0 2023-10-12 05:09:44,504 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=949531.3333333334, ans=0.125 2023-10-12 05:09:58,028 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.50 vs. 
limit=15.0 2023-10-12 05:10:04,476 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=949624.6666666666, ans=0.95 2023-10-12 05:10:10,530 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=949671.3333333334, ans=0.0 2023-10-12 05:10:11,475 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=949671.3333333334, ans=0.0 2023-10-12 05:10:15,957 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=949671.3333333334, ans=0.125 2023-10-12 05:10:18,179 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=949671.3333333334, ans=0.125 2023-10-12 05:10:21,119 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.772e+02 1.899e+02 2.241e+02 2.900e+02, threshold=3.798e+02, percent-clipped=0.0 2023-10-12 05:10:36,729 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=949764.6666666666, ans=0.0 2023-10-12 05:10:56,236 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 05:11:12,795 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.91 vs. limit=15.0 2023-10-12 05:11:29,538 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=949951.3333333334, ans=0.2 2023-10-12 05:11:30,449 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=949951.3333333334, ans=0.1 2023-10-12 05:11:41,563 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=949998.0, ans=0.125 2023-10-12 05:11:46,482 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=950044.6666666666, ans=0.1 2023-10-12 05:11:50,560 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=950044.6666666666, ans=0.0 2023-10-12 05:11:52,463 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=950091.3333333334, ans=0.125 2023-10-12 05:12:04,001 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=950138.0, ans=0.125 2023-10-12 05:12:15,622 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=950184.6666666666, ans=0.125 2023-10-12 05:12:16,985 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 1.672e+02 1.825e+02 2.013e+02 2.874e+02, threshold=3.649e+02, percent-clipped=0.0 2023-10-12 05:12:21,641 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=950184.6666666666, ans=0.0 2023-10-12 05:12:24,056 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.48 vs. 
limit=10.0 2023-10-12 05:12:31,394 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.05 vs. limit=6.0 2023-10-12 05:12:46,386 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=950278.0, ans=0.125 2023-10-12 05:12:52,320 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=950324.6666666666, ans=0.0 2023-10-12 05:12:52,430 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, batch_count=950324.6666666666, ans=0.05 2023-10-12 05:13:08,728 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.62 vs. limit=12.0 2023-10-12 05:13:25,002 INFO [train.py:1031] (0/4) Epoch 15, batch 12500, loss[loss=0.1893, simple_loss=0.2775, pruned_loss=0.05057, over 16875.00 frames. ], tot_loss[loss=0.1957, simple_loss=0.286, pruned_loss=0.05272, over 32748498.17 frames. ], batch size: 130, lr: 2.30e-03, grad_scale: 8.0 2023-10-12 05:13:42,669 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=950511.3333333334, ans=0.125 2023-10-12 05:13:49,173 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=950558.0, ans=0.0 2023-10-12 05:13:51,887 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.75 vs. limit=15.0 2023-10-12 05:14:01,656 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=950604.6666666666, ans=0.0 2023-10-12 05:14:05,965 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=950604.6666666666, ans=0.125 2023-10-12 05:14:09,804 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.293e+02 1.658e+02 1.823e+02 2.079e+02 2.963e+02, threshold=3.645e+02, percent-clipped=0.0 2023-10-12 05:14:28,340 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=950698.0, ans=0.125 2023-10-12 05:14:32,149 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=950744.6666666666, ans=0.0 2023-10-12 05:14:37,451 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=950744.6666666666, ans=0.125 2023-10-12 05:15:10,375 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=950884.6666666666, ans=0.1 2023-10-12 05:15:21,554 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=950931.3333333334, ans=0.1 2023-10-12 05:15:48,511 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=951024.6666666666, ans=0.125 2023-10-12 05:15:55,436 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=951071.3333333334, ans=0.0 2023-10-12 05:15:55,451 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=951071.3333333334, ans=0.125 2023-10-12 05:15:57,533 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=951071.3333333334, ans=0.2 2023-10-12 05:16:04,290 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_na.min_abs, batch_count=951118.0, ans=0.02 2023-10-12 05:16:05,035 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.702e+02 1.864e+02 2.060e+02 3.066e+02, threshold=3.727e+02, percent-clipped=0.0 2023-10-12 05:16:08,159 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.73 vs. limit=22.5 2023-10-12 05:16:10,422 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.52 vs. limit=15.0 2023-10-12 05:16:18,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=951164.6666666666, ans=0.0 2023-10-12 05:17:02,534 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=951351.3333333334, ans=0.09899494936611666 2023-10-12 05:17:46,133 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 05:17:51,378 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=951584.6666666666, ans=0.0 2023-10-12 05:17:54,185 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.328e+02 1.738e+02 1.912e+02 2.127e+02 3.551e+02, threshold=3.824e+02, percent-clipped=0.0 2023-10-12 05:17:59,385 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=951584.6666666666, ans=0.2 2023-10-12 05:18:26,426 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=951724.6666666666, ans=0.125 2023-10-12 05:18:34,330 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.21 vs. limit=15.0 2023-10-12 05:18:36,533 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=951771.3333333334, ans=0.1 2023-10-12 05:18:40,909 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=951771.3333333334, ans=0.125 2023-10-12 05:18:44,811 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=951771.3333333334, ans=0.0 2023-10-12 05:18:58,158 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.91 vs. 
limit=15.0 2023-10-12 05:19:13,595 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=951911.3333333334, ans=0.125 2023-10-12 05:19:24,568 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=951958.0, ans=0.125 2023-10-12 05:19:31,209 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=952004.6666666666, ans=0.0 2023-10-12 05:19:34,262 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=952004.6666666666, ans=0.125 2023-10-12 05:19:36,110 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=952004.6666666666, ans=0.125 2023-10-12 05:19:37,112 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=952004.6666666666, ans=0.0 2023-10-12 05:19:39,002 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=952004.6666666666, ans=0.1 2023-10-12 05:19:42,599 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.382e+02 1.780e+02 1.992e+02 2.295e+02 3.597e+02, threshold=3.984e+02, percent-clipped=0.0 2023-10-12 05:19:50,640 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=952051.3333333334, ans=0.2 2023-10-12 05:20:02,175 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=952098.0, ans=0.0 2023-10-12 05:20:28,068 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.57 vs. limit=15.0 2023-10-12 05:21:13,330 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.00 vs. 
limit=15.0 2023-10-12 05:21:20,432 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=952471.3333333334, ans=0.0 2023-10-12 05:21:33,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=952518.0, ans=0.125 2023-10-12 05:21:33,976 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.728e+02 1.930e+02 2.211e+02 2.796e+02, threshold=3.861e+02, percent-clipped=0.0 2023-10-12 05:21:36,081 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=952518.0, ans=0.125 2023-10-12 05:21:36,979 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=952518.0, ans=0.2 2023-10-12 05:21:44,078 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=952564.6666666666, ans=0.0 2023-10-12 05:21:44,987 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=952564.6666666666, ans=0.125 2023-10-12 05:21:45,061 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=952564.6666666666, ans=0.0 2023-10-12 05:21:46,023 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=952564.6666666666, ans=0.07 2023-10-12 05:21:57,798 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=952611.3333333334, ans=0.125 2023-10-12 05:22:24,129 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=952704.6666666666, ans=0.125 2023-10-12 05:22:25,963 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=952751.3333333334, ans=0.1 2023-10-12 05:22:36,441 INFO [train.py:1031] (0/4) Epoch 15, batch 13000, loss[loss=0.2092, simple_loss=0.2951, pruned_loss=0.06162, over 16957.00 frames. ], tot_loss[loss=0.1962, simple_loss=0.2867, pruned_loss=0.05292, over 32762337.12 frames. 
], batch size: 156, lr: 2.30e-03, grad_scale: 16.0 2023-10-12 05:22:37,834 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=952798.0, ans=0.125 2023-10-12 05:22:47,346 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=952844.6666666666, ans=0.0 2023-10-12 05:22:57,694 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 05:23:05,518 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=952891.3333333334, ans=0.125 2023-10-12 05:23:23,290 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=952938.0, ans=0.0 2023-10-12 05:23:31,025 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.476e+02 1.699e+02 1.882e+02 2.096e+02 5.250e+02, threshold=3.765e+02, percent-clipped=1.0 2023-10-12 05:23:42,069 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=953031.3333333334, ans=0.125 2023-10-12 05:23:43,009 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=953031.3333333334, ans=0.2 2023-10-12 05:24:05,366 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=953124.6666666666, ans=0.0 2023-10-12 05:24:19,445 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=953171.3333333334, ans=0.125 2023-10-12 05:24:24,017 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.whiten.whitening_limit, batch_count=953171.3333333334, ans=12.0 2023-10-12 05:24:27,117 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.19 vs. limit=10.0 2023-10-12 05:24:49,911 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=953311.3333333334, ans=0.1 2023-10-12 05:24:56,154 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.88 vs. limit=12.0 2023-10-12 05:24:56,165 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.03 vs. limit=15.0 2023-10-12 05:24:59,304 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=953311.3333333334, ans=0.125 2023-10-12 05:25:04,054 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=953358.0, ans=0.0 2023-10-12 05:25:06,022 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.98 vs. 
limit=12.0 2023-10-12 05:25:11,834 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=953404.6666666666, ans=0.0 2023-10-12 05:25:13,045 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=953404.6666666666, ans=0.0 2023-10-12 05:25:17,907 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=953404.6666666666, ans=0.1 2023-10-12 05:25:26,457 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.390e+02 1.657e+02 1.796e+02 1.995e+02 2.607e+02, threshold=3.593e+02, percent-clipped=0.0 2023-10-12 05:25:26,905 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=953451.3333333334, ans=0.125 2023-10-12 05:26:06,843 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=953591.3333333334, ans=0.0 2023-10-12 05:26:19,491 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=953638.0, ans=0.1 2023-10-12 05:26:37,549 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=953731.3333333334, ans=0.0 2023-10-12 05:26:41,072 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=953731.3333333334, ans=0.0 2023-10-12 05:27:06,985 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=953824.6666666666, ans=0.1 2023-10-12 05:27:19,104 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.44 vs. limit=15.0 2023-10-12 05:27:31,085 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.326e+02 1.649e+02 1.804e+02 1.974e+02 2.803e+02, threshold=3.608e+02, percent-clipped=0.0 2023-10-12 05:27:33,286 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=953918.0, ans=0.2 2023-10-12 05:27:39,512 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 05:28:16,750 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=5.16 vs. limit=15.0 2023-10-12 05:28:18,876 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=954104.6666666666, ans=0.125 2023-10-12 05:28:25,722 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=954151.3333333334, ans=0.125 2023-10-12 05:28:26,827 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=954151.3333333334, ans=0.0 2023-10-12 05:28:27,287 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.89 vs. limit=15.0 2023-10-12 05:28:36,028 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.35 vs. 
2023-10-12 05:28:38,060 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.97 vs. limit=22.5
2023-10-12 05:28:56,509 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=954291.3333333334, ans=0.125
2023-10-12 05:29:13,674 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=954338.0, ans=0.1
2023-10-12 05:29:19,106 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=954384.6666666666, ans=0.0
2023-10-12 05:29:19,186 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=954384.6666666666, ans=0.0
2023-10-12 05:29:19,762 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.506e+02 1.811e+02 1.969e+02 2.169e+02 3.755e+02, threshold=3.937e+02, percent-clipped=1.0
2023-10-12 05:29:23,066 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=9.60 vs. limit=22.5
2023-10-12 05:29:38,496 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=954478.0, ans=0.1
2023-10-12 05:29:45,032 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=954478.0, ans=10.0
2023-10-12 05:29:59,190 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-12 05:30:02,866 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=954571.3333333334, ans=0.05
2023-10-12 05:30:14,249 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=954618.0, ans=0.125
2023-10-12 05:30:20,595 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=954618.0, ans=0.125
2023-10-12 05:30:27,429 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.99 vs. limit=15.0
2023-10-12 05:30:45,966 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-12 05:30:47,041 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=954758.0, ans=0.125
2023-10-12 05:31:09,934 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.696e+02 1.863e+02 2.056e+02 3.087e+02, threshold=3.725e+02, percent-clipped=0.0
2023-10-12 05:31:25,761 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=954898.0, ans=0.0
2023-10-12 05:31:44,128 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.59 vs. limit=5.0
2023-10-12 05:31:46,720 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=954991.3333333334, ans=15.0
2023-10-12 05:31:54,057 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=955038.0, ans=0.125
2023-10-12 05:31:55,037 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=955038.0, ans=0.1
2023-10-12 05:31:59,325 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=955038.0, ans=0.125
2023-10-12 05:32:07,997 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=955084.6666666666, ans=0.125
2023-10-12 05:32:12,654 INFO [train.py:1031] (0/4) Epoch 15, batch 13500, loss[loss=0.1901, simple_loss=0.2837, pruned_loss=0.04829, over 16908.00 frames. ], tot_loss[loss=0.1954, simple_loss=0.2858, pruned_loss=0.05249, over 32797599.59 frames. ], batch size: 116, lr: 2.30e-03, grad_scale: 32.0
2023-10-12 05:32:45,162 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=955271.3333333334, ans=0.0
2023-10-12 05:32:58,838 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.671e+02 1.841e+02 2.069e+02 3.350e+02, threshold=3.682e+02, percent-clipped=0.0
2023-10-12 05:33:03,231 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=955318.0, ans=0.1
2023-10-12 05:33:18,892 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=955411.3333333334, ans=0.125
2023-10-12 05:33:38,570 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.14 vs. limit=15.0
2023-10-12 05:34:00,530 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.24 vs. limit=15.0
2023-10-12 05:34:03,155 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=955598.0, ans=0.125
2023-10-12 05:34:19,019 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=955644.6666666666, ans=0.125
2023-10-12 05:34:24,500 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=955691.3333333334, ans=0.0
2023-10-12 05:34:33,125 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-10-12 05:34:34,901 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=955738.0, ans=0.1
2023-10-12 05:34:43,185 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=955784.6666666666, ans=0.125
2023-10-12 05:34:43,769 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.501e+02 1.705e+02 1.879e+02 2.090e+02 3.403e+02, threshold=3.758e+02, percent-clipped=0.0
2023-10-12 05:34:55,789 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/epoch-15.pt
2023-10-12 05:35:27,642 INFO [train.py:1031] (0/4) Epoch 16, batch 0, loss[loss=0.1629, simple_loss=0.2583, pruned_loss=0.03374, over 16636.00 frames. ], tot_loss[loss=0.1629, simple_loss=0.2583, pruned_loss=0.03374, over 16636.00 frames. ], batch size: 202, lr: 2.22e-03, grad_scale: 32.0
2023-10-12 05:35:27,643 INFO [train.py:1054] (0/4) Computing validation loss
2023-10-12 05:35:36,633 INFO [train.py:1063] (0/4) Epoch 16, validation: loss=0.2168, simple_loss=0.3041, pruned_loss=0.06475, over 1020973.00 frames.
2023-10-12 05:35:36,633 INFO [train.py:1064] (0/4) Maximum memory allocated so far is 17165MB
2023-10-12 05:35:40,956 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=7.98 vs. limit=15.0
2023-10-12 05:35:43,079 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=955854.6666666666, ans=0.07
2023-10-12 05:35:45,741 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=955854.6666666666, ans=0.125
2023-10-12 05:36:21,590 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=955994.6666666666, ans=0.2
2023-10-12 05:37:09,962 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=956228.0, ans=0.1
2023-10-12 05:37:16,763 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.710e+02 1.856e+02 2.150e+02 3.512e+02, threshold=3.712e+02, percent-clipped=0.0
2023-10-12 05:37:43,105 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.69 vs. limit=22.5
2023-10-12 05:37:50,481 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=6.27 vs. limit=15.0
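The checkpoint.py:75 and train.py:1054-1064 lines above mark the epoch boundary: epoch 15's state is written to zipformer/exp_XL_bpe/epoch-15.pt, then epoch 16 opens at batch 0 with a validation pass whose loss (0.2168) is reported over the full 1,020,973-frame dev set. A minimal sketch of that sequence, assuming standard torch serialization; the helper names and the model's (loss, frames) return value are assumptions, not the recipe's actual interfaces:

    import torch

    def save_epoch_checkpoint(model, optimizer, epoch, exp_dir):
        # writes e.g. zipformer/exp_XL_bpe/epoch-15.pt as in the log
        torch.save(
            {"model": model.state_dict(),
             "optimizer": optimizer.state_dict(),
             "epoch": epoch},
            f"{exp_dir}/epoch-{epoch}.pt",
        )

    @torch.no_grad()
    def compute_validation_loss(model, valid_loader):
        model.eval()
        loss_sum, frames = 0.0, 0.0
        for batch in valid_loader:
            loss, num_frames = model(batch)   # assumed (loss, frames) interface
            loss_sum += loss.item() * num_frames
            frames += num_frames
        model.train()
        return loss_sum / frames              # frame-weighted, cf. "over 1020973.00 frames"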
2023-10-12 05:37:56,157 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=956414.6666666666, ans=0.125
2023-10-12 05:38:03,779 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=956461.3333333334, ans=0.1
2023-10-12 05:38:29,867 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=956554.6666666666, ans=0.1
2023-10-12 05:38:41,711 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=956601.3333333334, ans=0.125
2023-10-12 05:38:43,612 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=956601.3333333334, ans=0.2
2023-10-12 05:38:47,327 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=956648.0, ans=0.1
2023-10-12 05:39:01,906 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=956694.6666666666, ans=0.125
2023-10-12 05:39:06,077 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.410e+02 1.679e+02 1.799e+02 2.010e+02 3.014e+02, threshold=3.598e+02, percent-clipped=0.0
2023-10-12 05:39:13,736 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=956741.3333333334, ans=0.1
2023-10-12 05:39:21,687 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=956788.0, ans=0.0
2023-10-12 05:39:36,906 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=956834.6666666666, ans=0.125
2023-10-12 05:39:37,093 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.44 vs. limit=15.0
2023-10-12 05:39:52,437 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.27 vs. limit=6.0
2023-10-12 05:39:55,917 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=956928.0, ans=0.125
2023-10-12 05:40:17,753 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=957021.3333333334, ans=0.125
2023-10-12 05:40:34,733 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=957068.0, ans=0.125
2023-10-12 05:40:38,755 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=957114.6666666666, ans=0.0
2023-10-12 05:40:38,875 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=957114.6666666666, ans=0.1
2023-10-12 05:40:44,942 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=957114.6666666666, ans=0.04949747468305833
2023-10-12 05:40:57,037 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.496e+02 1.751e+02 1.952e+02 2.208e+02 3.252e+02, threshold=3.903e+02, percent-clipped=0.0
2023-10-12 05:41:00,283 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=957208.0, ans=0.0
2023-10-12 05:41:00,443 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=957208.0, ans=0.125
2023-10-12 05:41:10,567 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=957254.6666666666, ans=0.1
2023-10-12 05:41:15,762 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=957254.6666666666, ans=0.125
2023-10-12 05:41:25,781 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0
2023-10-12 05:42:20,904 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=957534.6666666666, ans=0.0
2023-10-12 05:42:45,736 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.446e+02 1.745e+02 1.983e+02 2.279e+02 3.520e+02, threshold=3.966e+02, percent-clipped=0.0
2023-10-12 05:43:00,539 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.48 vs. limit=15.0
2023-10-12 05:43:18,406 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=957768.0, ans=0.125
2023-10-12 05:43:18,531 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=957768.0, ans=0.125
2023-10-12 05:43:49,969 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=957908.0, ans=0.1
2023-10-12 05:43:52,277 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=957908.0, ans=0.2
2023-10-12 05:43:58,160 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=957954.6666666666, ans=0.125
2023-10-12 05:44:06,448 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.81 vs. limit=12.0
2023-10-12 05:44:08,310 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-10-12 05:44:09,214 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=958001.3333333334, ans=0.0
2023-10-12 05:44:18,560 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=958001.3333333334, ans=0.125
2023-10-12 05:44:39,625 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.99 vs. limit=6.0
2023-10-12 05:44:42,608 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.405e+02 1.718e+02 1.856e+02 2.058e+02 2.918e+02, threshold=3.712e+02, percent-clipped=0.0
2023-10-12 05:44:51,070 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=958141.3333333334, ans=0.1
2023-10-12 05:44:55,226 INFO [train.py:1031] (0/4) Epoch 16, batch 500, loss[loss=0.216, simple_loss=0.2834, pruned_loss=0.07432, over 15635.00 frames. ], tot_loss[loss=0.1942, simple_loss=0.2845, pruned_loss=0.05197, over 7265146.71 frames. ], batch size: 350, lr: 2.22e-03, grad_scale: 32.0
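Each train.py:1031 entry reports three figures twice: loss[...] for the current batch and tot_loss[...] as a running average for the epoch. Throughout this excerpt the combined loss is consistent with a pruned-transducer objective in which the simple (trivial-joiner) loss enters at half weight, loss = 0.5 * simple_loss + pruned_loss; for batch 500 above, 0.5 * 0.2845 + 0.05197 = 0.1942, matching the logged tot_loss to the printed precision. (The 0.5 weight is inferred here from the printed numbers themselves.)

    # The identity holds for every loss line in this excerpt, e.g.:
    for simple, pruned, combined in [(0.2845, 0.05197, 0.1942),
                                     (0.2583, 0.03374, 0.1629),
                                     (0.3041, 0.06475, 0.2168)]:
        assert abs(0.5 * simple + pruned - combined) < 5e-4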
2023-10-12 05:45:02,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=958188.0, ans=0.2
2023-10-12 05:46:03,799 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=958468.0, ans=0.125
2023-10-12 05:46:03,802 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=958468.0, ans=0.2
2023-10-12 05:46:18,031 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=958514.6666666666, ans=0.125
2023-10-12 05:46:34,071 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.771e+02 1.982e+02 2.231e+02 2.910e+02, threshold=3.964e+02, percent-clipped=0.0
2023-10-12 05:46:35,679 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=958608.0, ans=0.125
2023-10-12 05:46:54,351 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=958654.6666666666, ans=0.125
2023-10-12 05:46:54,424 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.max_abs, batch_count=958654.6666666666, ans=10.0
2023-10-12 05:47:04,414 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=958701.3333333334, ans=0.125
2023-10-12 05:47:06,283 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-12 05:47:23,287 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=958794.6666666666, ans=0.0
2023-10-12 05:47:54,126 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=958934.6666666666, ans=0.0
2023-10-12 05:48:01,997 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=958934.6666666666, ans=0.0
2023-10-12 05:48:07,273 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-10-12 05:48:24,259 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=959028.0, ans=0.125
2023-10-12 05:48:25,376 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-12 05:48:28,883 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.451e+02 1.796e+02 2.036e+02 2.346e+02 3.753e+02, threshold=4.071e+02, percent-clipped=0.0
2023-10-12 05:48:34,710 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=959074.6666666666, ans=0.0
2023-10-12 05:49:00,457 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=959214.6666666666, ans=0.0
2023-10-12 05:49:15,888 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=959261.3333333334, ans=0.125
2023-10-12 05:49:15,985 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=959261.3333333334, ans=0.0
2023-10-12 05:49:17,174 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=959261.3333333334, ans=0.125
2023-10-12 05:49:37,972 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=959354.6666666666, ans=0.125
2023-10-12 05:49:56,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=959401.3333333334, ans=0.0
2023-10-12 05:49:56,660 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=959401.3333333334, ans=0.0
2023-10-12 05:49:59,306 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=959448.0, ans=0.1
2023-10-12 05:50:16,383 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=959494.6666666666, ans=0.125
2023-10-12 05:50:17,173 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=959494.6666666666, ans=0.0
2023-10-12 05:50:20,897 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.330e+02 1.755e+02 1.970e+02 2.252e+02 3.215e+02, threshold=3.940e+02, percent-clipped=0.0
2023-10-12 05:50:21,304 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=959541.3333333334, ans=0.0
2023-10-12 05:50:36,256 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=959588.0, ans=0.1
2023-10-12 05:50:36,523 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=11.89 vs. limit=22.5
2023-10-12 05:50:51,172 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=959634.6666666666, ans=0.2
2023-10-12 05:51:05,323 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.72 vs. limit=15.0
2023-10-12 05:51:10,802 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-10-12 05:51:12,192 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=959728.0, ans=0.125
2023-10-12 05:51:32,880 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.83 vs. limit=15.0
2023-10-12 05:51:40,710 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.31 vs. limit=15.0
2023-10-12 05:51:49,624 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=959868.0, ans=0.0
2023-10-12 05:51:58,466 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.82 vs. limit=22.5
2023-10-12 05:52:16,152 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.446e+02 1.754e+02 1.921e+02 2.186e+02 3.689e+02, threshold=3.842e+02, percent-clipped=0.0
2023-10-12 05:52:26,821 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=960054.6666666666, ans=0.0
2023-10-12 05:52:46,226 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.28 vs. limit=6.0
2023-10-12 05:52:51,479 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=960148.0, ans=0.1
2023-10-12 05:52:58,833 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=960194.6666666666, ans=0.125
2023-10-12 05:53:07,365 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.85 vs. limit=15.0
2023-10-12 05:53:39,083 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=960334.6666666666, ans=0.125
2023-10-12 05:53:41,232 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=960334.6666666666, ans=0.125
2023-10-12 05:54:00,505 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.85 vs. limit=6.0
2023-10-12 05:54:02,933 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=960428.0, ans=0.2
2023-10-12 05:54:02,992 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=960428.0, ans=0.125
2023-10-12 05:54:03,989 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=960428.0, ans=0.125
2023-10-12 05:54:08,740 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.285e+02 1.722e+02 1.867e+02 2.124e+02 3.433e+02, threshold=3.733e+02, percent-clipped=0.0
2023-10-12 05:54:14,359 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=960474.6666666666, ans=0.0
2023-10-12 05:54:18,378 INFO [train.py:1031] (0/4) Epoch 16, batch 1000, loss[loss=0.1897, simple_loss=0.2896, pruned_loss=0.04487, over 16931.00 frames. ], tot_loss[loss=0.1958, simple_loss=0.2861, pruned_loss=0.05282, over 12924596.26 frames. ], batch size: 93, lr: 2.21e-03, grad_scale: 16.0
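tot_loss is reported "over N frames" with N growing through the epoch (7,265,146.71 frames at batch 500, 12,924,596.26 at batch 1000 above), i.e. it is a frame-weighted running average rather than a per-batch number. The fractional cumulative frame counts suggest that older batches are gradually down-weighted rather than counted in full. A sketch of such bookkeeping; the decay constant is an illustrative assumption:

    class DecayedFrameAverage:
        # Frame-weighted running average with exponential forgetting
        # (illustrative; the decay value is assumed, not taken from the recipe).
        def __init__(self, decay=0.999):
            self.decay = decay
            self.loss_sum = 0.0
            self.frames = 0.0

        def update(self, batch_loss, batch_frames):
            self.loss_sum = self.decay * self.loss_sum + batch_loss * batch_frames
            self.frames = self.decay * self.frames + batch_frames

        @property
        def value(self):
            return self.loss_sum / max(self.frames, 1.0)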
2023-10-12 05:54:18,595 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=960521.3333333334, ans=0.125
2023-10-12 05:54:29,049 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=960568.0, ans=0.0
2023-10-12 05:54:46,322 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=960614.6666666666, ans=0.125
2023-10-12 05:55:23,674 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=960801.3333333334, ans=0.0
2023-10-12 05:55:29,284 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=5.65 vs. limit=15.0
2023-10-12 05:55:37,142 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=960848.0, ans=0.5
2023-10-12 05:55:43,454 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=960894.6666666666, ans=0.125
2023-10-12 05:55:53,589 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.399e+02 1.660e+02 1.854e+02 2.039e+02 2.957e+02, threshold=3.708e+02, percent-clipped=0.0
2023-10-12 05:55:57,375 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=960941.3333333334, ans=0.0
2023-10-12 05:56:05,090 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=960988.0, ans=0.1
2023-10-12 05:56:09,398 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=960988.0, ans=0.0
2023-10-12 05:56:19,483 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=961034.6666666666, ans=0.125
2023-10-12 05:56:24,225 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=961034.6666666666, ans=0.125
2023-10-12 05:56:39,338 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.35 vs. limit=22.5
2023-10-12 05:56:43,447 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=961128.0, ans=0.125
2023-10-12 05:56:48,128 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=961128.0, ans=0.2
2023-10-12 05:57:31,581 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=961268.0, ans=0.125
2023-10-12 05:57:44,329 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.24 vs. limit=15.0
2023-10-12 05:57:48,018 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=961361.3333333334, ans=0.125
2023-10-12 05:57:50,049 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.10 vs. limit=15.0
2023-10-12 05:57:51,516 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=961361.3333333334, ans=0.015
2023-10-12 05:57:57,984 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.323e+02 1.715e+02 1.883e+02 2.074e+02 2.784e+02, threshold=3.767e+02, percent-clipped=0.0
2023-10-12 05:58:03,358 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=961408.0, ans=0.0
2023-10-12 05:58:11,560 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-12 05:58:15,269 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=961454.6666666666, ans=0.1
2023-10-12 05:58:33,445 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=961548.0, ans=0.1
2023-10-12 05:58:37,420 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=961548.0, ans=0.125
2023-10-12 05:58:41,992 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=961594.6666666666, ans=0.05
2023-10-12 05:58:47,900 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=961594.6666666666, ans=0.125
2023-10-12 05:58:54,076 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=961641.3333333334, ans=0.1
2023-10-12 05:58:57,367 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=961641.3333333334, ans=0.0
2023-10-12 05:59:02,378 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.71 vs. limit=12.0
2023-10-12 05:59:05,043 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=961688.0, ans=0.125
2023-10-12 05:59:23,481 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.23 vs. limit=12.0
2023-10-12 05:59:28,635 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=961781.3333333334, ans=0.0
2023-10-12 05:59:33,236 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.32 vs. limit=15.0
2023-10-12 05:59:39,086 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff2.min_abs, batch_count=961828.0, ans=0.1
2023-10-12 05:59:41,927 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-12 05:59:44,392 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.340e+02 1.719e+02 1.920e+02 2.178e+02 2.949e+02, threshold=3.841e+02, percent-clipped=0.0
2023-10-12 06:00:03,348 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=961921.3333333334, ans=0.1
2023-10-12 06:00:10,955 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.82 vs. limit=15.0
2023-10-12 06:00:30,617 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-12 06:00:31,737 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=962061.3333333334, ans=0.0
2023-10-12 06:00:39,645 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=962108.0, ans=0.0
2023-10-12 06:00:50,128 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=962154.6666666666, ans=0.125
2023-10-12 06:00:53,808 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=962154.6666666666, ans=0.0
2023-10-12 06:00:55,801 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=962154.6666666666, ans=0.125
2023-10-12 06:00:58,202 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=962201.3333333334, ans=0.025
2023-10-12 06:01:05,607 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=962201.3333333334, ans=0.035
2023-10-12 06:01:20,532 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=962294.6666666666, ans=0.125
2023-10-12 06:01:33,506 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.305e+02 1.691e+02 1.874e+02 2.047e+02 3.026e+02, threshold=3.748e+02, percent-clipped=0.0
2023-10-12 06:01:40,004 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=962341.3333333334, ans=0.0
2023-10-12 06:01:41,984 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=962388.0, ans=0.0
2023-10-12 06:01:43,773 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=962388.0, ans=0.2
2023-10-12 06:01:47,740 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=962388.0, ans=0.1
2023-10-12 06:02:00,536 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=962434.6666666666, ans=0.1
2023-10-12 06:02:09,863 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=962481.3333333334, ans=0.1
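The scaling.py:199 lines that dominate this log each print a ScheduledFloat: a scalar hyperparameter (a dropout p, a skip rate, a balancer probability, a whitening limit) whose current value, ans, is looked up from batch_count. Late in training most schedules have flattened out, which is why the same names keep printing the same values (attention_skip_rate at 0.0, out_proj.dropout_p at 0.1, balancer probs at 0.125). A minimal piecewise-linear sketch of such a schedule; the breakpoints shown are invented for illustration, only the batch_count-to-value idea is taken from the log:

    import bisect

    class ScheduledFloatSketch:
        # Piecewise-linear scalar schedule keyed on batch_count (illustrative).
        def __init__(self, *points):
            # points: (batch_count, value) pairs, e.g. (0.0, 0.2), (4000.0, 0.0)
            self.points = sorted(points)

        def value(self, batch_count):
            xs = [x for x, _ in self.points]
            i = bisect.bisect_right(xs, batch_count)
            if i == 0:
                return self.points[0][1]
            if i == len(self.points):
                return self.points[-1][1]
            (x0, y0), (x1, y1) = self.points[i - 1], self.points[i]
            return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

    # e.g. a skip rate that anneals from 0.2 to 0.0 and reads 0.0 late in
    # training, as the attention_skip_rate lines above do:
    skip_rate = ScheduledFloatSketch((0.0, 0.2), (4000.0, 0.0))
    print(skip_rate.value(962481.3))  # -> 0.0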
2023-10-12 06:02:28,894 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.42 vs. limit=22.5
2023-10-12 06:02:35,814 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=962574.6666666666, ans=0.125
2023-10-12 06:02:48,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=962621.3333333334, ans=0.125
2023-10-12 06:03:10,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=962714.6666666666, ans=0.1
2023-10-12 06:03:13,910 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=962761.3333333334, ans=0.1
2023-10-12 06:03:25,782 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.347e+02 1.747e+02 1.954e+02 2.196e+02 3.606e+02, threshold=3.907e+02, percent-clipped=0.0
2023-10-12 06:03:25,976 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=962808.0, ans=0.0
2023-10-12 06:03:33,822 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-10-12 06:03:37,666 INFO [train.py:1031] (0/4) Epoch 16, batch 1500, loss[loss=0.198, simple_loss=0.2895, pruned_loss=0.05329, over 17028.00 frames. ], tot_loss[loss=0.1947, simple_loss=0.2851, pruned_loss=0.05215, over 17368796.94 frames. ], batch size: 117, lr: 2.21e-03, grad_scale: 16.0
2023-10-12 06:03:52,163 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=962901.3333333334, ans=0.2
2023-10-12 06:03:58,295 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=962901.3333333334, ans=0.0
2023-10-12 06:04:01,932 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.79 vs. limit=22.5
2023-10-12 06:04:34,428 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=963088.0, ans=0.0
2023-10-12 06:04:35,625 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.03 vs. limit=15.0
2023-10-12 06:04:57,450 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-10-12 06:05:21,855 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.401e+02 1.704e+02 1.880e+02 2.121e+02 3.433e+02, threshold=3.760e+02, percent-clipped=0.0
2023-10-12 06:05:52,753 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff2.min_abs, batch_count=963368.0, ans=0.1
2023-10-12 06:06:03,249 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=963414.6666666666, ans=0.125
2023-10-12 06:06:20,463 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=963461.3333333334, ans=0.125
2023-10-12 06:06:38,543 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.74 vs. limit=15.0
2023-10-12 06:07:00,168 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=963648.0, ans=0.0
2023-10-12 06:07:01,372 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.27 vs. limit=22.5
2023-10-12 06:07:21,965 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=963741.3333333334, ans=0.125
2023-10-12 06:07:22,869 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=963741.3333333334, ans=0.125
2023-10-12 06:07:24,249 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.82 vs. limit=22.5
2023-10-12 06:07:24,480 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.422e+02 1.668e+02 1.813e+02 2.065e+02 2.663e+02, threshold=3.626e+02, percent-clipped=0.0
2023-10-12 06:07:25,106 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.34 vs. limit=15.0
2023-10-12 06:07:28,611 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=963741.3333333334, ans=0.1
2023-10-12 06:07:45,675 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=963834.6666666666, ans=0.125
2023-10-12 06:07:52,741 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-12 06:07:56,890 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.83 vs. limit=22.5
2023-10-12 06:07:57,319 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-12 06:08:25,975 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=964021.3333333334, ans=0.0
2023-10-12 06:08:38,940 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=964068.0, ans=10.0
2023-10-12 06:08:43,043 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=964068.0, ans=0.2
2023-10-12 06:08:46,398 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=964068.0, ans=0.1
2023-10-12 06:08:57,803 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=964114.6666666666, ans=0.95
2023-10-12 06:09:13,155 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=964161.3333333334, ans=0.2
2023-10-12 06:09:17,507 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.478e+02 1.749e+02 1.885e+02 2.087e+02 2.638e+02, threshold=3.771e+02, percent-clipped=0.0
2023-10-12 06:09:19,818 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=964208.0, ans=0.125
2023-10-12 06:10:03,851 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=964394.6666666666, ans=0.125
2023-10-12 06:10:14,816 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=964441.3333333334, ans=0.5
2023-10-12 06:10:18,148 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=964441.3333333334, ans=0.1
2023-10-12 06:10:20,726 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=964488.0, ans=0.125
2023-10-12 06:10:25,658 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=964488.0, ans=0.0
2023-10-12 06:10:33,892 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=964534.6666666666, ans=0.125
2023-10-12 06:10:34,806 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=964534.6666666666, ans=0.1
2023-10-12 06:10:39,200 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=964534.6666666666, ans=0.0
2023-10-12 06:10:45,405 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=964581.3333333334, ans=0.125
2023-10-12 06:11:00,710 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=964628.0, ans=0.0
2023-10-12 06:11:08,092 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.682e+02 1.824e+02 1.948e+02 2.360e+02, threshold=3.648e+02, percent-clipped=0.0
2023-10-12 06:11:24,619 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=964721.3333333334, ans=0.04949747468305833
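The scaling.py:979 Whitening lines are diagnostics from modules that push activation covariances toward white (isotropic): each names the module, the group/channel layout of the statistic, and the measured metric alongside its allowed limit. One plausible anisotropy metric of this flavour, computed from the eigenvalues of the per-group channel covariance; this is a guess at the general shape of the statistic, not icefall's exact formula:

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
        # Anisotropy of the per-group channel covariance (illustrative).
        # Returns ~1.0 for perfectly white features and grows as the
        # covariance spectrum becomes lopsided.
        x = x.reshape(-1, x.shape[-1])          # (frames, channels)
        chans = x.shape[1] // num_groups
        metrics = []
        for g in range(num_groups):
            feats = x[:, g * chans:(g + 1) * chans]
            feats = feats - feats.mean(dim=0)
            cov = feats.t() @ feats / feats.shape[0]
            eigs = torch.linalg.eigvalsh(cov)
            # ratio of mean squared eigenvalue to squared mean eigenvalue
            metrics.append((eigs ** 2).mean() / (eigs.mean() ** 2 + 1e-20))
        return float(torch.stack(metrics).mean())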
2023-10-12 06:11:43,985 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=964814.6666666666, ans=0.1
2023-10-12 06:11:50,208 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=964861.3333333334, ans=0.09899494936611666
2023-10-12 06:11:57,407 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=964861.3333333334, ans=0.1
2023-10-12 06:12:03,910 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.14 vs. limit=22.5
2023-10-12 06:12:32,625 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=965001.3333333334, ans=0.1
2023-10-12 06:12:44,636 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.51 vs. limit=12.0
2023-10-12 06:12:51,461 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.83 vs. limit=15.0
2023-10-12 06:12:58,874 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=965094.6666666666, ans=0.125
2023-10-12 06:13:11,059 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.376e+02 1.657e+02 1.830e+02 2.007e+02 2.715e+02, threshold=3.661e+02, percent-clipped=0.0
2023-10-12 06:13:13,934 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=965141.3333333334, ans=0.125
2023-10-12 06:13:17,456 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten.whitening_limit, batch_count=965141.3333333334, ans=15.0
2023-10-12 06:13:22,115 INFO [train.py:1031] (0/4) Epoch 16, batch 2000, loss[loss=0.2024, simple_loss=0.2965, pruned_loss=0.05414, over 16795.00 frames. ], tot_loss[loss=0.1954, simple_loss=0.2858, pruned_loss=0.0525, over 20763798.85 frames. ], batch size: 175, lr: 2.21e-03, grad_scale: 32.0
2023-10-12 06:13:22,949 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=965188.0, ans=0.05
2023-10-12 06:13:24,088 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=965188.0, ans=0.0
2023-10-12 06:13:55,327 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=965281.3333333334, ans=0.125
2023-10-12 06:14:03,850 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=965328.0, ans=0.125
2023-10-12 06:14:13,975 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=965328.0, ans=15.0
2023-10-12 06:14:25,005 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.98 vs. limit=15.0
2023-10-12 06:14:26,590 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=965421.3333333334, ans=0.0
2023-10-12 06:14:40,356 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=965468.0, ans=0.1
2023-10-12 06:14:57,161 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=965514.6666666666, ans=0.125
2023-10-12 06:15:06,769 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.87 vs. limit=15.0
2023-10-12 06:15:07,606 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.04 vs. limit=22.5
2023-10-12 06:15:13,230 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=965608.0, ans=0.125
2023-10-12 06:15:14,270 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=965608.0, ans=0.125
2023-10-12 06:15:14,926 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.374e+02 1.732e+02 1.944e+02 2.279e+02 3.575e+02, threshold=3.888e+02, percent-clipped=0.0
2023-10-12 06:15:18,206 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=965608.0, ans=0.125
2023-10-12 06:16:15,932 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=965748.0, ans=0.2
2023-10-12 06:16:16,047 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=965748.0, ans=0.125
2023-10-12 06:16:16,058 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=965748.0, ans=0.07
2023-10-12 06:16:23,047 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=965794.6666666666, ans=0.0
2023-10-12 06:16:27,078 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=965794.6666666666, ans=0.125
2023-10-12 06:16:37,057 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-12 06:16:49,788 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=965888.0, ans=0.125
2023-10-12 06:16:57,656 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=965934.6666666666, ans=0.125
2023-10-12 06:17:06,887 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=965934.6666666666, ans=10.0
2023-10-12 06:17:11,697 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=965981.3333333334, ans=0.1
2023-10-12 06:17:16,825 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=965981.3333333334, ans=0.0
2023-10-12 06:17:37,154 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.813e+02 1.953e+02 2.190e+02 3.018e+02, threshold=3.906e+02, percent-clipped=0.0
2023-10-12 06:17:45,196 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.30 vs. limit=15.0
2023-10-12 06:18:09,961 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=966214.6666666666, ans=0.05
2023-10-12 06:18:17,001 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=966214.6666666666, ans=0.05
2023-10-12 06:18:18,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=966214.6666666666, ans=0.04949747468305833
2023-10-12 06:18:20,657 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.92 vs. limit=22.5
2023-10-12 06:18:24,656 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=966261.3333333334, ans=0.0
2023-10-12 06:18:46,372 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=966354.6666666666, ans=0.1
2023-10-12 06:18:58,315 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=966401.3333333334, ans=0.125
2023-10-12 06:19:12,300 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=966448.0, ans=0.125
2023-10-12 06:19:30,981 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.451e+02 1.721e+02 1.921e+02 2.277e+02 2.970e+02, threshold=3.843e+02, percent-clipped=0.0
2023-10-12 06:19:35,252 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=966541.3333333334, ans=0.125
2023-10-12 06:19:39,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=966588.0, ans=0.2
2023-10-12 06:20:09,552 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=966681.3333333334, ans=0.0
2023-10-12 06:20:21,821 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-10-12 06:20:25,519 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=966774.6666666666, ans=0.1
2023-10-12 06:20:27,149 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=966774.6666666666, ans=0.125
2023-10-12 06:20:28,697 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=966774.6666666666, ans=0.125
2023-10-12 06:20:37,812 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=966821.3333333334, ans=0.0
2023-10-12 06:20:55,238 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=966868.0, ans=0.125
2023-10-12 06:21:20,037 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=967008.0, ans=0.125
2023-10-12 06:21:22,867 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.329e+02 1.792e+02 1.915e+02 2.154e+02 3.344e+02, threshold=3.830e+02, percent-clipped=0.0
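The grad_scale figure in the epoch headers (16.0 and 32.0 in this excerpt) is the dynamic loss scale of fp16 training: it is raised after a long run of overflow-free steps and halved whenever gradients overflow, which is why it drifts between powers of two across the log. A standard torch.cuda.amp pattern that yields this behaviour; the init_scale shown is an assumption, while the growth/backoff values are torch defaults:

    import torch

    scaler = torch.cuda.amp.GradScaler(
        init_scale=16.0,      # assumed starting point; the log shows 16.0/32.0
        growth_factor=2.0,    # doubled after growth_interval clean steps
        backoff_factor=0.5,   # halved when inf/nan gradients are found
        growth_interval=2000,
    )

    def training_step(model, batch, optimizer):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = model(batch)            # assumed: model returns the loss
        scaler.scale(loss).backward()
        scaler.step(optimizer)             # skips the step on overflow
        scaler.update()                    # adjusts the scale, cf. grad_scale
        return loss.detach()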
2023-10-12 06:21:25,093 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=967008.0, ans=0.125
2023-10-12 06:21:30,165 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=967054.6666666666, ans=0.05
2023-10-12 06:21:31,719 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=967054.6666666666, ans=0.125
2023-10-12 06:21:33,652 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=967054.6666666666, ans=0.125
2023-10-12 06:21:48,777 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=967101.3333333334, ans=0.125
2023-10-12 06:21:54,720 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=967148.0, ans=0.125
2023-10-12 06:21:54,783 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2023-10-12 06:21:59,048 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.09 vs. limit=22.5
2023-10-12 06:22:06,483 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=967194.6666666666, ans=0.0
2023-10-12 06:22:09,222 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=967194.6666666666, ans=0.125
2023-10-12 06:22:10,155 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=967194.6666666666, ans=0.125
2023-10-12 06:22:11,949 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=967194.6666666666, ans=0.1
2023-10-12 06:22:20,304 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.78 vs. limit=10.0
2023-10-12 06:22:44,320 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=967334.6666666666, ans=0.0
2023-10-12 06:22:46,147 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=967381.3333333334, ans=0.1
2023-10-12 06:23:04,282 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=967428.0, ans=0.125
2023-10-12 06:23:12,045 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.743e+02 1.963e+02 2.200e+02 2.708e+02, threshold=3.927e+02, percent-clipped=0.0
2023-10-12 06:23:12,426 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=967474.6666666666, ans=0.125
2023-10-12 06:23:16,292 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=967474.6666666666, ans=0.125
2023-10-12 06:23:18,919 INFO [train.py:1031] (0/4) Epoch 16, batch 2500, loss[loss=0.1991, simple_loss=0.2928, pruned_loss=0.0527, over 16915.00 frames. ], tot_loss[loss=0.196, simple_loss=0.2862, pruned_loss=0.05287, over 23430358.15 frames. ], batch size: 138, lr: 2.21e-03, grad_scale: 32.0
2023-10-12 06:23:22,794 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=967521.3333333334, ans=0.1
2023-10-12 06:23:27,122 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=967521.3333333334, ans=0.125
2023-10-12 06:23:37,618 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.04 vs. limit=22.5
2023-10-12 06:23:39,126 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=967614.6666666666, ans=0.05
2023-10-12 06:24:03,029 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten.whitening_limit, batch_count=967708.0, ans=15.0
2023-10-12 06:24:23,947 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-10-12 06:24:33,667 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=967848.0, ans=0.035
2023-10-12 06:24:39,296 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=967848.0, ans=0.025
2023-10-12 06:24:40,139 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=967848.0, ans=0.125
2023-10-12 06:24:50,426 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=967894.6666666666, ans=0.125
2023-10-12 06:24:58,420 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.535e+02 1.731e+02 1.875e+02 2.109e+02 3.032e+02, threshold=3.750e+02, percent-clipped=0.0
2023-10-12 06:25:04,039 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.24 vs. limit=15.0
2023-10-12 06:25:11,082 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.41 vs. limit=15.0
2023-10-12 06:25:19,826 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=968034.6666666666, ans=0.125
2023-10-12 06:25:27,823 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=968034.6666666666, ans=0.0
2023-10-12 06:25:58,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=968174.6666666666, ans=0.125
2023-10-12 06:26:00,626 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=968174.6666666666, ans=0.0
2023-10-12 06:26:03,862 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=968221.3333333334, ans=0.1
2023-10-12 06:26:08,391 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=968221.3333333334, ans=0.1
2023-10-12 06:26:08,756 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.62 vs. limit=22.5
2023-10-12 06:26:12,231 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=968221.3333333334, ans=0.0
2023-10-12 06:26:28,957 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=968314.6666666666, ans=0.125
2023-10-12 06:26:43,912 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.74 vs. limit=12.0
2023-10-12 06:26:49,883 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.455e+02 1.667e+02 1.847e+02 2.043e+02 3.019e+02, threshold=3.694e+02, percent-clipped=0.0
2023-10-12 06:27:21,430 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=968501.3333333334, ans=0.04949747468305833
2023-10-12 06:27:46,321 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=968641.3333333334, ans=0.125
2023-10-12 06:28:07,873 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=968688.0, ans=0.2
2023-10-12 06:28:21,482 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.96 vs.
limit=22.5 2023-10-12 06:28:29,630 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=968781.3333333334, ans=0.125 2023-10-12 06:28:33,344 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=968781.3333333334, ans=0.125 2023-10-12 06:28:38,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=968828.0, ans=0.125 2023-10-12 06:28:54,371 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.664e+02 1.825e+02 2.112e+02 4.311e+02, threshold=3.651e+02, percent-clipped=1.0 2023-10-12 06:29:03,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=968921.3333333334, ans=0.1 2023-10-12 06:29:46,063 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=969061.3333333334, ans=0.07 2023-10-12 06:29:51,894 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.23 vs. limit=22.5 2023-10-12 06:29:53,717 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.71 vs. limit=15.0 2023-10-12 06:30:04,423 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=969154.6666666666, ans=0.0 2023-10-12 06:30:25,210 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=969201.3333333334, ans=0.0 2023-10-12 06:30:31,597 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_abs, batch_count=969248.0, ans=0.5 2023-10-12 06:31:01,049 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.677e+02 1.872e+02 2.144e+02 2.805e+02, threshold=3.743e+02, percent-clipped=0.0 2023-10-12 06:31:05,223 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=969341.3333333334, ans=0.0 2023-10-12 06:31:17,940 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=969388.0, ans=0.125 2023-10-12 06:31:19,948 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=969434.6666666666, ans=0.04949747468305833 2023-10-12 06:31:25,329 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=969434.6666666666, ans=0.1 2023-10-12 06:31:29,274 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=969434.6666666666, ans=0.0 2023-10-12 06:32:05,719 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=969621.3333333334, ans=0.0 2023-10-12 06:32:07,393 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=969621.3333333334, ans=0.2 2023-10-12 06:32:15,610 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=969668.0, ans=0.125 2023-10-12 06:32:33,939 INFO [scaling.py:979] (0/4) 
Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.97 vs. limit=22.5 2023-10-12 06:32:41,764 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=969761.3333333334, ans=0.125 2023-10-12 06:32:52,594 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.342e+02 1.654e+02 1.837e+02 2.020e+02 2.615e+02, threshold=3.675e+02, percent-clipped=0.0 2023-10-12 06:32:56,490 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 06:32:59,753 INFO [train.py:1031] (0/4) Epoch 16, batch 3000, loss[loss=0.1984, simple_loss=0.2851, pruned_loss=0.05589, over 16885.00 frames. ], tot_loss[loss=0.1954, simple_loss=0.2853, pruned_loss=0.05274, over 25493193.72 frames. ], batch size: 72, lr: 2.20e-03, grad_scale: 32.0 2023-10-12 06:33:14,555 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=969901.3333333334, ans=0.1 2023-10-12 06:33:19,986 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 06:33:33,186 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=969994.6666666666, ans=0.0 2023-10-12 06:33:37,545 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.55 vs. limit=15.0 2023-10-12 06:33:50,161 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=970041.3333333334, ans=0.0 2023-10-12 06:33:50,270 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=970041.3333333334, ans=0.125 2023-10-12 06:34:02,289 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=970088.0, ans=0.125 2023-10-12 06:34:22,284 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.89 vs. limit=15.0 2023-10-12 06:34:24,043 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=970181.3333333334, ans=0.125 2023-10-12 06:34:42,319 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.781e+02 1.934e+02 2.170e+02 3.723e+02, threshold=3.868e+02, percent-clipped=0.0 2023-10-12 06:34:55,356 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=970321.3333333334, ans=0.125 2023-10-12 06:35:16,409 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=970414.6666666666, ans=0.0 2023-10-12 06:35:25,077 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=11.34 vs. 
limit=22.5 2023-10-12 06:35:25,988 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=970461.3333333334, ans=0.125 2023-10-12 06:36:01,125 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=970601.3333333334, ans=0.125 2023-10-12 06:36:12,859 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-208000.pt 2023-10-12 06:36:32,078 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=970694.6666666666, ans=0.125 2023-10-12 06:36:33,129 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=970694.6666666666, ans=0.05 2023-10-12 06:36:38,537 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=970741.3333333334, ans=0.125 2023-10-12 06:36:41,883 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.711e+02 1.913e+02 2.244e+02 3.214e+02, threshold=3.827e+02, percent-clipped=0.0 2023-10-12 06:36:47,744 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=970788.0, ans=0.125 2023-10-12 06:36:58,420 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=970834.6666666666, ans=0.125 2023-10-12 06:37:14,641 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=970881.3333333334, ans=0.125 2023-10-12 06:37:49,047 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=970974.6666666666, ans=0.0 2023-10-12 06:37:53,414 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.69 vs. limit=6.0 2023-10-12 06:37:56,100 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=971021.3333333334, ans=0.1 2023-10-12 06:38:28,957 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.36 vs. 
limit=15.0 2023-10-12 06:38:32,853 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=971161.3333333334, ans=0.125 2023-10-12 06:38:36,507 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=971161.3333333334, ans=0.125 2023-10-12 06:38:36,508 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=971161.3333333334, ans=0.2 2023-10-12 06:38:42,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=971208.0, ans=0.0 2023-10-12 06:38:46,649 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.486e+02 1.730e+02 1.929e+02 2.214e+02 3.129e+02, threshold=3.858e+02, percent-clipped=0.0 2023-10-12 06:39:32,758 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=971394.6666666666, ans=0.125 2023-10-12 06:39:32,823 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=971394.6666666666, ans=0.125 2023-10-12 06:39:38,803 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=971441.3333333334, ans=0.125 2023-10-12 06:39:39,083 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.95 vs. limit=6.0 2023-10-12 06:39:58,487 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=971534.6666666666, ans=0.0 2023-10-12 06:40:11,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=971581.3333333334, ans=0.0 2023-10-12 06:40:34,496 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.30 vs. limit=6.0 2023-10-12 06:40:39,815 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.766e+02 1.927e+02 2.174e+02 2.934e+02, threshold=3.853e+02, percent-clipped=0.0 2023-10-12 06:40:44,337 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=971674.6666666666, ans=0.1 2023-10-12 06:40:46,671 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.64 vs. limit=6.0 2023-10-12 06:40:47,626 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.38 vs. limit=22.5 2023-10-12 06:40:48,162 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=971721.3333333334, ans=0.1 2023-10-12 06:41:12,347 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.42 vs. 
limit=10.0 2023-10-12 06:41:45,958 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=971954.6666666666, ans=0.0 2023-10-12 06:42:32,542 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.474e+02 1.787e+02 1.917e+02 2.193e+02 3.191e+02, threshold=3.835e+02, percent-clipped=0.0 2023-10-12 06:42:38,487 INFO [train.py:1031] (0/4) Epoch 16, batch 3500, loss[loss=0.1821, simple_loss=0.279, pruned_loss=0.04261, over 16780.00 frames. ], tot_loss[loss=0.1953, simple_loss=0.285, pruned_loss=0.05275, over 27067353.76 frames. ], batch size: 98, lr: 2.20e-03, grad_scale: 32.0 2023-10-12 06:42:41,318 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=972188.0, ans=0.125 2023-10-12 06:42:42,269 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 06:42:42,272 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=972188.0, ans=0.0 2023-10-12 06:42:45,339 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2.whitening_limit, batch_count=972188.0, ans=15.0 2023-10-12 06:43:06,249 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.26 vs. limit=15.0 2023-10-12 06:43:33,999 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=972374.6666666666, ans=0.0 2023-10-12 06:43:36,818 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=972421.3333333334, ans=0.2 2023-10-12 06:44:13,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=972514.6666666666, ans=0.0 2023-10-12 06:44:16,297 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.86 vs. 
limit=22.5 2023-10-12 06:44:33,911 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.490e+02 1.768e+02 1.961e+02 2.249e+02 2.812e+02, threshold=3.922e+02, percent-clipped=0.0 2023-10-12 06:44:39,560 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=972654.6666666666, ans=0.1 2023-10-12 06:44:44,105 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=972654.6666666666, ans=0.125 2023-10-12 06:44:59,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=972701.3333333334, ans=0.2 2023-10-12 06:45:07,340 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 06:45:35,763 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=972841.3333333334, ans=0.2 2023-10-12 06:45:45,955 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=972888.0, ans=0.0 2023-10-12 06:45:54,360 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=972934.6666666666, ans=0.0 2023-10-12 06:45:56,238 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=972934.6666666666, ans=0.0 2023-10-12 06:46:02,764 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=972934.6666666666, ans=0.2 2023-10-12 06:46:10,798 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=972981.3333333334, ans=0.2 2023-10-12 06:46:16,323 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.74 vs. limit=10.0 2023-10-12 06:46:32,503 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.657e+02 1.870e+02 2.081e+02 2.778e+02, threshold=3.739e+02, percent-clipped=0.0 2023-10-12 06:46:33,979 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=973074.6666666666, ans=0.1 2023-10-12 06:47:08,134 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.87 vs. limit=10.0 2023-10-12 06:47:40,699 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.40 vs. limit=10.0 2023-10-12 06:47:48,959 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=973354.6666666666, ans=0.125 2023-10-12 06:47:54,019 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=973401.3333333334, ans=0.125 2023-10-12 06:47:54,292 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.78 vs. 
limit=10.0 2023-10-12 06:48:03,989 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=973448.0, ans=0.05 2023-10-12 06:48:32,909 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.704e+02 1.878e+02 2.031e+02 2.594e+02, threshold=3.755e+02, percent-clipped=0.0 2023-10-12 06:48:41,066 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=973588.0, ans=0.2 2023-10-12 06:48:52,652 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=973634.6666666666, ans=22.5 2023-10-12 06:49:08,405 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=973681.3333333334, ans=0.125 2023-10-12 06:49:19,841 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.14 vs. limit=15.0 2023-10-12 06:49:20,591 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=973728.0, ans=0.125 2023-10-12 06:49:32,640 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=973821.3333333334, ans=0.0 2023-10-12 06:49:36,958 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=973821.3333333334, ans=0.0 2023-10-12 06:49:59,769 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=973914.6666666666, ans=0.0 2023-10-12 06:50:21,768 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=974008.0, ans=0.04949747468305833 2023-10-12 06:50:22,974 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.389e+02 1.700e+02 1.872e+02 2.052e+02 2.548e+02, threshold=3.744e+02, percent-clipped=0.0 2023-10-12 06:50:23,712 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.78 vs. 
limit=15.0 2023-10-12 06:50:40,168 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=974101.3333333334, ans=0.125 2023-10-12 06:50:57,965 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=974148.0, ans=0.1 2023-10-12 06:51:17,483 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=974241.3333333334, ans=0.125 2023-10-12 06:51:23,062 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=974288.0, ans=0.1 2023-10-12 06:52:15,042 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.383e+02 1.715e+02 1.916e+02 2.249e+02 3.114e+02, threshold=3.832e+02, percent-clipped=0.0 2023-10-12 06:52:15,691 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=974474.6666666666, ans=0.125 2023-10-12 06:52:17,009 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=974474.6666666666, ans=0.2 2023-10-12 06:52:20,169 INFO [train.py:1031] (0/4) Epoch 16, batch 4000, loss[loss=0.2118, simple_loss=0.3064, pruned_loss=0.05854, over 16603.00 frames. ], tot_loss[loss=0.195, simple_loss=0.2847, pruned_loss=0.05268, over 28326394.76 frames. ], batch size: 241, lr: 2.20e-03, grad_scale: 16.0 2023-10-12 06:52:44,247 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=974614.6666666666, ans=0.0 2023-10-12 06:52:50,178 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=974614.6666666666, ans=0.0 2023-10-12 06:52:54,659 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=974614.6666666666, ans=0.0 2023-10-12 06:52:54,769 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.17 vs. limit=15.0 2023-10-12 06:52:56,333 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=974661.3333333334, ans=0.125 2023-10-12 06:52:59,382 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=974661.3333333334, ans=0.125 2023-10-12 06:53:00,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=974661.3333333334, ans=0.2 2023-10-12 06:53:01,466 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=974661.3333333334, ans=0.2 2023-10-12 06:53:01,469 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=974661.3333333334, ans=0.1 2023-10-12 06:53:10,123 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=974708.0, ans=0.0 2023-10-12 06:53:12,894 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.49 vs. 
limit=10.0 2023-10-12 06:53:37,864 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.01 vs. limit=10.0 2023-10-12 06:53:50,107 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=974894.6666666666, ans=15.0 2023-10-12 06:54:07,169 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=974941.3333333334, ans=0.5 2023-10-12 06:54:07,248 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=974941.3333333334, ans=0.125 2023-10-12 06:54:09,434 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.500e+02 1.716e+02 1.855e+02 2.048e+02 2.597e+02, threshold=3.709e+02, percent-clipped=0.0 2023-10-12 06:54:15,297 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=974988.0, ans=0.5 2023-10-12 06:54:22,701 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=974988.0, ans=0.0 2023-10-12 06:54:42,530 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=975081.3333333334, ans=0.07 2023-10-12 06:54:56,661 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=975128.0, ans=0.2 2023-10-12 06:54:57,650 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=975128.0, ans=0.125 2023-10-12 06:55:28,086 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=975268.0, ans=0.125 2023-10-12 06:55:29,536 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 06:56:06,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=975361.3333333334, ans=0.125 2023-10-12 06:56:12,058 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=975408.0, ans=0.125 2023-10-12 06:56:17,760 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.363e+02 1.717e+02 1.916e+02 2.111e+02 3.131e+02, threshold=3.831e+02, percent-clipped=0.0 2023-10-12 06:56:27,807 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=975454.6666666666, ans=0.0 2023-10-12 06:56:38,674 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=975501.3333333334, ans=0.2 2023-10-12 06:56:46,878 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=975548.0, ans=0.0 2023-10-12 06:56:57,402 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.01 vs. 
limit=10.0 2023-10-12 06:57:13,759 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=975641.3333333334, ans=0.125 2023-10-12 06:57:22,876 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=975688.0, ans=0.0 2023-10-12 06:57:34,325 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=975734.6666666666, ans=0.125 2023-10-12 06:57:45,622 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=975781.3333333334, ans=0.125 2023-10-12 06:58:00,526 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=975828.0, ans=0.125 2023-10-12 06:58:03,461 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=975828.0, ans=0.2 2023-10-12 06:58:13,425 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.677e+02 1.919e+02 2.254e+02 3.431e+02, threshold=3.837e+02, percent-clipped=0.0 2023-10-12 06:58:32,966 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=975968.0, ans=0.2 2023-10-12 06:58:35,501 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=975968.0, ans=0.2 2023-10-12 06:58:36,323 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=975968.0, ans=0.125 2023-10-12 06:58:37,827 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=975968.0, ans=0.0 2023-10-12 06:58:39,720 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=976014.6666666666, ans=0.2 2023-10-12 06:58:46,801 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=976014.6666666666, ans=15.0 2023-10-12 06:58:57,505 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=976061.3333333334, ans=0.125 2023-10-12 06:58:58,337 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=976061.3333333334, ans=0.0 2023-10-12 06:59:22,441 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.86 vs. 
limit=15.0 2023-10-12 07:00:06,232 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.393e+02 1.788e+02 1.970e+02 2.182e+02 3.243e+02, threshold=3.941e+02, percent-clipped=0.0 2023-10-12 07:00:06,576 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=976341.3333333334, ans=0.125 2023-10-12 07:00:15,206 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=976388.0, ans=0.125 2023-10-12 07:00:17,809 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=976388.0, ans=0.125 2023-10-12 07:00:27,440 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=976434.6666666666, ans=0.125 2023-10-12 07:00:28,328 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=976434.6666666666, ans=0.125 2023-10-12 07:01:20,774 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=976621.3333333334, ans=0.125 2023-10-12 07:01:31,389 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=976668.0, ans=0.125 2023-10-12 07:01:35,490 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=976668.0, ans=0.05 2023-10-12 07:01:36,910 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.15 vs. limit=15.0 2023-10-12 07:01:58,739 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=976761.3333333334, ans=0.5 2023-10-12 07:02:01,906 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=976761.3333333334, ans=0.125 2023-10-12 07:02:02,137 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.80 vs. limit=6.0 2023-10-12 07:02:10,278 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=976808.0, ans=0.125 2023-10-12 07:02:11,927 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.478e+02 1.766e+02 1.909e+02 2.117e+02 3.307e+02, threshold=3.818e+02, percent-clipped=0.0 2023-10-12 07:02:15,891 INFO [train.py:1031] (0/4) Epoch 16, batch 4500, loss[loss=0.1725, simple_loss=0.2687, pruned_loss=0.03815, over 16816.00 frames. ], tot_loss[loss=0.1949, simple_loss=0.285, pruned_loss=0.05242, over 29320966.44 frames. ], batch size: 175, lr: 2.20e-03, grad_scale: 32.0 2023-10-12 07:02:35,167 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=976901.3333333334, ans=0.1 2023-10-12 07:02:43,856 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 07:02:46,147 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.49 vs. 
limit=10.0 2023-10-12 07:02:56,915 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=976994.6666666666, ans=0.125 2023-10-12 07:03:02,376 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=977041.3333333334, ans=0.125 2023-10-12 07:03:10,866 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=977088.0, ans=0.125 2023-10-12 07:03:18,965 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.30 vs. limit=22.5 2023-10-12 07:03:23,527 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.70 vs. limit=22.5 2023-10-12 07:03:27,853 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=977134.6666666666, ans=0.1 2023-10-12 07:03:30,994 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 07:03:43,293 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=977228.0, ans=0.2 2023-10-12 07:03:44,352 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=977228.0, ans=0.2 2023-10-12 07:03:50,635 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=977228.0, ans=0.125 2023-10-12 07:03:59,924 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.422e+02 1.735e+02 1.870e+02 2.144e+02 2.941e+02, threshold=3.740e+02, percent-clipped=0.0 2023-10-12 07:04:16,367 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=977368.0, ans=0.125 2023-10-12 07:04:17,622 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.19 vs. limit=15.0 2023-10-12 07:04:28,083 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=977414.6666666666, ans=0.0 2023-10-12 07:04:50,706 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=977508.0, ans=0.125 2023-10-12 07:04:57,681 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=977554.6666666666, ans=0.125 2023-10-12 07:05:04,461 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=977554.6666666666, ans=0.125 2023-10-12 07:05:11,409 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=977601.3333333334, ans=0.0 2023-10-12 07:05:19,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=977648.0, ans=0.07 2023-10-12 07:05:26,419 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.95 vs. 
limit=15.0 2023-10-12 07:05:38,902 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=977694.6666666666, ans=0.125 2023-10-12 07:05:43,708 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 07:05:44,820 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=977741.3333333334, ans=0.125 2023-10-12 07:05:46,138 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=13.58 vs. limit=15.0 2023-10-12 07:05:49,270 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.590e+02 1.765e+02 1.919e+02 2.141e+02 2.637e+02, threshold=3.839e+02, percent-clipped=0.0 2023-10-12 07:06:04,720 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=977834.6666666666, ans=0.1 2023-10-12 07:06:23,313 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=977881.3333333334, ans=0.0 2023-10-12 07:06:52,605 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=978021.3333333334, ans=0.2 2023-10-12 07:06:53,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=978021.3333333334, ans=0.0 2023-10-12 07:06:59,547 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.13 vs. limit=15.0 2023-10-12 07:07:10,696 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=978114.6666666666, ans=0.125 2023-10-12 07:07:21,659 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=978161.3333333334, ans=0.0 2023-10-12 07:07:27,934 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=978161.3333333334, ans=0.1 2023-10-12 07:07:34,930 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=978208.0, ans=0.1 2023-10-12 07:07:36,504 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.648e+02 1.804e+02 1.961e+02 2.801e+02, threshold=3.608e+02, percent-clipped=0.0 2023-10-12 07:07:53,832 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=978301.3333333334, ans=0.2 2023-10-12 07:08:23,801 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.83 vs. 
limit=12.0 2023-10-12 07:08:24,665 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=978394.6666666666, ans=0.0 2023-10-12 07:08:47,491 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=978488.0, ans=0.125 2023-10-12 07:09:07,003 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=978581.3333333334, ans=0.125 2023-10-12 07:09:08,191 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=24.03 vs. limit=22.5 2023-10-12 07:09:33,531 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.738e+02 1.935e+02 2.175e+02 2.793e+02, threshold=3.870e+02, percent-clipped=0.0 2023-10-12 07:09:35,532 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=978674.6666666666, ans=0.125 2023-10-12 07:10:09,527 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=978814.6666666666, ans=0.2 2023-10-12 07:10:11,612 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=978861.3333333334, ans=0.1 2023-10-12 07:10:28,212 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.02 vs. limit=22.5 2023-10-12 07:10:33,117 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=978908.0, ans=0.1 2023-10-12 07:10:38,209 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=978954.6666666666, ans=0.125 2023-10-12 07:10:51,183 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.92 vs. limit=6.0 2023-10-12 07:11:07,151 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=979048.0, ans=0.2 2023-10-12 07:11:08,036 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=979048.0, ans=0.125 2023-10-12 07:11:08,120 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=979048.0, ans=0.2 2023-10-12 07:11:18,808 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=979094.6666666666, ans=0.0 2023-10-12 07:11:30,063 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.467e+02 1.758e+02 1.958e+02 2.264e+02 3.264e+02, threshold=3.917e+02, percent-clipped=0.0 2023-10-12 07:11:32,891 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=979188.0, ans=0.125 2023-10-12 07:11:33,518 INFO [train.py:1031] (0/4) Epoch 16, batch 5000, loss[loss=0.2219, simple_loss=0.2913, pruned_loss=0.07628, over 15679.00 frames. ], tot_loss[loss=0.1946, simple_loss=0.2846, pruned_loss=0.05233, over 30113766.28 frames. 
], batch size: 350, lr: 2.19e-03, grad_scale: 16.0 2023-10-12 07:11:38,190 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.47 vs. limit=15.0 2023-10-12 07:11:49,130 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=979234.6666666666, ans=0.125 2023-10-12 07:12:02,917 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=979281.3333333334, ans=0.125 2023-10-12 07:12:18,260 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=979374.6666666666, ans=0.0 2023-10-12 07:12:19,197 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=979374.6666666666, ans=0.09899494936611666 2023-10-12 07:12:20,404 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.67 vs. limit=15.0 2023-10-12 07:12:24,150 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=979374.6666666666, ans=0.0 2023-10-12 07:12:35,014 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.78 vs. limit=10.0 2023-10-12 07:12:56,054 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=979514.6666666666, ans=0.07 2023-10-12 07:13:06,406 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.86 vs. 
limit=15.0 2023-10-12 07:13:07,921 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=979561.3333333334, ans=0.5 2023-10-12 07:13:16,574 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=5.014e-03 2023-10-12 07:13:22,390 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=979608.0, ans=0.1 2023-10-12 07:13:22,982 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.729e+02 1.890e+02 2.064e+02 2.683e+02, threshold=3.780e+02, percent-clipped=0.0 2023-10-12 07:13:23,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=979608.0, ans=0.125 2023-10-12 07:13:44,718 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=979701.3333333334, ans=0.0 2023-10-12 07:13:45,897 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=979701.3333333334, ans=0.0 2023-10-12 07:13:46,811 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=979701.3333333334, ans=0.125 2023-10-12 07:13:51,055 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=979748.0, ans=0.125 2023-10-12 07:13:56,130 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=979748.0, ans=0.125 2023-10-12 07:14:10,501 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=979794.6666666666, ans=0.125 2023-10-12 07:14:13,057 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.61 vs. limit=5.0 2023-10-12 07:14:14,390 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=979841.3333333334, ans=0.125 2023-10-12 07:15:03,515 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=980028.0, ans=0.125 2023-10-12 07:15:06,706 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=980074.6666666666, ans=0.125 2023-10-12 07:15:09,369 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=980074.6666666666, ans=0.0 2023-10-12 07:15:14,849 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.421e+02 1.739e+02 1.966e+02 2.203e+02 3.388e+02, threshold=3.932e+02, percent-clipped=0.0 2023-10-12 07:15:18,245 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.34 vs. limit=15.0 2023-10-12 07:15:48,118 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=980261.3333333334, ans=0.125 2023-10-12 07:16:29,574 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.59 vs. 
limit=15.0 2023-10-12 07:16:41,045 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=980448.0, ans=0.1 2023-10-12 07:17:09,354 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.481e+02 1.694e+02 1.865e+02 2.017e+02 2.692e+02, threshold=3.730e+02, percent-clipped=0.0 2023-10-12 07:17:13,126 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=980588.0, ans=0.125 2023-10-12 07:17:21,718 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.43 vs. limit=22.5 2023-10-12 07:17:39,077 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=980681.3333333334, ans=0.0 2023-10-12 07:17:52,463 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=980728.0, ans=0.2 2023-10-12 07:17:55,561 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 07:18:24,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=980821.3333333334, ans=0.0 2023-10-12 07:18:27,741 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=980821.3333333334, ans=0.125 2023-10-12 07:18:37,365 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=980868.0, ans=0.125 2023-10-12 07:18:38,164 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=980868.0, ans=0.0 2023-10-12 07:18:41,675 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 07:18:54,043 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=980961.3333333334, ans=0.125 2023-10-12 07:19:07,437 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=981008.0, ans=0.1 2023-10-12 07:19:09,934 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.630e+02 1.826e+02 2.011e+02 3.198e+02, threshold=3.651e+02, percent-clipped=0.0 2023-10-12 07:19:10,281 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=981008.0, ans=0.125 2023-10-12 07:19:14,360 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=981054.6666666666, ans=0.125 2023-10-12 07:19:23,916 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=981101.3333333334, ans=0.125 2023-10-12 07:19:46,684 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=981194.6666666666, ans=0.125 2023-10-12 07:20:05,874 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=981241.3333333334, ans=0.125 2023-10-12 07:20:08,605 INFO [scaling.py:979] (0/4) Whitening: 
2023-10-12 07:20:12,345 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=981288.0, ans=0.125
2023-10-12 07:20:57,870 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.34 vs. limit=15.0
2023-10-12 07:20:58,293 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.389e+02 1.703e+02 1.975e+02 2.252e+02 2.999e+02, threshold=3.949e+02, percent-clipped=0.0
2023-10-12 07:21:01,202 INFO [train.py:1031] (0/4) Epoch 16, batch 5500, loss[loss=0.1746, simple_loss=0.2629, pruned_loss=0.04319, over 16522.00 frames. ], tot_loss[loss=0.1944, simple_loss=0.2844, pruned_loss=0.05222, over 30702182.82 frames. ], batch size: 61, lr: 2.19e-03, grad_scale: 32.0
2023-10-12 07:21:06,488 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=981521.3333333334, ans=0.05
2023-10-12 07:21:11,082 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.17 vs. limit=15.0
2023-10-12 07:21:20,269 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=981568.0, ans=0.1
2023-10-12 07:22:09,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=981801.3333333334, ans=0.95
2023-10-12 07:22:12,081 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=981801.3333333334, ans=0.125
2023-10-12 07:22:12,975 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=981801.3333333334, ans=0.0
2023-10-12 07:22:14,189 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.59 vs. limit=6.0
2023-10-12 07:22:18,771 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.00 vs. limit=22.5
2023-10-12 07:22:41,437 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=981941.3333333334, ans=0.0
2023-10-12 07:22:44,502 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=981941.3333333334, ans=0.5
2023-10-12 07:22:46,485 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.352e+02 1.720e+02 1.853e+02 2.009e+02 2.643e+02, threshold=3.706e+02, percent-clipped=0.0
2023-10-12 07:23:03,418 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=982034.6666666666, ans=0.125
2023-10-12 07:23:17,986 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.17 vs. limit=22.5
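The Whitening entries compare a whiteness statistic of a module's output covariance against a limit; a corrective gradient penalty kicks in only when the metric exceeds the limit, so most of these lines are purely informational. The metric can be read as mean(eig^2) / mean(eig)^2 of the (possibly grouped) covariance: exactly 1.0 for a perfectly white covariance, growing as the eigenvalue spread widens. A sketch of that computation, which approximates but is not guaranteed to match scaling.py's exact definition:

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
        # x: (..., num_channels), num_channels divisible by num_groups.
        # Returns mean(eig^2) / mean(eig)^2 of the per-group covariance;
        # 1.0 means perfectly white, larger is less white.
        x = x.reshape(-1, x.shape[-1])
        num_frames, num_channels = x.shape
        x = x.reshape(num_frames, num_groups, num_channels // num_groups).transpose(0, 1)
        x = x - x.mean(dim=1, keepdim=True)
        cov = torch.matmul(x.transpose(1, 2), x) / num_frames  # (groups, c, c)
        k = cov.shape[-1]
        # For symmetric cov: sum(eig^2) == ||cov||_F^2 and sum(eig) == trace(cov),
        # so no eigendecomposition is needed.
        mean_eig_sq = (cov * cov).sum(dim=(1, 2)) / k
        mean_eig = torch.diagonal(cov, dim1=1, dim2=2).sum(dim=1) / k
        return (mean_eig_sq / (mean_eig ** 2)).mean().item()

Read against the log, the self_attn1.whiten entry just above (metric=11.17 vs. limit=22.5) means those 192-channel activations are within bounds and no penalty is applied.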
2023-10-12 07:23:32,614 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=982128.0, ans=0.125
2023-10-12 07:23:36,778 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.33 vs. limit=22.5
2023-10-12 07:23:47,825 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=982221.3333333334, ans=0.0
2023-10-12 07:24:22,136 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten.whitening_limit, batch_count=982361.3333333334, ans=22.5
2023-10-12 07:24:40,466 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.747e+02 1.910e+02 2.094e+02 2.874e+02, threshold=3.819e+02, percent-clipped=0.0
2023-10-12 07:24:42,966 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=982408.0, ans=0.025
2023-10-12 07:24:54,964 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.69 vs. limit=15.0
2023-10-12 07:25:16,338 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=982548.0, ans=0.0
2023-10-12 07:25:17,354 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=982548.0, ans=0.125
2023-10-12 07:25:21,971 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=982594.6666666666, ans=0.125
2023-10-12 07:25:33,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=982641.3333333334, ans=0.1
2023-10-12 07:25:54,243 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=982734.6666666666, ans=0.0
2023-10-12 07:25:56,083 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=982734.6666666666, ans=0.125
2023-10-12 07:25:57,079 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=982734.6666666666, ans=0.2
2023-10-12 07:25:57,917 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=982734.6666666666, ans=0.125
2023-10-12 07:26:03,227 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=982781.3333333334, ans=0.0
2023-10-12 07:26:32,288 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.76 vs. limit=15.0
2023-10-12 07:26:33,530 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 1.738e+02 1.888e+02 2.110e+02 2.709e+02, threshold=3.775e+02, percent-clipped=0.0
2023-10-12 07:26:36,179 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.49 vs. limit=15.0
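The balancer fields scattered through these records (prob, min_positive, max_positive, min_abs and so on) belong to Balancer modules that keep per-channel activation statistics inside a target range, for example the fraction of positive values and the typical magnitude; the prob values appear to be the scheduled probability that the corrective gradient is applied on a given batch. A sketch of the statistics being constrained (the corrective-gradient mechanics in scaling.py are more involved and omitted here):

    import torch

    def balancer_stats(x: torch.Tensor, channel_dim: int = -1):
        # Per-channel statistics that a Balancer constrains: the fraction of
        # positive activations and the mean absolute value per channel.
        x = x.transpose(channel_dim, -1).reshape(-1, x.shape[channel_dim])
        frac_positive = (x > 0).float().mean(dim=0)  # vs. min_positive/max_positive
        mean_abs = x.abs().mean(dim=0)               # vs. min_abs/max_abs
        return frac_positive, mean_abs

For instance, a channel violates min_positive=0.025 only when it is almost never positive, i.e. nearly dead; the ans values logged above (0.025, 0.05, 0.95, 0.5) are the scheduled bounds themselves.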
2023-10-12 07:26:57,487 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.36 vs. limit=10.0
2023-10-12 07:27:31,895 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=983154.6666666666, ans=0.2
2023-10-12 07:27:32,884 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=983154.6666666666, ans=0.125
2023-10-12 07:27:49,608 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=983201.3333333334, ans=0.0
2023-10-12 07:28:21,379 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=983341.3333333334, ans=0.1
2023-10-12 07:28:27,326 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.645e+02 1.784e+02 2.008e+02 2.983e+02, threshold=3.567e+02, percent-clipped=0.0
2023-10-12 07:28:49,550 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=983434.6666666666, ans=0.125
2023-10-12 07:28:51,454 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=983434.6666666666, ans=0.125
2023-10-12 07:28:57,096 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=983481.3333333334, ans=0.2
2023-10-12 07:29:14,969 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=983528.0, ans=0.125
2023-10-12 07:29:23,341 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=983574.6666666666, ans=0.1
2023-10-12 07:29:29,631 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=983621.3333333334, ans=0.07
2023-10-12 07:29:41,781 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=983668.0, ans=0.125
2023-10-12 07:29:54,266 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=983714.6666666666, ans=0.0
2023-10-12 07:29:56,251 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=983714.6666666666, ans=0.05
2023-10-12 07:29:59,214 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=983761.3333333334, ans=0.0
2023-10-12 07:30:09,989 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=983808.0, ans=0.0
2023-10-12 07:30:17,475 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.455e+02 1.726e+02 1.903e+02 2.071e+02 2.794e+02, threshold=3.805e+02, percent-clipped=0.0
2023-10-12 07:30:20,428 INFO [train.py:1031] (0/4) Epoch 16, batch 6000, loss[loss=0.193, simple_loss=0.2803, pruned_loss=0.05283, over 16939.00 frames. ], tot_loss[loss=0.1949, simple_loss=0.2849, pruned_loss=0.05248, over 31200637.02 frames. ], batch size: 82, lr: 2.19e-03, grad_scale: 32.0
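In the Epoch/batch summary lines, loss[...] is the current batch (where loss aggregates the simple and pruned transducer losses with recipe-defined weights) and tot_loss[...] is a running average weighted by the number of acoustic frames per batch; the fractional cumulative frame counts (e.g. 31200637.02) suggest an exponential forgetting factor rather than a plain sum. A small tracker that reproduces the shape of those numbers, with the decay constant an assumption and the class name illustrative:

    class LossTracker:
        """Frame-weighted running average of named losses (illustrative;
        the recipe keeps equivalent statistics in its own tracker)."""

        def __init__(self, decay: float = 0.999):  # forgetting factor: assumed value
            self.decay = decay
            self.frames = 0.0
            self.sums: dict[str, float] = {}

        def update(self, num_frames: float, **losses: float) -> None:
            self.frames = self.frames * self.decay + num_frames
            for name, value in losses.items():
                prev = self.sums.get(name, 0.0) * self.decay
                self.sums[name] = prev + value * num_frames  # weight by frames

        def averages(self) -> dict:
            return {name: s / self.frames for name, s in self.sums.items()}

    tracker = LossTracker()
    tracker.update(16939.0, loss=0.193, simple_loss=0.2803, pruned_loss=0.05283)
    # After many batches, averages() corresponds to the tot_loss[...] columns,
    # e.g. loss=0.1949 over 31200637.02 frames in the batch 6000 line above.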
2023-10-12 07:30:41,255 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=983901.3333333334, ans=0.0
2023-10-12 07:30:48,127 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=983948.0, ans=0.04949747468305833
2023-10-12 07:30:48,961 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=983948.0, ans=0.125
2023-10-12 07:30:52,374 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=983948.0, ans=0.0
2023-10-12 07:30:55,164 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=983994.6666666666, ans=0.0
2023-10-12 07:30:56,018 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=983994.6666666666, ans=0.09899494936611666
2023-10-12 07:31:04,147 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=984041.3333333334, ans=0.0
2023-10-12 07:31:11,282 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=984041.3333333334, ans=0.1
2023-10-12 07:31:11,340 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=984041.3333333334, ans=0.0
2023-10-12 07:31:33,057 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=984134.6666666666, ans=0.2
2023-10-12 07:31:35,941 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=984134.6666666666, ans=0.2
2023-10-12 07:31:41,093 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=984181.3333333334, ans=0.95
2023-10-12 07:31:43,216 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.98 vs.
limit=15.0 2023-10-12 07:32:08,782 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.708e+02 1.869e+02 2.137e+02 3.448e+02, threshold=3.738e+02, percent-clipped=0.0 2023-10-12 07:32:12,956 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=984321.3333333334, ans=0.1 2023-10-12 07:32:17,012 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=984321.3333333334, ans=0.125 2023-10-12 07:32:24,413 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=984368.0, ans=0.125 2023-10-12 07:32:43,400 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=984461.3333333334, ans=22.5 2023-10-12 07:33:16,976 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=984554.6666666666, ans=0.07 2023-10-12 07:33:46,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=984694.6666666666, ans=0.125 2023-10-12 07:33:58,175 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=984741.3333333334, ans=0.125 2023-10-12 07:34:00,418 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.786e+02 1.973e+02 2.241e+02 3.528e+02, threshold=3.946e+02, percent-clipped=0.0 2023-10-12 07:34:02,678 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=984741.3333333334, ans=0.1 2023-10-12 07:34:30,157 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=984881.3333333334, ans=0.0 2023-10-12 07:34:30,231 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=984881.3333333334, ans=0.125 2023-10-12 07:34:45,209 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.21 vs. limit=15.0 2023-10-12 07:34:50,851 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.99 vs. 
limit=10.0 2023-10-12 07:35:11,294 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=985068.0, ans=0.125 2023-10-12 07:35:39,148 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=985161.3333333334, ans=0.125 2023-10-12 07:35:51,773 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.471e+02 1.684e+02 1.837e+02 2.044e+02 2.542e+02, threshold=3.674e+02, percent-clipped=0.0 2023-10-12 07:35:52,277 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=985208.0, ans=0.1 2023-10-12 07:36:11,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=985301.3333333334, ans=0.07 2023-10-12 07:36:19,897 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_abs, batch_count=985348.0, ans=0.5 2023-10-12 07:36:46,690 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=985441.3333333334, ans=0.125 2023-10-12 07:37:21,363 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=985581.3333333334, ans=0.2 2023-10-12 07:37:27,969 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=985581.3333333334, ans=0.1 2023-10-12 07:37:31,731 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=985581.3333333334, ans=0.125 2023-10-12 07:37:36,349 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 07:37:47,841 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.46 vs. limit=15.0 2023-10-12 07:37:49,804 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.23 vs. limit=15.0 2023-10-12 07:37:54,216 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.369e+02 1.723e+02 1.889e+02 2.137e+02 2.895e+02, threshold=3.778e+02, percent-clipped=0.0 2023-10-12 07:38:38,015 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.39 vs. limit=22.5 2023-10-12 07:38:54,084 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=985954.6666666666, ans=0.125 2023-10-12 07:39:19,054 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.63 vs. 
limit=15.0 2023-10-12 07:39:21,473 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=986048.0, ans=0.0 2023-10-12 07:39:28,908 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=986094.6666666666, ans=0.125 2023-10-12 07:39:48,148 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.337e+02 1.731e+02 1.884e+02 2.079e+02 3.368e+02, threshold=3.769e+02, percent-clipped=0.0 2023-10-12 07:39:48,403 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=986141.3333333334, ans=0.125 2023-10-12 07:39:50,155 INFO [train.py:1031] (0/4) Epoch 16, batch 6500, loss[loss=0.2067, simple_loss=0.3073, pruned_loss=0.05308, over 16834.00 frames. ], tot_loss[loss=0.1953, simple_loss=0.2854, pruned_loss=0.05261, over 31571642.79 frames. ], batch size: 175, lr: 2.19e-03, grad_scale: 32.0 2023-10-12 07:39:51,440 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=986188.0, ans=0.0 2023-10-12 07:39:59,695 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=986188.0, ans=0.1 2023-10-12 07:40:33,150 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.78 vs. limit=10.0 2023-10-12 07:40:36,102 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=986328.0, ans=0.125 2023-10-12 07:40:49,296 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=986374.6666666666, ans=0.0 2023-10-12 07:41:13,946 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=986468.0, ans=0.1 2023-10-12 07:41:14,138 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=986468.0, ans=0.05 2023-10-12 07:41:19,910 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.58 vs. 
limit=15.0 2023-10-12 07:41:33,669 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=986561.3333333334, ans=0.0 2023-10-12 07:41:43,832 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=986608.0, ans=0.0 2023-10-12 07:41:48,418 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=986608.0, ans=0.125 2023-10-12 07:41:52,029 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.424e+02 1.706e+02 1.913e+02 2.116e+02 2.987e+02, threshold=3.827e+02, percent-clipped=0.0 2023-10-12 07:41:53,463 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=986654.6666666666, ans=0.0 2023-10-12 07:41:56,803 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=986654.6666666666, ans=0.2 2023-10-12 07:42:03,695 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=986654.6666666666, ans=0.1 2023-10-12 07:42:07,759 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=986701.3333333334, ans=0.2 2023-10-12 07:42:12,879 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=986701.3333333334, ans=0.125 2023-10-12 07:42:12,905 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=986701.3333333334, ans=0.0 2023-10-12 07:42:22,904 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=986748.0, ans=0.0 2023-10-12 07:42:31,335 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=986794.6666666666, ans=0.0 2023-10-12 07:42:36,866 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.83 vs. limit=15.0 2023-10-12 07:42:53,295 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=986888.0, ans=0.125 2023-10-12 07:42:59,740 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=986934.6666666666, ans=0.0 2023-10-12 07:43:07,735 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.41 vs. limit=15.0 2023-10-12 07:43:09,698 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=986934.6666666666, ans=0.0 2023-10-12 07:43:17,690 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.93 vs. 
limit=15.0 2023-10-12 07:43:32,387 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=987074.6666666666, ans=0.1 2023-10-12 07:43:40,092 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.451e+02 1.729e+02 1.905e+02 2.192e+02 3.140e+02, threshold=3.809e+02, percent-clipped=0.0 2023-10-12 07:43:40,384 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=987074.6666666666, ans=0.0 2023-10-12 07:43:41,316 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=987121.3333333334, ans=0.125 2023-10-12 07:43:52,692 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=987168.0, ans=0.0 2023-10-12 07:43:52,918 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.41 vs. limit=10.0 2023-10-12 07:43:53,868 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=987168.0, ans=0.125 2023-10-12 07:44:01,393 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=987168.0, ans=0.2 2023-10-12 07:44:02,341 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=987168.0, ans=0.2 2023-10-12 07:44:34,471 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 07:44:36,442 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=987308.0, ans=0.0 2023-10-12 07:44:54,656 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=987401.3333333334, ans=0.1 2023-10-12 07:45:05,704 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.41 vs. limit=15.0 2023-10-12 07:45:12,484 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 07:45:25,409 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.26 vs. limit=15.0 2023-10-12 07:45:26,850 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=987541.3333333334, ans=0.125 2023-10-12 07:45:36,041 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.368e+02 1.659e+02 1.803e+02 2.101e+02 3.161e+02, threshold=3.606e+02, percent-clipped=0.0 2023-10-12 07:46:14,520 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 07:46:14,663 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.39 vs. limit=22.5 2023-10-12 07:46:50,038 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.60 vs. 
limit=6.0 2023-10-12 07:47:14,212 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=987914.6666666666, ans=0.1 2023-10-12 07:47:17,852 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=987914.6666666666, ans=0.1 2023-10-12 07:47:23,335 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 07:47:27,429 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=987961.3333333334, ans=0.2 2023-10-12 07:47:37,934 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.39 vs. limit=15.0 2023-10-12 07:47:44,715 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.335e+02 1.634e+02 1.796e+02 2.026e+02 3.215e+02, threshold=3.593e+02, percent-clipped=0.0 2023-10-12 07:47:50,526 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=988054.6666666666, ans=0.09899494936611666 2023-10-12 07:47:57,706 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.86 vs. limit=15.0 2023-10-12 07:48:05,725 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.56 vs. limit=6.0 2023-10-12 07:48:11,106 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.25 vs. limit=15.0 2023-10-12 07:48:22,606 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=988194.6666666666, ans=0.0 2023-10-12 07:48:23,578 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=988194.6666666666, ans=0.125 2023-10-12 07:48:45,137 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=988288.0, ans=0.0 2023-10-12 07:48:46,738 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=988288.0, ans=0.125 2023-10-12 07:48:50,419 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=988334.6666666666, ans=0.125 2023-10-12 07:48:53,805 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=988334.6666666666, ans=0.1 2023-10-12 07:49:05,072 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=988381.3333333334, ans=0.07 2023-10-12 07:49:06,259 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=988381.3333333334, ans=0.125 2023-10-12 07:49:31,771 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.478e+02 1.788e+02 2.022e+02 2.239e+02 2.905e+02, threshold=4.044e+02, percent-clipped=0.0 2023-10-12 07:49:32,739 INFO [train.py:1031] (0/4) Epoch 16, batch 7000, loss[loss=0.2131, simple_loss=0.3004, pruned_loss=0.06293, over 16639.00 frames. 
], tot_loss[loss=0.1954, simple_loss=0.2859, pruned_loss=0.05243, over 31889348.47 frames. ], batch size: 220, lr: 2.18e-03, grad_scale: 16.0 2023-10-12 07:50:00,144 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=988614.6666666666, ans=0.95 2023-10-12 07:50:01,157 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=988614.6666666666, ans=0.2 2023-10-12 07:50:05,992 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.83 vs. limit=10.0 2023-10-12 07:50:28,003 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=988708.0, ans=0.2 2023-10-12 07:51:05,470 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=988848.0, ans=0.1 2023-10-12 07:51:09,694 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=988894.6666666666, ans=0.125 2023-10-12 07:51:27,090 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.748e+02 1.924e+02 2.105e+02 2.634e+02, threshold=3.847e+02, percent-clipped=0.0 2023-10-12 07:51:33,459 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=988988.0, ans=0.1 2023-10-12 07:51:54,286 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.90 vs. limit=6.0 2023-10-12 07:51:55,361 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=989081.3333333334, ans=0.125 2023-10-12 07:52:00,273 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.27 vs. limit=15.0 2023-10-12 07:52:19,581 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=989174.6666666666, ans=0.05 2023-10-12 07:52:33,069 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.29 vs. limit=15.0 2023-10-12 07:52:41,983 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-10-12 07:53:07,340 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=989361.3333333334, ans=0.0 2023-10-12 07:53:15,412 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=989408.0, ans=0.125 2023-10-12 07:53:19,082 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.479e+02 1.758e+02 1.988e+02 2.188e+02 3.079e+02, threshold=3.977e+02, percent-clipped=0.0 2023-10-12 07:53:43,566 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.54 vs. 
limit=15.0 2023-10-12 07:53:48,710 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=989501.3333333334, ans=0.125 2023-10-12 07:53:58,678 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=989548.0, ans=0.125 2023-10-12 07:54:01,860 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=989548.0, ans=0.125 2023-10-12 07:54:11,532 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=989594.6666666666, ans=0.05 2023-10-12 07:54:18,527 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=989641.3333333334, ans=0.125 2023-10-12 07:54:40,351 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=989688.0, ans=0.125 2023-10-12 07:54:42,452 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=989734.6666666666, ans=0.125 2023-10-12 07:54:53,206 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=989781.3333333334, ans=0.125 2023-10-12 07:54:54,451 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.01 vs. limit=15.0 2023-10-12 07:55:14,021 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=989828.0, ans=0.1 2023-10-12 07:55:29,051 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.352e+02 1.710e+02 1.886e+02 2.046e+02 3.321e+02, threshold=3.771e+02, percent-clipped=0.0 2023-10-12 07:55:37,964 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=989968.0, ans=0.125 2023-10-12 07:55:50,597 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=989968.0, ans=0.125 2023-10-12 07:55:58,207 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=990014.6666666666, ans=0.2 2023-10-12 07:56:00,608 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=990014.6666666666, ans=0.0 2023-10-12 07:56:02,185 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.63 vs. limit=12.0 2023-10-12 07:56:19,256 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=990061.3333333334, ans=0.125 2023-10-12 07:56:29,475 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=990108.0, ans=0.0 2023-10-12 07:56:29,658 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.59 vs. 
limit=15.0 2023-10-12 07:56:33,395 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=990154.6666666666, ans=0.1 2023-10-12 07:57:03,501 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.64 vs. limit=10.0 2023-10-12 07:57:20,621 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.37 vs. limit=15.0 2023-10-12 07:57:28,781 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.328e+02 1.684e+02 1.847e+02 2.047e+02 3.622e+02, threshold=3.694e+02, percent-clipped=0.0 2023-10-12 07:57:35,704 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=990388.0, ans=0.2 2023-10-12 07:58:14,252 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=990574.6666666666, ans=0.1 2023-10-12 07:58:15,032 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=990574.6666666666, ans=0.2 2023-10-12 07:58:21,681 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=990574.6666666666, ans=0.05 2023-10-12 07:58:27,647 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=990621.3333333334, ans=0.0 2023-10-12 07:58:38,014 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.23 vs. limit=15.0 2023-10-12 07:58:58,155 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=990761.3333333334, ans=0.0 2023-10-12 07:59:20,180 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.486e+02 1.752e+02 1.965e+02 2.156e+02 2.768e+02, threshold=3.930e+02, percent-clipped=0.0 2023-10-12 07:59:20,250 INFO [train.py:1031] (0/4) Epoch 16, batch 7500, loss[loss=0.1917, simple_loss=0.2836, pruned_loss=0.04988, over 17001.00 frames. ], tot_loss[loss=0.1953, simple_loss=0.2857, pruned_loss=0.05245, over 32068328.18 frames. ], batch size: 123, lr: 2.18e-03, grad_scale: 16.0 2023-10-12 07:59:34,004 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-10-12 07:59:40,892 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=990901.3333333334, ans=0.0 2023-10-12 07:59:41,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=990901.3333333334, ans=0.125 2023-10-12 07:59:43,089 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.10 vs. limit=10.0 2023-10-12 07:59:49,313 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.72 vs. 
limit=10.0 2023-10-12 08:00:11,127 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=991041.3333333334, ans=0.125 2023-10-12 08:00:13,968 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=991041.3333333334, ans=0.1 2023-10-12 08:00:20,080 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=991088.0, ans=0.0 2023-10-12 08:00:20,157 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=991088.0, ans=0.125 2023-10-12 08:00:20,321 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-10-12 08:01:12,786 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.459e+02 1.761e+02 1.959e+02 2.305e+02 3.215e+02, threshold=3.918e+02, percent-clipped=0.0 2023-10-12 08:01:18,036 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.72 vs. limit=22.5 2023-10-12 08:01:18,344 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=991321.3333333334, ans=0.05 2023-10-12 08:01:27,687 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.39 vs. limit=15.0 2023-10-12 08:01:42,095 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=991414.6666666666, ans=0.1 2023-10-12 08:01:42,099 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=991414.6666666666, ans=0.1 2023-10-12 08:01:46,529 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.87 vs. 
limit=10.0 2023-10-12 08:01:58,081 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=991461.3333333334, ans=0.0 2023-10-12 08:02:33,134 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=991601.3333333334, ans=0.035 2023-10-12 08:02:58,260 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=991694.6666666666, ans=0.2 2023-10-12 08:03:02,079 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=991694.6666666666, ans=0.125 2023-10-12 08:03:16,752 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.456e+02 1.726e+02 1.968e+02 2.205e+02 3.050e+02, threshold=3.935e+02, percent-clipped=0.0 2023-10-12 08:03:23,323 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=991788.0, ans=0.1 2023-10-12 08:03:25,523 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=991788.0, ans=0.125 2023-10-12 08:03:25,592 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=991788.0, ans=0.1 2023-10-12 08:03:31,668 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=991834.6666666666, ans=0.05 2023-10-12 08:03:35,588 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=991834.6666666666, ans=0.1 2023-10-12 08:03:43,169 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=991881.3333333334, ans=0.125 2023-10-12 08:03:52,792 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=991928.0, ans=0.125 2023-10-12 08:03:57,403 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=991928.0, ans=0.125 2023-10-12 08:04:00,235 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=991974.6666666666, ans=0.0 2023-10-12 08:04:01,213 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=991974.6666666666, ans=0.125 2023-10-12 08:04:32,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=992114.6666666666, ans=0.0 2023-10-12 08:04:44,735 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=992161.3333333334, ans=0.125 2023-10-12 08:05:07,654 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.763e+02 1.973e+02 2.119e+02 2.807e+02, threshold=3.947e+02, percent-clipped=0.0 2023-10-12 08:05:18,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=992301.3333333334, ans=0.1 2023-10-12 08:05:26,428 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=992301.3333333334, ans=0.125 2023-10-12 08:05:28,956 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=992301.3333333334, ans=0.125
2023-10-12 08:06:14,378 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=992488.0, ans=0.125
2023-10-12 08:06:16,604 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=992488.0, ans=0.125
2023-10-12 08:06:18,873 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.82 vs. limit=22.5
2023-10-12 08:06:35,019 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=992581.3333333334, ans=0.125
2023-10-12 08:06:47,359 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=992628.0, ans=0.125
2023-10-12 08:06:47,370 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=992628.0, ans=0.125
2023-10-12 08:07:07,562 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.456e+02 1.754e+02 1.947e+02 2.096e+02 3.192e+02, threshold=3.895e+02, percent-clipped=0.0
2023-10-12 08:07:16,436 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=992721.3333333334, ans=0.07
2023-10-12 08:07:24,675 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=992768.0, ans=0.125
2023-10-12 08:07:40,189 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=992814.6666666666, ans=0.1
2023-10-12 08:08:17,802 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=993001.3333333334, ans=0.1
2023-10-12 08:08:38,435 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=993048.0, ans=0.125
2023-10-12 08:08:58,385 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.65 vs. limit=10.0
2023-10-12 08:08:59,082 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=993141.3333333334, ans=0.1
2023-10-12 08:09:04,351 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=993141.3333333334, ans=0.025
2023-10-12 08:09:06,003 INFO [train.py:1031] (0/4) Epoch 16, batch 8000, loss[loss=0.1937, simple_loss=0.2857, pruned_loss=0.05085, over 16854.00 frames. ], tot_loss[loss=0.1944, simple_loss=0.285, pruned_loss=0.05191, over 32210149.30 frames. ], batch size: 116, lr: 2.18e-03, grad_scale: 32.0
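grad_scale in these summaries is the dynamic loss-scaling factor used for mixed-precision (fp16) training: the loss is multiplied by this factor before backprop so gradients do not underflow in half precision, the factor is cut when inf/nan gradients are detected and slowly regrown otherwise, which is why it alternates between values like 16.0 and 32.0 across this section. The standard PyTorch mechanism looks like the following sketch (the recipe's actual wiring in train.py may differ in detail; batch keys and function names here are illustrative):

    import torch

    scaler = torch.cuda.amp.GradScaler(enabled=True)

    def training_step(model, optimizer, batch, criterion):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast(enabled=True):
            loss = criterion(model(batch["inputs"]), batch["targets"])
        scaler.scale(loss).backward()  # backprop on the scaled loss
        scaler.step(optimizer)         # unscales grads; skips the step on inf/nan
        scaler.update()                # shrinks or regrows the scale factor
        # get_scale() is the value logged as grad_scale in the summaries above
        return loss.detach().item(), scaler.get_scale()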
2023-10-12 08:09:07,046 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.338e+02 1.635e+02 1.813e+02 1.989e+02 2.922e+02, threshold=3.626e+02, percent-clipped=0.0
2023-10-12 08:09:08,268 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=993188.0, ans=0.125
2023-10-12 08:09:13,641 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.66 vs. limit=15.0
2023-10-12 08:09:25,194 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.81 vs. limit=15.0
2023-10-12 08:09:33,999 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=993281.3333333334, ans=0.0
2023-10-12 08:09:41,583 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.50 vs. limit=22.5
2023-10-12 08:09:42,439 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=993328.0, ans=0.1
2023-10-12 08:10:06,804 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=993421.3333333334, ans=0.0
2023-10-12 08:10:08,498 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=993421.3333333334, ans=0.125
2023-10-12 08:10:18,443 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=993468.0, ans=0.125
2023-10-12 08:10:35,727 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=993561.3333333334, ans=0.2
2023-10-12 08:10:40,010 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.93 vs.
limit=6.0 2023-10-12 08:10:46,618 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=993608.0, ans=0.2 2023-10-12 08:10:57,069 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.672e+02 1.839e+02 2.250e+02 3.548e+02, threshold=3.679e+02, percent-clipped=0.0 2023-10-12 08:10:57,403 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=993654.6666666666, ans=0.125 2023-10-12 08:11:06,967 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=993701.3333333334, ans=0.0 2023-10-12 08:11:09,000 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=993701.3333333334, ans=0.2 2023-10-12 08:11:11,252 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=993701.3333333334, ans=0.125 2023-10-12 08:11:15,018 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=993701.3333333334, ans=0.0 2023-10-12 08:11:24,838 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=993748.0, ans=0.2 2023-10-12 08:11:44,759 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.20 vs. limit=22.5 2023-10-12 08:12:28,397 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=993934.6666666666, ans=0.125 2023-10-12 08:12:35,212 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=993981.3333333334, ans=0.125 2023-10-12 08:12:47,168 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.94 vs. 
limit=15.0 2023-10-12 08:13:03,284 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=994074.6666666666, ans=0.125 2023-10-12 08:13:10,471 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.462e+02 1.725e+02 1.880e+02 2.008e+02 2.715e+02, threshold=3.759e+02, percent-clipped=0.0 2023-10-12 08:13:23,630 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=994168.0, ans=0.0 2023-10-12 08:13:37,834 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=994214.6666666666, ans=0.125 2023-10-12 08:13:52,019 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=994308.0, ans=0.0 2023-10-12 08:14:12,736 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=994401.3333333334, ans=0.1 2023-10-12 08:14:15,669 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=994401.3333333334, ans=0.1 2023-10-12 08:14:41,420 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=994494.6666666666, ans=0.0 2023-10-12 08:14:46,609 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=994541.3333333334, ans=0.0 2023-10-12 08:15:00,906 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.759e+02 1.982e+02 2.179e+02 2.666e+02, threshold=3.965e+02, percent-clipped=0.0 2023-10-12 08:15:02,864 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=994588.0, ans=0.125 2023-10-12 08:15:06,343 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=994588.0, ans=0.0 2023-10-12 08:15:26,762 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.51 vs. limit=15.0 2023-10-12 08:15:30,084 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=994681.3333333334, ans=0.125 2023-10-12 08:15:39,754 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=994728.0, ans=0.125 2023-10-12 08:15:50,010 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=994774.6666666666, ans=0.0 2023-10-12 08:15:51,010 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 08:15:52,039 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=994774.6666666666, ans=0.125 2023-10-12 08:16:03,200 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.50 vs. 
limit=15.0 2023-10-12 08:16:21,480 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=994914.6666666666, ans=0.025 2023-10-12 08:16:48,544 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.501e+02 1.720e+02 1.857e+02 2.097e+02 3.011e+02, threshold=3.715e+02, percent-clipped=0.0 2023-10-12 08:16:49,436 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=995054.6666666666, ans=0.1 2023-10-12 08:16:53,116 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=995054.6666666666, ans=0.02 2023-10-12 08:17:02,495 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=995101.3333333334, ans=0.0 2023-10-12 08:17:24,151 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=995148.0, ans=0.125 2023-10-12 08:17:30,395 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.92 vs. limit=12.0 2023-10-12 08:17:42,575 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=995241.3333333334, ans=0.0 2023-10-12 08:17:52,647 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=995288.0, ans=0.1 2023-10-12 08:18:05,693 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=995334.6666666666, ans=0.1 2023-10-12 08:18:21,745 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=995428.0, ans=0.125 2023-10-12 08:18:35,147 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=995474.6666666666, ans=0.0 2023-10-12 08:18:49,053 INFO [train.py:1031] (0/4) Epoch 16, batch 8500, loss[loss=0.1805, simple_loss=0.2773, pruned_loss=0.04191, over 16851.00 frames. ], tot_loss[loss=0.1945, simple_loss=0.2852, pruned_loss=0.05193, over 32335312.59 frames. 
2023-10-12 08:18:50,854 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.769e+02 1.910e+02 2.169e+02 3.720e+02, threshold=3.821e+02, percent-clipped=1.0
2023-10-12 08:18:52,760 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-12 08:18:53,781 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=995521.3333333334, ans=0.1
2023-10-12 08:18:57,462 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=995521.3333333334, ans=0.125
2023-10-12 08:18:59,504 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=995568.0, ans=0.125
2023-10-12 08:19:21,834 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=995661.3333333334, ans=0.0
2023-10-12 08:19:35,041 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=995708.0, ans=0.125
2023-10-12 08:19:46,412 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.54 vs. limit=15.0
2023-10-12 08:19:48,015 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.31 vs. limit=12.0
2023-10-12 08:19:50,324 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=995754.6666666666, ans=0.1
2023-10-12 08:19:53,950 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.13 vs. limit=22.5
2023-10-12 08:20:30,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=995894.6666666666, ans=0.125
2023-10-12 08:20:31,955 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=995894.6666666666, ans=0.0
2023-10-12 08:20:33,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=995941.3333333334, ans=0.125
2023-10-12 08:20:35,421 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.87 vs. limit=12.0
2023-10-12 08:20:48,680 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=995988.0, ans=0.2
2023-10-12 08:20:53,238 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.425e+02 1.872e+02 2.048e+02 2.400e+02 3.218e+02, threshold=4.096e+02, percent-clipped=0.0
2023-10-12 08:21:07,505 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=5.28 vs. limit=15.0
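The [optim.py:471] lines give a five-number summary (min, 25th percentile, median, 75th percentile, max) of recent gradient norms. In every entry the logged threshold is Clipping_scale times the logged median (for the first entry above, 2.0 * 1.910e+02 ≈ 3.821e+02), and percent-clipped reports how often batches exceeded it; that same entry shows a rare nonzero percent-clipped=1.0. A small sketch of this bookkeeping, assuming a plain sliding window rather than whatever smoothing optim.py actually applies:

```python
from collections import deque
import numpy as np

class GradNormMonitor:
    """Tracks recent grad norms; clipping threshold = scale * median."""

    def __init__(self, clipping_scale: float = 2.0, window: int = 500):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)
        self.num_seen = 0
        self.num_clipped = 0

    def update(self, grad_norm: float) -> float:
        self.norms.append(grad_norm)
        threshold = self.clipping_scale * float(np.median(self.norms))
        self.num_seen += 1
        self.num_clipped += grad_norm > threshold
        return threshold  # caller rescales gradients down to this norm

    def summary(self) -> str:
        q = np.percentile(self.norms, [0, 25, 50, 75, 100])
        pct = 100.0 * self.num_clipped / max(1, self.num_seen)
        return ("Clipping_scale=%.1f, grad-norm quartiles %s, "
                "threshold=%.3e, percent-clipped=%.1f"
                % (self.clipping_scale,
                   " ".join("%.3e" % v for v in q),
                   self.clipping_scale * q[2], pct))
```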
2023-10-12 08:21:26,296 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=996128.0, ans=0.0
2023-10-12 08:21:26,445 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=996128.0, ans=0.0
2023-10-12 08:21:36,513 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.31 vs. limit=22.5
2023-10-12 08:21:38,081 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.65 vs. limit=15.0
2023-10-12 08:21:44,907 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.13 vs. limit=12.0
2023-10-12 08:22:04,774 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=996268.0, ans=0.1
2023-10-12 08:22:07,166 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=996268.0, ans=0.2
2023-10-12 08:22:55,094 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.281e+02 1.626e+02 1.785e+02 1.960e+02 2.853e+02, threshold=3.570e+02, percent-clipped=0.0
2023-10-12 08:22:57,993 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=996454.6666666666, ans=0.0
2023-10-12 08:22:58,039 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=996454.6666666666, ans=0.025
2023-10-12 08:23:03,377 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.68 vs. limit=10.0
2023-10-12 08:23:17,403 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=996548.0, ans=0.125
2023-10-12 08:23:29,886 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=996594.6666666666, ans=0.2
2023-10-12 08:23:31,056 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.83 vs. limit=15.0
2023-10-12 08:23:37,783 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.74 vs. limit=15.0
2023-10-12 08:23:38,998 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.46 vs. limit=15.0
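The Whitening lines from [scaling.py:979] are emitted when a Whiten module's decorrelation statistic exceeds its limit; the limit itself is scheduled, as the whitening_limit ScheduledFloat entries elsewhere in this log show. The exact metric is defined in icefall's scaling.py; as a rough stand-in, the sketch below uses a standard whiteness measure, C * tr(Σ²) / tr(Σ)², computed per channel group. It equals 1.0 exactly when a group's covariance is a multiple of the identity (the features are "white") and grows with anisotropy, so comparing it against a limit such as 15.0 flags strongly correlated channels.

```python
import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    # x: (num_frames, num_channels) activations. For each channel group,
    # form the covariance Sigma and measure anisotropy as
    # C * tr(Sigma^2) / tr(Sigma)^2, which is >= 1 with equality iff Sigma
    # is isotropic. An illustrative proxy, not the statistic in scaling.py.
    n, c = x.shape
    cpg = c // num_groups
    worst = 0.0
    for g in range(num_groups):
        xg = x[:, g * cpg:(g + 1) * cpg]
        cov = (xg.t() @ xg) / n
        tr = torch.diagonal(cov).sum()
        tr_sq = (cov * cov).sum()  # equals tr(Sigma^2) for symmetric Sigma
        worst = max(worst, (cpg * tr_sq / (tr * tr + 1e-20)).item())
    return worst

x = torch.randn(1000, 384)   # near-white features
print(whitening_metric(x))   # close to 1.0, far below a limit of 15.0
```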
2023-10-12 08:23:44,492 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=996641.3333333334, ans=10.0
2023-10-12 08:23:53,722 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=996688.0, ans=0.125
2023-10-12 08:24:36,921 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=996828.0, ans=0.0
2023-10-12 08:24:40,300 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=996874.6666666666, ans=0.125
2023-10-12 08:24:40,687 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn2.whiten.whitening_limit, batch_count=996874.6666666666, ans=22.5
2023-10-12 08:24:52,872 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.338e+02 1.646e+02 1.818e+02 2.058e+02 3.166e+02, threshold=3.635e+02, percent-clipped=0.0
2023-10-12 08:25:21,000 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=997014.6666666666, ans=0.0
2023-10-12 08:25:45,726 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=997154.6666666666, ans=0.0
2023-10-12 08:26:29,564 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=997341.3333333334, ans=0.125
2023-10-12 08:26:42,294 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.456e+02 1.732e+02 1.904e+02 2.087e+02 2.879e+02, threshold=3.807e+02, percent-clipped=0.0
2023-10-12 08:26:43,777 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=5.63 vs. limit=15.0
2023-10-12 08:26:43,792 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.91 vs. limit=10.0
2023-10-12 08:27:02,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=997481.3333333334, ans=0.0
2023-10-12 08:27:07,024 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.36 vs. limit=15.0
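In the batch summaries, loss[... over N frames] is the current batch while tot_loss[... over M frames] is a running frame-weighted average. The cumulative frame counts are fractional and grow far more slowly than the raw throughput (about 1e5 frames between the batch 8500 summary above and the batch 9000 summary just below, versus roughly 8M frames actually processed), which indicates that older batches are decayed rather than summed outright. A sketch of such an exponentially forgotten, frame-weighted tracker follows; the decay constant is an assumption, not a value taken from train.py.

```python
class RunningLoss:
    """Frame-weighted running loss with exponential forgetting.

    A guess at the bookkeeping behind tot_loss[...]; decay is illustrative.
    """

    def __init__(self, decay: float = 0.999):
        self.decay = decay
        self.loss_sum = 0.0    # decayed sum of loss * frames
        self.frame_sum = 0.0   # decayed sum of frames

    def update(self, batch_loss: float, num_frames: float) -> None:
        # Shrink the history, then add this batch weighted by its frames.
        self.loss_sum = self.decay * self.loss_sum + batch_loss * num_frames
        self.frame_sum = self.decay * self.frame_sum + num_frames

    @property
    def value(self) -> float:
        return self.loss_sum / max(self.frame_sum, 1.0)
```

At steady state the effective frame count settles near num_frames / (1 - decay), which would explain why the logged "over M frames" plateaus in the tens of millions instead of growing without bound.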
2023-10-12 08:27:13,644 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=997528.0, ans=0.125
2023-10-12 08:27:13,679 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=997528.0, ans=0.0
2023-10-12 08:27:20,691 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=997528.0, ans=0.5
2023-10-12 08:27:26,071 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=997574.6666666666, ans=0.5
2023-10-12 08:27:26,116 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=997574.6666666666, ans=0.125
2023-10-12 08:27:36,607 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=997621.3333333334, ans=0.125
2023-10-12 08:27:43,629 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=997621.3333333334, ans=0.1
2023-10-12 08:27:47,762 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=997668.0, ans=0.125
2023-10-12 08:28:06,982 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=997714.6666666666, ans=0.0
2023-10-12 08:28:07,944 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=997761.3333333334, ans=0.2
2023-10-12 08:28:12,293 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=997761.3333333334, ans=0.125
2023-10-12 08:28:13,308 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=997761.3333333334, ans=0.04949747468305833
2023-10-12 08:28:25,521 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=997808.0, ans=0.125
2023-10-12 08:28:27,948 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=997808.0, ans=0.0
2023-10-12 08:28:29,473 INFO [train.py:1031] (0/4) Epoch 16, batch 9000, loss[loss=0.1921, simple_loss=0.29, pruned_loss=0.04707, over 16933.00 frames. ], tot_loss[loss=0.1942, simple_loss=0.2848, pruned_loss=0.05183, over 32438228.67 frames. ], batch size: 93, lr: 2.17e-03, grad_scale: 16.0
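grad_scale in the summaries is the dynamic loss scale used for mixed-precision training; it dropped from 32.0 at batch 8500 to 16.0 at batch 9000 just above, the signature of the scaler halving after a gradient overflow and then growing back only slowly. icefall manages this inside its own training loop, so the snippet below is only the generic PyTorch pattern that produces this behavior (the model, data, and learning rate are placeholders; a CUDA device is required):

```python
import torch
import torch.nn.functional as F

device = "cuda"
model = torch.nn.Linear(80, 500).to(device)           # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=2.17e-03)
scaler = torch.cuda.amp.GradScaler()                  # owns grad_scale

batches = [(torch.randn(50, 80, device=device),
            torch.randint(0, 500, (50,), device=device))] * 10

for x, y in batches:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                   # fp16 forward pass
        loss = F.cross_entropy(model(x), y)
    scaler.scale(loss).backward()    # scale up to avoid fp16 underflow
    scaler.step(optimizer)           # unscales; skips the step on inf/nan
    scaler.update()                  # halves the scale after an overflow,
                                     # slowly increases it otherwise

print(scaler.get_scale())            # e.g. 32.0 or 16.0, as in the log
```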
2023-10-12 08:28:34,020 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.464e+02 1.673e+02 1.854e+02 1.998e+02 3.055e+02, threshold=3.707e+02, percent-clipped=0.0
2023-10-12 08:28:56,537 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=997948.0, ans=0.05
2023-10-12 08:29:08,425 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=997994.6666666666, ans=0.125
2023-10-12 08:29:08,442 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=997994.6666666666, ans=0.1
2023-10-12 08:29:08,582 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=997994.6666666666, ans=0.125
2023-10-12 08:29:14,243 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0
2023-10-12 08:29:30,746 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=998088.0, ans=0.0
2023-10-12 08:29:44,614 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=998181.3333333334, ans=22.5
2023-10-12 08:30:03,350 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=998228.0, ans=0.0
2023-10-12 08:30:07,024 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=998274.6666666666, ans=0.125
2023-10-12 08:30:07,957 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=998274.6666666666, ans=0.125
2023-10-12 08:30:12,560 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=998274.6666666666, ans=0.1
2023-10-12 08:30:19,683 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.386e+02 1.747e+02 1.882e+02 2.041e+02 2.714e+02, threshold=3.764e+02, percent-clipped=0.0
2023-10-12 08:30:20,012 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=998321.3333333334, ans=0.07
2023-10-12 08:30:21,004 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-12 08:30:21,036 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=998321.3333333334, ans=0.2
2023-10-12 08:31:01,135 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=998508.0, ans=0.125
2023-10-12 08:31:03,706 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=998508.0, ans=0.5
2023-10-12 08:31:14,424 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=998554.6666666666, ans=0.0
2023-10-12 08:31:15,347 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=998554.6666666666, ans=0.125
2023-10-12 08:31:20,728 INFO
[scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=998601.3333333334, ans=0.0 2023-10-12 08:31:43,377 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=998694.6666666666, ans=0.125 2023-10-12 08:31:43,564 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.89 vs. limit=15.0 2023-10-12 08:31:46,226 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=998694.6666666666, ans=0.2 2023-10-12 08:31:56,583 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.11 vs. limit=15.0 2023-10-12 08:32:03,650 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.448e+02 1.805e+02 1.932e+02 2.120e+02 3.278e+02, threshold=3.865e+02, percent-clipped=0.0 2023-10-12 08:32:28,668 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=998881.3333333334, ans=0.0 2023-10-12 08:32:33,659 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=998928.0, ans=0.125 2023-10-12 08:32:39,410 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=998928.0, ans=0.2 2023-10-12 08:32:50,861 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.69 vs. limit=15.0 2023-10-12 08:32:58,499 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=999021.3333333334, ans=0.125 2023-10-12 08:33:06,346 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=999068.0, ans=0.2 2023-10-12 08:33:19,586 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=999114.6666666666, ans=0.2 2023-10-12 08:33:21,963 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.29 vs. limit=15.0 2023-10-12 08:33:23,210 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=999114.6666666666, ans=0.1 2023-10-12 08:33:27,590 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.35 vs. 
limit=10.0 2023-10-12 08:33:37,048 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=999208.0, ans=0.1 2023-10-12 08:33:45,921 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=999208.0, ans=0.0 2023-10-12 08:33:52,121 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.758e+02 1.917e+02 2.137e+02 3.201e+02, threshold=3.834e+02, percent-clipped=0.0 2023-10-12 08:34:03,625 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=999301.3333333334, ans=0.0 2023-10-12 08:34:12,273 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=999348.0, ans=0.125 2023-10-12 08:34:20,149 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.out_whiten.whitening_limit, batch_count=999348.0, ans=8.0 2023-10-12 08:34:24,311 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.56 vs. limit=6.0 2023-10-12 08:34:40,161 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=999441.3333333334, ans=0.125 2023-10-12 08:34:50,180 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.27 vs. limit=15.0 2023-10-12 08:34:59,186 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.09 vs. limit=22.5 2023-10-12 08:35:04,961 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 08:35:14,754 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.12 vs. limit=15.0 2023-10-12 08:35:27,366 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=5.125e-03 2023-10-12 08:35:29,235 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.94 vs. limit=15.0 2023-10-12 08:35:55,899 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.398e+02 1.760e+02 1.930e+02 2.138e+02 2.816e+02, threshold=3.860e+02, percent-clipped=0.0 2023-10-12 08:35:57,373 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 08:36:02,732 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=999768.0, ans=0.125 2023-10-12 08:36:22,244 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=999814.6666666666, ans=0.1 2023-10-12 08:36:47,607 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=999908.0, ans=0.125 2023-10-12 08:36:55,166 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.84 vs. 
limit=22.5 2023-10-12 08:37:17,470 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1000048.0, ans=0.125 2023-10-12 08:37:34,508 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1000094.6666666666, ans=0.025 2023-10-12 08:37:35,646 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1000094.6666666666, ans=10.0 2023-10-12 08:37:49,640 INFO [train.py:1031] (0/4) Epoch 16, batch 9500, loss[loss=0.2046, simple_loss=0.3055, pruned_loss=0.05184, over 16857.00 frames. ], tot_loss[loss=0.1949, simple_loss=0.2855, pruned_loss=0.05217, over 32492892.21 frames. ], batch size: 175, lr: 2.17e-03, grad_scale: 32.0 2023-10-12 08:37:53,534 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.744e+02 1.912e+02 2.113e+02 3.274e+02, threshold=3.823e+02, percent-clipped=0.0 2023-10-12 08:38:08,328 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.99 vs. limit=15.0 2023-10-12 08:38:20,966 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1000328.0, ans=0.0 2023-10-12 08:38:21,355 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.19 vs. limit=15.0 2023-10-12 08:38:22,455 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1000328.0, ans=0.0 2023-10-12 08:38:27,041 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1000328.0, ans=0.125 2023-10-12 08:38:36,359 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.83 vs. 
limit=15.0 2023-10-12 08:38:37,254 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1000374.6666666666, ans=0.125 2023-10-12 08:38:38,292 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1000374.6666666666, ans=0.2 2023-10-12 08:38:39,182 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1000374.6666666666, ans=0.2 2023-10-12 08:39:20,970 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1000561.3333333334, ans=0.0 2023-10-12 08:39:28,105 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1000608.0, ans=0.2 2023-10-12 08:39:40,089 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 08:39:44,448 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.782e+02 1.948e+02 2.242e+02 3.446e+02, threshold=3.897e+02, percent-clipped=0.0 2023-10-12 08:39:50,661 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1000701.3333333334, ans=0.125 2023-10-12 08:39:51,022 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.49 vs. limit=15.0 2023-10-12 08:39:54,494 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.92 vs. limit=15.0 2023-10-12 08:39:55,616 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.07 vs. limit=15.0 2023-10-12 08:39:59,471 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1000701.3333333334, ans=0.1 2023-10-12 08:40:00,093 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1000701.3333333334, ans=0.125 2023-10-12 08:40:09,778 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.36 vs. limit=15.0 2023-10-12 08:40:11,795 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.52 vs. limit=12.0 2023-10-12 08:40:12,316 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1000748.0, ans=0.0 2023-10-12 08:40:20,809 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1000794.6666666666, ans=0.04949747468305833 2023-10-12 08:40:22,165 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.98 vs. limit=15.0 2023-10-12 08:40:25,673 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.74 vs. 
limit=15.0 2023-10-12 08:40:30,665 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1000841.3333333334, ans=0.125 2023-10-12 08:40:32,085 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.23 vs. limit=15.0 2023-10-12 08:40:42,954 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1000888.0, ans=0.125 2023-10-12 08:41:28,607 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.12 vs. limit=15.0 2023-10-12 08:41:32,963 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.98 vs. limit=15.0 2023-10-12 08:41:38,624 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.707e+02 1.840e+02 2.021e+02 2.708e+02, threshold=3.681e+02, percent-clipped=0.0 2023-10-12 08:41:38,898 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1001121.3333333334, ans=0.2 2023-10-12 08:41:58,093 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1001214.6666666666, ans=0.1 2023-10-12 08:42:19,882 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1001308.0, ans=0.125 2023-10-12 08:42:26,547 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1001354.6666666666, ans=0.125 2023-10-12 08:42:32,245 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1001354.6666666666, ans=0.125 2023-10-12 08:42:35,221 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1001354.6666666666, ans=0.0 2023-10-12 08:42:41,986 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1001401.3333333334, ans=0.1 2023-10-12 08:43:02,178 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.55 vs. limit=6.0 2023-10-12 08:43:20,908 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.14 vs. limit=15.0 2023-10-12 08:43:27,212 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1001588.0, ans=0.125 2023-10-12 08:43:30,374 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.736e+02 1.891e+02 2.112e+02 3.037e+02, threshold=3.782e+02, percent-clipped=0.0 2023-10-12 08:43:44,125 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1001634.6666666666, ans=0.125 2023-10-12 08:44:05,900 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1001728.0, ans=0.0 2023-10-12 08:44:09,759 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.53 vs. 
limit=15.0 2023-10-12 08:44:31,976 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1001868.0, ans=0.125 2023-10-12 08:44:32,827 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1001868.0, ans=0.0 2023-10-12 08:44:41,527 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1001914.6666666666, ans=0.0 2023-10-12 08:44:41,816 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.68 vs. limit=6.0 2023-10-12 08:44:45,782 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1001914.6666666666, ans=0.0 2023-10-12 08:45:07,901 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1002008.0, ans=0.05 2023-10-12 08:45:19,580 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1002054.6666666666, ans=0.0 2023-10-12 08:45:21,717 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.460e+02 1.640e+02 1.822e+02 1.991e+02 3.118e+02, threshold=3.645e+02, percent-clipped=0.0 2023-10-12 08:45:25,618 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.79 vs. limit=15.0 2023-10-12 08:45:29,454 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.99 vs. limit=15.0 2023-10-12 08:46:08,121 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1002288.0, ans=0.0 2023-10-12 08:46:08,344 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.07 vs. limit=15.0 2023-10-12 08:46:13,070 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1002288.0, ans=0.125 2023-10-12 08:46:14,907 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1002288.0, ans=0.0 2023-10-12 08:46:40,250 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=1002428.0, ans=0.5 2023-10-12 08:46:49,780 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.72 vs. limit=15.0 2023-10-12 08:46:59,961 INFO [train.py:1031] (0/4) Epoch 16, batch 10000, loss[loss=0.2046, simple_loss=0.2893, pruned_loss=0.05997, over 15548.00 frames. ], tot_loss[loss=0.1942, simple_loss=0.2847, pruned_loss=0.05186, over 32545069.12 frames. 
], batch size: 35, lr: 2.17e-03, grad_scale: 16.0 2023-10-12 08:47:06,677 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.776e+02 1.934e+02 2.130e+02 3.412e+02, threshold=3.869e+02, percent-clipped=0.0 2023-10-12 08:47:10,316 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1002568.0, ans=0.0 2023-10-12 08:47:11,347 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1002568.0, ans=0.0 2023-10-12 08:47:22,231 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1002614.6666666666, ans=0.125 2023-10-12 08:47:29,239 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1002614.6666666666, ans=0.1 2023-10-12 08:47:59,625 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=15.33 vs. limit=15.0 2023-10-12 08:48:06,699 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.46 vs. limit=12.0 2023-10-12 08:48:30,125 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 08:48:42,013 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.97 vs. limit=15.0 2023-10-12 08:48:58,760 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.484e+02 1.771e+02 1.929e+02 2.157e+02 3.084e+02, threshold=3.857e+02, percent-clipped=0.0 2023-10-12 08:49:07,824 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten.whitening_limit, batch_count=1003034.6666666666, ans=15.0 2023-10-12 08:49:23,402 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1003081.3333333334, ans=0.125 2023-10-12 08:49:26,759 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1003128.0, ans=0.2 2023-10-12 08:49:34,729 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1003128.0, ans=0.1 2023-10-12 08:49:37,425 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1003174.6666666666, ans=0.025 2023-10-12 08:49:40,781 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1003174.6666666666, ans=0.125 2023-10-12 08:49:46,760 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1003221.3333333334, ans=0.125 2023-10-12 08:49:59,914 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1003268.0, ans=0.125 2023-10-12 08:50:00,755 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1003268.0, ans=0.0 2023-10-12 08:50:04,119 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1003268.0, ans=0.09899494936611666 2023-10-12 08:50:13,437 INFO 
[scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1003314.6666666666, ans=0.0 2023-10-12 08:50:18,245 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1003314.6666666666, ans=0.1 2023-10-12 08:50:50,863 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.707e+02 1.827e+02 2.030e+02 3.084e+02, threshold=3.654e+02, percent-clipped=0.0 2023-10-12 08:51:06,585 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1003501.3333333334, ans=0.0 2023-10-12 08:51:11,838 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1003548.0, ans=0.1 2023-10-12 08:51:25,770 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1003594.6666666666, ans=0.0 2023-10-12 08:51:26,864 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1003594.6666666666, ans=0.2 2023-10-12 08:51:36,525 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=1003641.3333333334, ans=0.05 2023-10-12 08:51:41,029 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.78 vs. limit=10.0 2023-10-12 08:51:50,077 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1003688.0, ans=10.0 2023-10-12 08:51:55,654 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1003734.6666666666, ans=0.0 2023-10-12 08:52:03,495 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1003734.6666666666, ans=0.07 2023-10-12 08:52:08,222 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1003781.3333333334, ans=0.0 2023-10-12 08:52:13,979 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1003781.3333333334, ans=0.0 2023-10-12 08:52:13,987 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1003781.3333333334, ans=0.125 2023-10-12 08:52:26,203 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.63 vs. limit=15.0 2023-10-12 08:52:28,926 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.67 vs. 
limit=6.0 2023-10-12 08:52:33,805 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1003874.6666666666, ans=0.0 2023-10-12 08:52:46,562 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1003921.3333333334, ans=0.125 2023-10-12 08:52:49,796 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.380e+02 1.718e+02 1.866e+02 2.048e+02 3.006e+02, threshold=3.732e+02, percent-clipped=0.0 2023-10-12 08:53:05,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1004014.6666666666, ans=0.1 2023-10-12 08:53:23,714 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.66 vs. limit=15.0 2023-10-12 08:54:19,618 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1004294.6666666666, ans=0.0 2023-10-12 08:54:22,638 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.96 vs. limit=6.0 2023-10-12 08:54:32,355 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1004341.3333333334, ans=0.0 2023-10-12 08:54:35,621 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1004341.3333333334, ans=0.125 2023-10-12 08:54:46,201 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.385e+02 1.777e+02 1.914e+02 2.136e+02 3.654e+02, threshold=3.828e+02, percent-clipped=0.0 2023-10-12 08:54:55,916 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1004434.6666666666, ans=0.04949747468305833 2023-10-12 08:54:59,969 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1004481.3333333334, ans=0.125 2023-10-12 08:55:20,090 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.39 vs. 
limit=15.0 2023-10-12 08:55:36,022 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1004621.3333333334, ans=0.125 2023-10-12 08:55:43,189 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1004621.3333333334, ans=0.5 2023-10-12 08:55:46,910 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1004668.0, ans=0.0 2023-10-12 08:55:47,597 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1004668.0, ans=0.0 2023-10-12 08:55:57,249 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1004668.0, ans=0.1 2023-10-12 08:56:00,201 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1004714.6666666666, ans=0.0 2023-10-12 08:56:19,644 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1004761.3333333334, ans=0.0 2023-10-12 08:56:30,450 INFO [train.py:1031] (0/4) Epoch 16, batch 10500, loss[loss=0.2018, simple_loss=0.2899, pruned_loss=0.05689, over 16897.00 frames. ], tot_loss[loss=0.1944, simple_loss=0.2851, pruned_loss=0.05185, over 32601111.76 frames. ], batch size: 72, lr: 2.17e-03, grad_scale: 16.0 2023-10-12 08:56:38,455 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.705e+02 1.889e+02 2.105e+02 2.689e+02, threshold=3.778e+02, percent-clipped=0.0 2023-10-12 08:56:43,099 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1004901.3333333334, ans=0.0 2023-10-12 08:56:43,979 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1004901.3333333334, ans=0.125 2023-10-12 08:57:05,037 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1004994.6666666666, ans=0.125 2023-10-12 08:57:14,981 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.02 vs. limit=15.0 2023-10-12 08:57:16,877 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.51 vs. limit=15.0 2023-10-12 08:57:16,937 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.04 vs. 
limit=12.0 2023-10-12 08:57:21,125 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1005088.0, ans=0.125 2023-10-12 08:57:21,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1005088.0, ans=0.125 2023-10-12 08:57:32,189 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1005088.0, ans=0.125 2023-10-12 08:57:54,207 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1005181.3333333334, ans=0.125 2023-10-12 08:57:54,485 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=4.54 vs. limit=12.0 2023-10-12 08:58:15,325 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1005274.6666666666, ans=0.125 2023-10-12 08:58:17,846 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1005274.6666666666, ans=0.0 2023-10-12 08:58:21,051 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1005274.6666666666, ans=0.125 2023-10-12 08:58:23,919 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.54 vs. limit=22.5 2023-10-12 08:58:25,623 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1005274.6666666666, ans=0.09899494936611666 2023-10-12 08:58:34,864 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.741e+02 1.855e+02 2.083e+02 2.812e+02, threshold=3.711e+02, percent-clipped=0.0 2023-10-12 08:58:44,969 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1005368.0, ans=0.0 2023-10-12 08:58:51,579 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1005414.6666666666, ans=0.1 2023-10-12 08:58:56,539 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1005414.6666666666, ans=0.1 2023-10-12 08:59:00,289 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1005461.3333333334, ans=0.125 2023-10-12 08:59:01,365 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1005461.3333333334, ans=0.2 2023-10-12 08:59:05,326 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1005461.3333333334, ans=0.1 2023-10-12 08:59:08,612 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.74 vs. 
limit=15.0 2023-10-12 08:59:09,357 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1005508.0, ans=0.5 2023-10-12 08:59:27,661 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1005554.6666666666, ans=0.1 2023-10-12 08:59:30,868 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1005601.3333333334, ans=0.0 2023-10-12 08:59:42,776 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.92 vs. limit=6.0 2023-10-12 09:00:06,793 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.03 vs. limit=15.0 2023-10-12 09:00:27,014 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1005788.0, ans=0.0 2023-10-12 09:00:29,160 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.415e+02 1.752e+02 1.937e+02 2.193e+02 3.672e+02, threshold=3.874e+02, percent-clipped=0.0 2023-10-12 09:00:32,910 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1005834.6666666666, ans=0.125 2023-10-12 09:00:35,991 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1005834.6666666666, ans=0.0 2023-10-12 09:00:37,959 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.63 vs. limit=10.0 2023-10-12 09:00:47,548 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1005881.3333333334, ans=0.2 2023-10-12 09:00:53,568 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1005881.3333333334, ans=0.0 2023-10-12 09:01:18,573 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1006021.3333333334, ans=0.1 2023-10-12 09:01:22,722 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1006021.3333333334, ans=0.0 2023-10-12 09:01:37,733 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1006068.0, ans=0.125 2023-10-12 09:01:44,856 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.30 vs. limit=15.0 2023-10-12 09:01:56,281 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.42 vs. limit=15.0 2023-10-12 09:02:13,261 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.90 vs. 
limit=15.0 2023-10-12 09:02:18,899 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-12 09:02:22,507 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.446e+02 1.769e+02 1.877e+02 2.188e+02 3.282e+02, threshold=3.754e+02, percent-clipped=0.0 2023-10-12 09:02:41,174 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1006348.0, ans=0.2 2023-10-12 09:02:46,949 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1006394.6666666666, ans=0.07 2023-10-12 09:02:59,614 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1006441.3333333334, ans=0.0 2023-10-12 09:03:10,877 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1006488.0, ans=0.025 2023-10-12 09:03:14,714 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1006488.0, ans=0.125 2023-10-12 09:03:18,366 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1006534.6666666666, ans=0.125 2023-10-12 09:03:31,447 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1006581.3333333334, ans=0.2 2023-10-12 09:03:53,193 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1006628.0, ans=0.035 2023-10-12 09:04:06,876 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.40 vs. limit=15.0 2023-10-12 09:04:14,376 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.404e+02 1.595e+02 1.768e+02 2.018e+02 2.913e+02, threshold=3.537e+02, percent-clipped=0.0 2023-10-12 09:04:17,385 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1006768.0, ans=0.0 2023-10-12 09:04:23,001 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1006768.0, ans=0.0 2023-10-12 09:04:29,416 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=6.02 vs. limit=15.0 2023-10-12 09:05:01,930 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1006954.6666666666, ans=0.125 2023-10-12 09:05:51,805 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1007141.3333333334, ans=0.125 2023-10-12 09:05:55,519 INFO [train.py:1031] (0/4) Epoch 16, batch 11000, loss[loss=0.2108, simple_loss=0.308, pruned_loss=0.05683, over 16882.00 frames. ], tot_loss[loss=0.1944, simple_loss=0.2849, pruned_loss=0.05191, over 32592891.07 frames. ], batch size: 116, lr: 2.16e-03, grad_scale: 32.0 2023-10-12 09:06:02,265 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.04 vs. 
limit=10.0
2023-10-12 09:06:02,762 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.361e+02 1.860e+02 2.060e+02 2.363e+02 3.305e+02, threshold=4.120e+02, percent-clipped=0.0
2023-10-12 09:06:04,520 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1007188.0, ans=0.025
2023-10-12 09:06:15,213 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1007234.6666666666, ans=0.125
2023-10-12 09:06:38,332 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.55 vs. limit=10.0
2023-10-12 09:06:46,060 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.23 vs. limit=15.0
2023-10-12 09:06:58,428 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1007421.3333333334, ans=0.125
2023-10-12 09:07:03,723 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1007468.0, ans=0.5
2023-10-12 09:07:13,794 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1007468.0, ans=0.125
2023-10-12 09:07:24,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1007514.6666666666, ans=0.125
2023-10-12 09:07:30,610 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1007561.3333333334, ans=0.125
2023-10-12 09:07:40,748 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1007608.0, ans=0.125
2023-10-12 09:07:57,446 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1007654.6666666666, ans=0.0
2023-10-12 09:07:59,742 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.372e+02 1.697e+02 1.925e+02 2.135e+02 3.411e+02, threshold=3.850e+02, percent-clipped=0.0
2023-10-12 09:08:13,509 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1007701.3333333334, ans=0.125
2023-10-12 09:08:20,668 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1007748.0, ans=0.1
2023-10-12 09:08:21,583 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1007748.0, ans=0.125
2023-10-12 09:08:32,872 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1007794.6666666666, ans=0.125
2023-10-12 09:08:45,836 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1007841.3333333334, ans=0.0
2023-10-12 09:09:17,547 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-216000.pt
2023-10-12 09:09:33,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1008028.0, ans=0.125
2023-10-12 09:09:51,309 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1008121.3333333334, ans=0.0
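The [checkpoint.py:75] entry above records a periodic batch-level snapshot, named by the global training batch index (checkpoint-216000.pt) rather than by epoch, so a long epoch over the XL subset can be resumed mid-stream. A minimal sketch of that pattern follows; the interval and the saved fields are assumptions for illustration, not a copy of icefall's checkpoint.py.

```python
from pathlib import Path
import torch

def maybe_save_checkpoint(
    model: torch.nn.Module,
    optimizer: torch.optim.Optimizer,
    batch_idx_train: int,
    exp_dir: Path,
    save_every_n: int = 8000,  # assumed interval
) -> None:
    # Periodically snapshot model + optimizer state so training can resume
    # mid-epoch; files are named by global batch index, e.g.
    # zipformer/exp_XL_bpe/checkpoint-216000.pt as logged above.
    if batch_idx_train == 0 or batch_idx_train % save_every_n != 0:
        return
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "batch_idx_train": batch_idx_train,
        },
        exp_dir / f"checkpoint-{batch_idx_train}.pt",
    )
```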
name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1008121.3333333334, ans=0.0 2023-10-12 09:09:52,271 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1008121.3333333334, ans=0.1 2023-10-12 09:09:58,379 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.357e+02 1.635e+02 1.787e+02 2.079e+02 3.244e+02, threshold=3.573e+02, percent-clipped=0.0 2023-10-12 09:10:10,785 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1008214.6666666666, ans=0.09899494936611666 2023-10-12 09:10:14,513 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.71 vs. limit=12.0 2023-10-12 09:10:33,468 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1008308.0, ans=0.015 2023-10-12 09:10:52,556 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-12 09:10:53,862 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1008354.6666666666, ans=0.125 2023-10-12 09:11:08,887 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1008401.3333333334, ans=0.0 2023-10-12 09:11:26,404 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.86 vs. limit=15.0 2023-10-12 09:11:27,431 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1008494.6666666666, ans=0.1 2023-10-12 09:11:27,691 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.33 vs. limit=15.0 2023-10-12 09:11:30,327 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1008494.6666666666, ans=0.125 2023-10-12 09:11:30,672 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.83 vs. limit=15.0 2023-10-12 09:11:43,640 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.12 vs. limit=15.0 2023-10-12 09:11:49,152 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1008588.0, ans=0.07 2023-10-12 09:11:52,078 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1008588.0, ans=0.125 2023-10-12 09:11:55,299 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.752e+02 1.933e+02 2.140e+02 3.243e+02, threshold=3.866e+02, percent-clipped=0.0 2023-10-12 09:11:58,663 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1008634.6666666666, ans=0.0 2023-10-12 09:12:06,050 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.34 vs. 
limit=22.5 2023-10-12 09:12:09,063 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.60 vs. limit=12.0 2023-10-12 09:12:37,095 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1008774.6666666666, ans=0.125 2023-10-12 09:12:53,893 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1008821.3333333334, ans=0.0 2023-10-12 09:12:58,435 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 09:13:00,795 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.80 vs. limit=22.5 2023-10-12 09:13:02,604 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1008868.0, ans=0.1 2023-10-12 09:13:16,288 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.20 vs. limit=22.5 2023-10-12 09:13:22,165 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 09:13:46,892 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.804e+02 1.965e+02 2.230e+02 2.923e+02, threshold=3.929e+02, percent-clipped=0.0 2023-10-12 09:13:48,843 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1009054.6666666666, ans=0.0 2023-10-12 09:13:53,349 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1009101.3333333334, ans=0.125 2023-10-12 09:13:53,776 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.33 vs. 
limit=22.5 2023-10-12 09:13:59,320 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1009101.3333333334, ans=0.125 2023-10-12 09:14:01,739 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 09:14:03,731 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1009148.0, ans=0.125 2023-10-12 09:14:15,391 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1009194.6666666666, ans=0.125 2023-10-12 09:14:25,438 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1009194.6666666666, ans=0.125 2023-10-12 09:14:42,131 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1009288.0, ans=0.125 2023-10-12 09:14:50,243 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1009334.6666666666, ans=0.05 2023-10-12 09:15:09,538 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 09:15:13,806 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 09:15:23,755 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.43 vs. limit=15.0 2023-10-12 09:15:31,518 INFO [train.py:1031] (0/4) Epoch 16, batch 11500, loss[loss=0.2322, simple_loss=0.3089, pruned_loss=0.07778, over 15663.00 frames. ], tot_loss[loss=0.1945, simple_loss=0.2849, pruned_loss=0.05199, over 32637944.29 frames. ], batch size: 350, lr: 2.16e-03, grad_scale: 32.0 2023-10-12 09:15:38,856 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 1.765e+02 1.964e+02 2.149e+02 3.230e+02, threshold=3.929e+02, percent-clipped=0.0 2023-10-12 09:15:42,611 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1009568.0, ans=0.125 2023-10-12 09:15:46,741 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1009568.0, ans=0.125 2023-10-12 09:15:56,386 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1009614.6666666666, ans=0.125 2023-10-12 09:16:08,269 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1009661.3333333334, ans=0.0 2023-10-12 09:16:20,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1009661.3333333334, ans=0.125 2023-10-12 09:16:46,979 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1009801.3333333334, ans=0.2 2023-10-12 09:16:54,590 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.75 vs. 
limit=15.0 2023-10-12 09:17:39,834 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1009988.0, ans=0.035 2023-10-12 09:17:43,609 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.454e+02 1.788e+02 2.032e+02 2.246e+02 3.203e+02, threshold=4.064e+02, percent-clipped=0.0 2023-10-12 09:17:46,496 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1010034.6666666666, ans=0.125 2023-10-12 09:17:49,343 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1010034.6666666666, ans=0.07 2023-10-12 09:17:49,611 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.20 vs. limit=6.0 2023-10-12 09:18:11,335 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.29 vs. limit=22.5 2023-10-12 09:18:31,092 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1010221.3333333334, ans=0.1 2023-10-12 09:18:38,355 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1010221.3333333334, ans=0.2 2023-10-12 09:18:40,367 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1010268.0, ans=0.09899494936611666 2023-10-12 09:18:48,383 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=1010268.0, ans=0.05 2023-10-12 09:19:04,897 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1010361.3333333334, ans=0.0 2023-10-12 09:19:13,759 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1010408.0, ans=0.0 2023-10-12 09:19:17,651 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1010408.0, ans=0.0 2023-10-12 09:19:21,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1010408.0, ans=0.125 2023-10-12 09:19:25,779 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1010454.6666666666, ans=0.125 2023-10-12 09:19:31,559 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.659e+02 1.848e+02 1.990e+02 2.899e+02, threshold=3.697e+02, percent-clipped=0.0 2023-10-12 09:19:36,296 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1010501.3333333334, ans=0.125 2023-10-12 09:19:52,221 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1010548.0, ans=0.125 2023-10-12 09:19:57,851 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1010594.6666666666, ans=0.125 2023-10-12 09:19:57,925 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1010594.6666666666, ans=0.05 2023-10-12 
09:19:57,933 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1010594.6666666666, ans=0.125 2023-10-12 09:20:00,318 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1010594.6666666666, ans=0.125 2023-10-12 09:20:09,657 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1010641.3333333334, ans=0.0 2023-10-12 09:20:22,211 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.58 vs. limit=15.0 2023-10-12 09:20:24,024 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.06 vs. limit=12.0 2023-10-12 09:20:37,356 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=6.79 vs. limit=15.0 2023-10-12 09:20:46,124 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.77 vs. limit=15.0 2023-10-12 09:21:12,475 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1010828.0, ans=0.09899494936611666 2023-10-12 09:21:25,009 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1010874.6666666666, ans=0.2 2023-10-12 09:21:35,509 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.354e+02 1.709e+02 1.866e+02 2.133e+02 2.961e+02, threshold=3.732e+02, percent-clipped=0.0 2023-10-12 09:21:37,944 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.00 vs. limit=22.5 2023-10-12 09:22:00,176 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1011014.6666666666, ans=0.0 2023-10-12 09:22:05,434 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1011061.3333333334, ans=0.0 2023-10-12 09:22:11,346 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1011061.3333333334, ans=0.0 2023-10-12 09:22:11,451 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1011061.3333333334, ans=0.1 2023-10-12 09:22:18,973 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.34 vs. limit=6.0 2023-10-12 09:22:20,283 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1011108.0, ans=0.0 2023-10-12 09:23:37,094 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.344e+02 1.747e+02 1.921e+02 2.164e+02 2.933e+02, threshold=3.843e+02, percent-clipped=0.0 2023-10-12 09:23:53,984 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.86 vs. 
limit=12.0 2023-10-12 09:24:06,510 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1011528.0, ans=0.1 2023-10-12 09:24:10,125 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1011528.0, ans=0.125 2023-10-12 09:24:18,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=1011574.6666666666, ans=10.0 2023-10-12 09:24:42,964 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.60 vs. limit=6.0 2023-10-12 09:25:16,441 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1011808.0, ans=0.125 2023-10-12 09:25:19,002 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1011808.0, ans=0.125 2023-10-12 09:25:23,690 INFO [train.py:1031] (0/4) Epoch 16, batch 12000, loss[loss=0.2582, simple_loss=0.3175, pruned_loss=0.09949, over 15602.00 frames. ], tot_loss[loss=0.1944, simple_loss=0.2852, pruned_loss=0.05184, over 32686715.41 frames. ], batch size: 350, lr: 2.16e-03, grad_scale: 32.0 2023-10-12 09:25:30,595 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1011854.6666666666, ans=0.125 2023-10-12 09:25:34,654 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.368e+02 1.720e+02 1.883e+02 2.167e+02 3.151e+02, threshold=3.766e+02, percent-clipped=0.0 2023-10-12 09:25:50,170 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1011948.0, ans=0.2 2023-10-12 09:25:52,139 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1011948.0, ans=0.0 2023-10-12 09:26:15,560 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.08 vs. limit=6.0 2023-10-12 09:26:27,509 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1012088.0, ans=0.04949747468305833 2023-10-12 09:26:44,903 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1012181.3333333334, ans=0.0 2023-10-12 09:26:45,682 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1012181.3333333334, ans=0.035 2023-10-12 09:26:53,042 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1012181.3333333334, ans=0.0 2023-10-12 09:27:02,124 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1012228.0, ans=0.125 2023-10-12 09:27:14,327 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.86 vs. 
limit=10.0 2023-10-12 09:27:15,528 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1012274.6666666666, ans=0.125 2023-10-12 09:27:24,032 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.74 vs. limit=10.0 2023-10-12 09:27:26,306 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.352e+02 1.698e+02 1.943e+02 2.263e+02 3.329e+02, threshold=3.886e+02, percent-clipped=0.0 2023-10-12 09:27:30,902 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1012368.0, ans=0.125 2023-10-12 09:27:32,409 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.60 vs. limit=15.0 2023-10-12 09:27:35,785 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.22 vs. limit=15.0 2023-10-12 09:27:45,271 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.88 vs. limit=10.0 2023-10-12 09:27:49,552 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1012461.3333333334, ans=0.07 2023-10-12 09:27:53,165 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1012461.3333333334, ans=0.0 2023-10-12 09:28:05,347 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1012508.0, ans=0.125 2023-10-12 09:28:09,971 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1012508.0, ans=0.125 2023-10-12 09:28:18,807 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.30 vs. limit=15.0 2023-10-12 09:28:29,882 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.13 vs. 
limit=6.0 2023-10-12 09:28:43,747 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1012694.6666666666, ans=0.0 2023-10-12 09:28:56,505 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1012741.3333333334, ans=0.125 2023-10-12 09:29:15,770 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.746e+02 1.954e+02 2.236e+02 3.777e+02, threshold=3.909e+02, percent-clipped=0.0 2023-10-12 09:29:16,085 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1012834.6666666666, ans=0.2 2023-10-12 09:29:38,620 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1012881.3333333334, ans=0.125 2023-10-12 09:30:18,360 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1013068.0, ans=0.05 2023-10-12 09:30:23,601 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1013114.6666666666, ans=0.1 2023-10-12 09:30:32,553 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1013114.6666666666, ans=0.125 2023-10-12 09:30:40,278 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1013161.3333333334, ans=10.0 2023-10-12 09:30:44,652 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1013161.3333333334, ans=0.125 2023-10-12 09:30:59,126 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1013254.6666666666, ans=0.2 2023-10-12 09:31:05,979 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1013254.6666666666, ans=0.125 2023-10-12 09:31:08,375 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.352e+02 1.740e+02 1.958e+02 2.137e+02 2.610e+02, threshold=3.916e+02, percent-clipped=0.0 2023-10-12 09:31:24,904 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1013348.0, ans=0.1 2023-10-12 09:31:35,866 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1013394.6666666666, ans=0.125 2023-10-12 09:31:46,592 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.74 vs. 
limit=15.0 2023-10-12 09:31:49,748 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1013441.3333333334, ans=0.125 2023-10-12 09:31:58,538 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1013488.0, ans=0.125 2023-10-12 09:32:01,676 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1013488.0, ans=0.125 2023-10-12 09:32:12,852 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1013534.6666666666, ans=0.0 2023-10-12 09:32:21,373 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1013581.3333333334, ans=0.0 2023-10-12 09:32:34,931 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.26 vs. limit=6.0 2023-10-12 09:32:35,046 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.05 vs. limit=10.0 2023-10-12 09:32:51,168 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1013674.6666666666, ans=0.0 2023-10-12 09:33:02,225 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.96 vs. limit=12.0 2023-10-12 09:33:02,822 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.555e+02 1.761e+02 1.928e+02 2.131e+02 2.863e+02, threshold=3.857e+02, percent-clipped=0.0 2023-10-12 09:33:03,175 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1013768.0, ans=0.125 2023-10-12 09:33:05,907 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 09:33:17,754 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1013814.6666666666, ans=0.0 2023-10-12 09:33:28,613 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1013861.3333333334, ans=0.1 2023-10-12 09:33:45,540 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 09:33:51,897 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1013954.6666666666, ans=0.125 2023-10-12 09:33:55,740 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 09:34:09,178 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 09:34:10,311 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1014001.3333333334, ans=0.0 2023-10-12 09:34:10,418 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1014001.3333333334, ans=0.5 2023-10-12 09:34:13,805 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.26 vs. 
limit=22.5 2023-10-12 09:34:17,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1014048.0, ans=0.0 2023-10-12 09:34:23,401 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1014094.6666666666, ans=0.125 2023-10-12 09:34:24,231 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1014094.6666666666, ans=0.125 2023-10-12 09:34:26,210 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1014094.6666666666, ans=0.0 2023-10-12 09:34:46,859 INFO [train.py:1031] (0/4) Epoch 16, batch 12500, loss[loss=0.2033, simple_loss=0.2956, pruned_loss=0.05549, over 16895.00 frames. ], tot_loss[loss=0.1941, simple_loss=0.2848, pruned_loss=0.05173, over 32721931.72 frames. ], batch size: 110, lr: 2.16e-03, grad_scale: 32.0 2023-10-12 09:34:47,203 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1014188.0, ans=0.125 2023-10-12 09:34:52,112 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1014188.0, ans=0.1 2023-10-12 09:34:57,621 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.372e+02 1.760e+02 2.014e+02 2.324e+02 3.104e+02, threshold=4.029e+02, percent-clipped=0.0 2023-10-12 09:35:14,164 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1014281.3333333334, ans=0.125 2023-10-12 09:35:26,836 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=6.14 vs. limit=12.0 2023-10-12 09:35:48,905 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1014421.3333333334, ans=0.0 2023-10-12 09:36:08,486 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.28 vs. limit=15.0 2023-10-12 09:36:42,903 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.43 vs. limit=22.5 2023-10-12 09:36:46,592 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1014654.6666666666, ans=0.125 2023-10-12 09:36:50,237 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.670e+02 1.844e+02 2.021e+02 2.772e+02, threshold=3.688e+02, percent-clipped=0.0 2023-10-12 09:36:56,336 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1014701.3333333334, ans=0.125 2023-10-12 09:37:02,913 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1014748.0, ans=0.0 2023-10-12 09:37:07,701 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1014748.0, ans=0.0 2023-10-12 09:37:14,422 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.24 vs. 
limit=12.0 2023-10-12 09:37:24,192 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1014841.3333333334, ans=0.1 2023-10-12 09:37:37,444 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1014888.0, ans=0.125 2023-10-12 09:38:05,593 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1014981.3333333334, ans=10.0 2023-10-12 09:38:12,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1015028.0, ans=0.125 2023-10-12 09:38:35,247 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1015121.3333333334, ans=0.0 2023-10-12 09:38:43,427 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.358e+02 1.715e+02 1.910e+02 2.160e+02 3.053e+02, threshold=3.820e+02, percent-clipped=0.0 2023-10-12 09:38:48,705 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.79 vs. limit=15.0 2023-10-12 09:39:01,712 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.00 vs. limit=15.0 2023-10-12 09:39:15,961 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1015308.0, ans=0.125 2023-10-12 09:39:16,204 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.66 vs. limit=15.0 2023-10-12 09:39:39,124 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1015401.3333333334, ans=0.0 2023-10-12 09:39:40,024 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1015401.3333333334, ans=0.95 2023-10-12 09:39:41,964 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1015401.3333333334, ans=0.125 2023-10-12 09:40:13,934 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1015494.6666666666, ans=0.125 2023-10-12 09:40:14,688 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1015494.6666666666, ans=0.1 2023-10-12 09:40:33,246 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.14 vs. limit=12.0 2023-10-12 09:40:43,736 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.360e+02 1.765e+02 1.949e+02 2.279e+02 3.238e+02, threshold=3.897e+02, percent-clipped=0.0 2023-10-12 09:40:50,347 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.06 vs. 
limit=22.5 2023-10-12 09:40:53,031 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1015681.3333333334, ans=0.125 2023-10-12 09:40:58,143 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1015681.3333333334, ans=0.025 2023-10-12 09:40:58,188 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1015681.3333333334, ans=0.125 2023-10-12 09:41:08,034 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1015728.0, ans=0.125 2023-10-12 09:41:45,572 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1015868.0, ans=0.2 2023-10-12 09:42:03,989 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1015961.3333333334, ans=0.0 2023-10-12 09:42:13,763 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1016008.0, ans=0.1 2023-10-12 09:42:18,665 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1016008.0, ans=0.1 2023-10-12 09:42:30,841 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.68 vs. limit=22.5 2023-10-12 09:42:37,200 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.406e+02 1.688e+02 1.936e+02 2.265e+02 4.302e+02, threshold=3.871e+02, percent-clipped=1.0 2023-10-12 09:43:32,055 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1016334.6666666666, ans=0.0 2023-10-12 09:43:35,693 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1016334.6666666666, ans=0.0 2023-10-12 09:43:40,280 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1016334.6666666666, ans=0.125 2023-10-12 09:44:16,020 INFO [train.py:1031] (0/4) Epoch 16, batch 13000, loss[loss=0.1802, simple_loss=0.2817, pruned_loss=0.03941, over 16846.00 frames. ], tot_loss[loss=0.1947, simple_loss=0.2856, pruned_loss=0.05193, over 32742064.94 frames. ], batch size: 87, lr: 2.15e-03, grad_scale: 16.0 2023-10-12 09:44:28,282 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.319e+02 1.703e+02 1.845e+02 2.129e+02 2.723e+02, threshold=3.691e+02, percent-clipped=0.0 2023-10-12 09:44:41,690 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.31 vs. limit=15.0 2023-10-12 09:44:44,814 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.04 vs. 
limit=10.0 2023-10-12 09:44:49,576 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1016614.6666666666, ans=0.125 2023-10-12 09:45:32,850 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1016801.3333333334, ans=0.125 2023-10-12 09:45:46,117 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.75 vs. limit=22.5 2023-10-12 09:45:56,035 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1016848.0, ans=0.0 2023-10-12 09:46:02,949 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1016894.6666666666, ans=0.1 2023-10-12 09:46:34,172 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.724e+02 1.899e+02 2.148e+02 3.210e+02, threshold=3.798e+02, percent-clipped=0.0 2023-10-12 09:46:34,476 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1017034.6666666666, ans=0.1 2023-10-12 09:46:52,500 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1017081.3333333334, ans=0.07 2023-10-12 09:47:09,494 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1017174.6666666666, ans=0.125 2023-10-12 09:47:24,900 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1017221.3333333334, ans=0.2 2023-10-12 09:48:09,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1017408.0, ans=0.125 2023-10-12 09:48:10,499 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1017408.0, ans=0.125 2023-10-12 09:48:22,225 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.64 vs. 
limit=15.0 2023-10-12 09:48:26,495 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1017454.6666666666, ans=0.1 2023-10-12 09:48:32,990 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1017501.3333333334, ans=0.125 2023-10-12 09:48:33,521 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.388e+02 1.670e+02 1.905e+02 2.095e+02 2.895e+02, threshold=3.810e+02, percent-clipped=0.0 2023-10-12 09:48:51,904 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1017548.0, ans=0.125 2023-10-12 09:49:02,443 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1017594.6666666666, ans=0.05 2023-10-12 09:49:09,355 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-12 09:49:26,745 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1017734.6666666666, ans=0.0 2023-10-12 09:49:48,869 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 09:49:54,611 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1017828.0, ans=0.125 2023-10-12 09:50:03,277 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1017874.6666666666, ans=0.1 2023-10-12 09:50:09,788 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.46 vs. limit=12.0 2023-10-12 09:50:11,917 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.92 vs. limit=22.5 2023-10-12 09:50:15,134 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.79 vs. limit=15.0 2023-10-12 09:50:21,921 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.740e+02 1.869e+02 2.083e+02 3.079e+02, threshold=3.739e+02, percent-clipped=0.0 2023-10-12 09:50:22,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1017968.0, ans=0.125 2023-10-12 09:50:33,363 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.39 vs. 
limit=22.5 2023-10-12 09:50:35,786 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1018014.6666666666, ans=0.1 2023-10-12 09:51:04,686 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1018154.6666666666, ans=0.1 2023-10-12 09:51:16,121 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1018201.3333333334, ans=0.2 2023-10-12 09:51:21,165 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1018201.3333333334, ans=0.0 2023-10-12 09:51:28,318 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1018248.0, ans=0.1 2023-10-12 09:51:29,271 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1018248.0, ans=0.2 2023-10-12 09:51:36,707 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1018294.6666666666, ans=0.125 2023-10-12 09:51:43,329 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1018294.6666666666, ans=0.2 2023-10-12 09:51:47,325 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1018341.3333333334, ans=0.0 2023-10-12 09:51:51,223 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1018341.3333333334, ans=0.125 2023-10-12 09:51:54,103 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1018341.3333333334, ans=0.2 2023-10-12 09:51:55,863 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1018341.3333333334, ans=0.125 2023-10-12 09:51:58,196 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.81 vs. limit=12.0 2023-10-12 09:52:01,378 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.91 vs. limit=15.0 2023-10-12 09:52:04,157 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.68 vs. limit=15.0 2023-10-12 09:52:07,461 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1018388.0, ans=0.04949747468305833 2023-10-12 09:52:11,856 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.501e+02 1.774e+02 1.938e+02 2.154e+02 3.080e+02, threshold=3.877e+02, percent-clipped=0.0 2023-10-12 09:52:30,516 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1018481.3333333334, ans=0.1 2023-10-12 09:52:47,206 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.98 vs. limit=6.0 2023-10-12 09:52:56,506 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.72 vs. 
limit=15.0 2023-10-12 09:53:19,690 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1018714.6666666666, ans=0.125 2023-10-12 09:53:39,509 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1018808.0, ans=0.125 2023-10-12 09:53:44,261 INFO [train.py:1031] (0/4) Epoch 16, batch 13500, loss[loss=0.222, simple_loss=0.307, pruned_loss=0.06851, over 16543.00 frames. ], tot_loss[loss=0.1942, simple_loss=0.285, pruned_loss=0.05173, over 32770622.76 frames. ], batch size: 219, lr: 2.15e-03, grad_scale: 16.0 2023-10-12 09:53:57,572 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.710e+02 1.854e+02 2.039e+02 2.657e+02, threshold=3.708e+02, percent-clipped=0.0 2023-10-12 09:54:00,486 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1018901.3333333334, ans=0.125 2023-10-12 09:54:09,773 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1018948.0, ans=22.5 2023-10-12 09:54:13,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1018948.0, ans=0.1 2023-10-12 09:54:20,709 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.25 vs. limit=15.0 2023-10-12 09:54:28,721 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1019041.3333333334, ans=0.125 2023-10-12 09:54:33,398 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1019041.3333333334, ans=0.1 2023-10-12 09:54:34,392 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1019041.3333333334, ans=0.1 2023-10-12 09:54:38,246 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1019088.0, ans=0.0 2023-10-12 09:54:42,251 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=13.05 vs. limit=15.0 2023-10-12 09:54:56,907 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1019134.6666666666, ans=0.2 2023-10-12 09:55:06,170 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1019181.3333333334, ans=0.0 2023-10-12 09:55:19,221 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.83 vs. 
limit=15.0 2023-10-12 09:55:20,910 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1019228.0, ans=15.0 2023-10-12 09:55:23,634 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1019274.6666666666, ans=0.125 2023-10-12 09:55:24,613 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1019274.6666666666, ans=0.0 2023-10-12 09:55:34,940 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1019321.3333333334, ans=0.0 2023-10-12 09:55:39,582 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1019321.3333333334, ans=0.125 2023-10-12 09:55:40,432 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1019321.3333333334, ans=0.125 2023-10-12 09:55:45,861 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.565e+02 1.835e+02 2.092e+02 2.498e+02 3.722e+02, threshold=4.183e+02, percent-clipped=1.0 2023-10-12 09:55:58,993 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1019414.6666666666, ans=0.125 2023-10-12 09:56:04,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1019461.3333333334, ans=0.09899494936611666 2023-10-12 09:56:26,850 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/epoch-16.pt 2023-10-12 09:56:53,609 INFO [train.py:1031] (0/4) Epoch 17, batch 0, loss[loss=0.173, simple_loss=0.2605, pruned_loss=0.04276, over 16812.00 frames. ], tot_loss[loss=0.173, simple_loss=0.2605, pruned_loss=0.04276, over 16812.00 frames. ], batch size: 146, lr: 2.08e-03, grad_scale: 32.0 2023-10-12 09:56:53,610 INFO [train.py:1054] (0/4) Computing validation loss 2023-10-12 09:56:57,529 INFO [zipformer.py:1853] (0/4) name=encoder.encoders.5.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([2.6043, 2.6968, 1.8622, 3.7634], device='cuda:0') 2023-10-12 09:57:00,662 INFO [train.py:1063] (0/4) Epoch 17, validation: loss=0.2156, simple_loss=0.3028, pruned_loss=0.06418, over 1020973.00 frames. 
2023-10-12 09:57:00,663 INFO [train.py:1064] (0/4) Maximum memory allocated so far is 17165MB 2023-10-12 09:57:21,218 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1019671.3333333334, ans=0.125 2023-10-12 09:57:22,159 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=1019671.3333333334, ans=0.95 2023-10-12 09:57:23,159 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1019671.3333333334, ans=0.0 2023-10-12 09:57:38,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1019718.0, ans=0.0 2023-10-12 09:57:41,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1019718.0, ans=0.0 2023-10-12 09:57:42,670 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1019718.0, ans=0.125 2023-10-12 09:57:43,953 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.52 vs. limit=22.5 2023-10-12 09:57:54,906 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.73 vs. limit=6.0 2023-10-12 09:57:56,628 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1019811.3333333334, ans=0.125 2023-10-12 09:58:04,262 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.354e+02 1.684e+02 1.856e+02 2.049e+02 2.950e+02, threshold=3.712e+02, percent-clipped=0.0 2023-10-12 09:58:09,800 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1019858.0, ans=0.125 2023-10-12 09:58:20,166 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1019904.6666666666, ans=0.1 2023-10-12 09:58:26,031 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1019904.6666666666, ans=0.125 2023-10-12 09:58:30,011 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.51 vs. 
limit=15.0 2023-10-12 09:58:35,582 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1019951.3333333334, ans=0.1 2023-10-12 09:58:42,814 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1019998.0, ans=0.0 2023-10-12 09:59:02,758 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1020091.3333333334, ans=0.04949747468305833 2023-10-12 09:59:03,718 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1020091.3333333334, ans=0.0 2023-10-12 09:59:36,323 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1020231.3333333334, ans=0.125 2023-10-12 09:59:54,648 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.324e+02 1.661e+02 1.802e+02 2.031e+02 3.220e+02, threshold=3.604e+02, percent-clipped=0.0 2023-10-12 10:00:11,631 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1020371.3333333334, ans=0.1 2023-10-12 10:00:21,259 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.11 vs. limit=12.0 2023-10-12 10:00:33,186 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1020464.6666666666, ans=0.125 2023-10-12 10:00:33,880 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1020464.6666666666, ans=0.125 2023-10-12 10:00:35,259 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.14 vs. 
limit=15.0 2023-10-12 10:00:42,247 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1020511.3333333334, ans=0.2 2023-10-12 10:00:42,394 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1020511.3333333334, ans=0.125 2023-10-12 10:00:53,968 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1020558.0, ans=10.0 2023-10-12 10:00:55,035 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1020558.0, ans=0.1 2023-10-12 10:00:56,096 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1020558.0, ans=0.125 2023-10-12 10:01:00,836 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1020604.6666666666, ans=0.2 2023-10-12 10:01:04,336 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1020604.6666666666, ans=0.1 2023-10-12 10:01:21,109 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 10:01:36,878 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1020744.6666666666, ans=0.125 2023-10-12 10:01:44,820 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1020744.6666666666, ans=0.125 2023-10-12 10:01:46,387 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.484e+02 1.723e+02 1.860e+02 2.090e+02 2.958e+02, threshold=3.720e+02, percent-clipped=0.0 2023-10-12 10:01:53,076 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.88 vs. limit=12.0 2023-10-12 10:02:00,701 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.15 vs. limit=12.0 2023-10-12 10:02:07,250 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1020838.0, ans=0.2 2023-10-12 10:02:21,437 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1020931.3333333334, ans=0.125 2023-10-12 10:02:26,658 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.47 vs. limit=22.5 2023-10-12 10:02:27,629 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.71 vs. limit=15.0 2023-10-12 10:02:39,020 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1020978.0, ans=0.0 2023-10-12 10:02:39,059 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1020978.0, ans=0.5 2023-10-12 10:02:40,211 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.40 vs. 
limit=10.0 2023-10-12 10:03:06,243 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1021118.0, ans=0.125 2023-10-12 10:03:27,350 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1021211.3333333334, ans=0.95 2023-10-12 10:03:32,920 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.748e+02 1.959e+02 2.174e+02 2.956e+02, threshold=3.918e+02, percent-clipped=0.0 2023-10-12 10:03:36,005 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1021258.0, ans=0.0 2023-10-12 10:03:49,948 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.08 vs. limit=6.0 2023-10-12 10:04:08,054 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1021398.0, ans=0.0 2023-10-12 10:04:14,731 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1021444.6666666666, ans=0.125 2023-10-12 10:04:21,438 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1021444.6666666666, ans=0.05 2023-10-12 10:04:21,824 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.43 vs. limit=22.5 2023-10-12 10:04:22,843 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 10:04:33,186 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1021491.3333333334, ans=0.0 2023-10-12 10:04:33,250 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1021491.3333333334, ans=0.125 2023-10-12 10:04:37,516 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1021491.3333333334, ans=0.125 2023-10-12 10:04:39,550 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.77 vs. limit=10.0 2023-10-12 10:04:50,972 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1021584.6666666666, ans=0.125 2023-10-12 10:04:59,599 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.74 vs. 
limit=15.0 2023-10-12 10:05:05,497 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1021631.3333333334, ans=0.125 2023-10-12 10:05:23,533 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.421e+02 1.702e+02 1.870e+02 2.098e+02 3.003e+02, threshold=3.739e+02, percent-clipped=0.0 2023-10-12 10:05:30,234 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1021724.6666666666, ans=0.0 2023-10-12 10:05:31,220 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1021724.6666666666, ans=0.0 2023-10-12 10:05:34,234 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.73 vs. limit=10.0 2023-10-12 10:05:43,310 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1021771.3333333334, ans=0.2 2023-10-12 10:06:10,340 INFO [train.py:1031] (0/4) Epoch 17, batch 500, loss[loss=0.167, simple_loss=0.2647, pruned_loss=0.03464, over 16872.00 frames. ], tot_loss[loss=0.1942, simple_loss=0.2849, pruned_loss=0.0518, over 7281206.18 frames. ], batch size: 165, lr: 2.08e-03, grad_scale: 32.0 2023-10-12 10:06:14,565 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1021911.3333333334, ans=0.0 2023-10-12 10:06:18,336 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1021911.3333333334, ans=0.125 2023-10-12 10:06:32,099 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1022004.6666666666, ans=0.1 2023-10-12 10:06:34,902 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1022004.6666666666, ans=0.125 2023-10-12 10:06:59,752 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1022098.0, ans=0.0 2023-10-12 10:07:04,714 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.44 vs. limit=15.0 2023-10-12 10:07:05,375 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1022144.6666666666, ans=0.1 2023-10-12 10:07:12,753 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.336e+02 1.788e+02 2.025e+02 2.302e+02 2.968e+02, threshold=4.049e+02, percent-clipped=0.0 2023-10-12 10:07:15,137 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1022191.3333333334, ans=0.125 2023-10-12 10:07:22,640 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1022191.3333333334, ans=0.0 2023-10-12 10:08:09,220 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.19 vs. 
limit=6.0 2023-10-12 10:08:15,849 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1022424.6666666666, ans=0.125 2023-10-12 10:08:20,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1022471.3333333334, ans=0.0 2023-10-12 10:08:28,715 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1022471.3333333334, ans=0.2 2023-10-12 10:08:30,796 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1022518.0, ans=0.125 2023-10-12 10:08:47,556 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.38 vs. limit=15.0 2023-10-12 10:08:57,010 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1022611.3333333334, ans=0.1 2023-10-12 10:09:01,914 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.776e+02 1.916e+02 2.180e+02 3.143e+02, threshold=3.833e+02, percent-clipped=0.0 2023-10-12 10:09:08,432 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1022658.0, ans=0.125 2023-10-12 10:09:30,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1022751.3333333334, ans=0.125 2023-10-12 10:09:45,936 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1022844.6666666666, ans=0.125 2023-10-12 10:10:01,041 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1022891.3333333334, ans=0.0 2023-10-12 10:10:06,894 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.whiten.whitening_limit, batch_count=1022938.0, ans=12.0 2023-10-12 10:10:24,294 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.40 vs. limit=15.0 2023-10-12 10:10:24,306 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.86 vs. limit=12.0 2023-10-12 10:10:33,216 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.58 vs. limit=15.0 2023-10-12 10:10:36,967 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1023031.3333333334, ans=0.1 2023-10-12 10:10:51,083 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1023078.0, ans=0.1 2023-10-12 10:10:53,201 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.787e+02 1.944e+02 2.153e+02 3.291e+02, threshold=3.888e+02, percent-clipped=0.0 2023-10-12 10:11:07,962 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.20 vs. 
limit=22.5 2023-10-12 10:11:50,675 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1023358.0, ans=0.125 2023-10-12 10:12:19,262 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1023451.3333333334, ans=0.0 2023-10-12 10:12:20,650 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1023451.3333333334, ans=0.0 2023-10-12 10:12:28,188 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1023498.0, ans=0.0 2023-10-12 10:12:35,423 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1023544.6666666666, ans=0.125 2023-10-12 10:12:38,506 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1023544.6666666666, ans=0.0 2023-10-12 10:12:47,019 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.718e+02 1.856e+02 2.043e+02 4.318e+02, threshold=3.711e+02, percent-clipped=1.0 2023-10-12 10:12:52,542 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.03 vs. limit=12.0 2023-10-12 10:13:04,298 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1023638.0, ans=0.125 2023-10-12 10:13:06,284 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1023638.0, ans=0.2 2023-10-12 10:13:10,546 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1023684.6666666666, ans=0.0 2023-10-12 10:13:16,139 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-10-12 10:13:36,077 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.52 vs. limit=22.5 2023-10-12 10:13:53,959 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=1023871.3333333334, ans=0.02 2023-10-12 10:14:08,075 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1023918.0, ans=0.125 2023-10-12 10:14:10,315 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1023918.0, ans=0.1 2023-10-12 10:14:11,598 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1023964.6666666666, ans=0.0 2023-10-12 10:14:28,665 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1024011.3333333334, ans=0.0 2023-10-12 10:14:36,539 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.700e+02 1.887e+02 2.175e+02 2.907e+02, threshold=3.773e+02, percent-clipped=0.0 2023-10-12 10:14:38,303 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.80 vs. 
limit=6.0 2023-10-12 10:15:16,949 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1024198.0, ans=0.0 2023-10-12 10:15:19,571 INFO [train.py:1031] (0/4) Epoch 17, batch 1000, loss[loss=0.1837, simple_loss=0.2737, pruned_loss=0.0468, over 16945.00 frames. ], tot_loss[loss=0.1944, simple_loss=0.2852, pruned_loss=0.05183, over 12922493.15 frames. ], batch size: 138, lr: 2.08e-03, grad_scale: 16.0 2023-10-12 10:15:25,359 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1024244.6666666666, ans=0.125 2023-10-12 10:15:35,746 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1024291.3333333334, ans=0.125 2023-10-12 10:15:44,630 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1024338.0, ans=0.0 2023-10-12 10:15:45,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1024338.0, ans=0.125 2023-10-12 10:16:02,240 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.64 vs. limit=12.0 2023-10-12 10:16:03,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1024431.3333333334, ans=0.125 2023-10-12 10:16:12,461 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1024478.0, ans=0.125 2023-10-12 10:16:20,666 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.725e+02 1.898e+02 2.216e+02 3.881e+02, threshold=3.796e+02, percent-clipped=1.0 2023-10-12 10:16:31,247 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1024571.3333333334, ans=0.2 2023-10-12 10:16:34,352 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.55 vs. limit=15.0 2023-10-12 10:16:36,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1024571.3333333334, ans=0.125 2023-10-12 10:16:38,600 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 10:17:05,612 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1024711.3333333334, ans=0.125 2023-10-12 10:17:16,161 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.06 vs. limit=22.5 2023-10-12 10:17:34,316 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1024804.6666666666, ans=0.1 2023-10-12 10:17:35,851 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1024804.6666666666, ans=0.0 2023-10-12 10:17:56,964 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.23 vs. 
limit=5.0 2023-10-12 10:18:04,091 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1024944.6666666666, ans=0.0 2023-10-12 10:18:12,713 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1024991.3333333334, ans=0.0 2023-10-12 10:18:13,409 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.752e+02 1.936e+02 2.247e+02 3.675e+02, threshold=3.872e+02, percent-clipped=0.0 2023-10-12 10:18:33,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1025038.0, ans=0.125 2023-10-12 10:18:57,558 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1025131.3333333334, ans=0.125 2023-10-12 10:19:19,366 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1025224.6666666666, ans=0.125 2023-10-12 10:19:25,804 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1025224.6666666666, ans=0.1 2023-10-12 10:20:09,939 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.782e+02 1.935e+02 2.155e+02 3.106e+02, threshold=3.870e+02, percent-clipped=0.0 2023-10-12 10:20:23,993 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1025504.6666666666, ans=0.95 2023-10-12 10:20:30,640 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1025551.3333333334, ans=0.125 2023-10-12 10:21:13,668 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1025738.0, ans=0.125 2023-10-12 10:21:24,041 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 10:21:32,929 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.66 vs. limit=15.0 2023-10-12 10:21:40,786 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1025831.3333333334, ans=0.2 2023-10-12 10:21:47,927 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1025878.0, ans=0.1 2023-10-12 10:21:56,166 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.365e+02 1.631e+02 1.861e+02 2.094e+02 2.940e+02, threshold=3.723e+02, percent-clipped=0.0 2023-10-12 10:22:25,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1026018.0, ans=0.0 2023-10-12 10:22:27,428 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.68 vs. 
limit=15.0 2023-10-12 10:22:38,849 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1026064.6666666666, ans=0.125 2023-10-12 10:22:41,483 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1026111.3333333334, ans=0.1 2023-10-12 10:22:44,254 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1026111.3333333334, ans=0.0 2023-10-12 10:23:18,183 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1026251.3333333334, ans=0.0 2023-10-12 10:23:18,405 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.01 vs. limit=15.0 2023-10-12 10:23:22,685 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1026251.3333333334, ans=0.125 2023-10-12 10:23:33,773 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.64 vs. limit=15.0 2023-10-12 10:23:44,321 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=7.75 vs. limit=22.5 2023-10-12 10:23:49,730 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.410e+02 1.744e+02 1.898e+02 2.138e+02 2.962e+02, threshold=3.795e+02, percent-clipped=0.0 2023-10-12 10:23:59,923 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1026438.0, ans=0.2 2023-10-12 10:24:17,674 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.40 vs. limit=5.0 2023-10-12 10:24:28,876 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1026531.3333333334, ans=0.0 2023-10-12 10:24:34,167 INFO [train.py:1031] (0/4) Epoch 17, batch 1500, loss[loss=0.1935, simple_loss=0.2804, pruned_loss=0.05327, over 16827.00 frames. ], tot_loss[loss=0.1933, simple_loss=0.284, pruned_loss=0.05137, over 17314165.54 frames. ], batch size: 146, lr: 2.07e-03, grad_scale: 16.0 2023-10-12 10:24:50,481 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.90 vs. 
limit=6.0 2023-10-12 10:24:53,447 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1026624.6666666666, ans=0.2 2023-10-12 10:25:02,175 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1026671.3333333334, ans=0.0 2023-10-12 10:25:08,575 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1026718.0, ans=0.1 2023-10-12 10:25:08,721 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1026718.0, ans=0.2 2023-10-12 10:25:18,618 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1026764.6666666666, ans=0.05 2023-10-12 10:25:21,436 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.78 vs. limit=22.5 2023-10-12 10:25:27,074 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1026811.3333333334, ans=0.1 2023-10-12 10:25:33,498 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 10:25:38,716 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.25 vs. limit=6.0 2023-10-12 10:25:41,589 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.307e+02 1.758e+02 1.908e+02 2.078e+02 3.026e+02, threshold=3.817e+02, percent-clipped=0.0 2023-10-12 10:26:07,760 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1026951.3333333334, ans=0.1 2023-10-12 10:26:11,280 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_na.min_abs, batch_count=1026951.3333333334, ans=0.02 2023-10-12 10:26:23,179 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1026998.0, ans=0.1 2023-10-12 10:26:25,109 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1027044.6666666666, ans=0.1 2023-10-12 10:26:35,752 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1027091.3333333334, ans=0.125 2023-10-12 10:26:40,525 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1027091.3333333334, ans=0.125 2023-10-12 10:26:49,678 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1027138.0, ans=0.125 2023-10-12 10:26:49,760 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1027138.0, ans=0.125 2023-10-12 10:26:52,189 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1027138.0, ans=0.0 2023-10-12 10:27:20,350 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1027231.3333333334, ans=0.1 2023-10-12 10:27:26,836 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1027278.0, ans=0.0 2023-10-12 10:27:30,790 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.91 vs. limit=15.0 2023-10-12 10:27:39,725 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.720e+02 1.918e+02 2.202e+02 3.123e+02, threshold=3.836e+02, percent-clipped=0.0 2023-10-12 10:27:59,744 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.79 vs. limit=6.0 2023-10-12 10:27:59,754 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.07 vs. limit=15.0 2023-10-12 10:28:04,373 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1027418.0, ans=0.1 2023-10-12 10:28:24,783 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1027511.3333333334, ans=0.125 2023-10-12 10:28:26,166 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.63 vs. limit=10.0 2023-10-12 10:28:46,193 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.08 vs. limit=15.0 2023-10-12 10:28:49,994 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1027558.0, ans=0.0 2023-10-12 10:28:50,414 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.41 vs. limit=22.5 2023-10-12 10:29:16,233 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1027698.0, ans=0.125 2023-10-12 10:29:20,680 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1027698.0, ans=0.2 2023-10-12 10:29:22,653 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1027698.0, ans=0.1 2023-10-12 10:29:23,677 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1027744.6666666666, ans=0.125 2023-10-12 10:29:37,064 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.454e+02 1.817e+02 1.981e+02 2.302e+02 3.285e+02, threshold=3.961e+02, percent-clipped=0.0 2023-10-12 10:29:48,961 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1027838.0, ans=0.125 2023-10-12 10:29:52,614 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.09 vs. 
limit=15.0 2023-10-12 10:29:57,095 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1027838.0, ans=0.0 2023-10-12 10:30:03,399 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1027884.6666666666, ans=0.05 2023-10-12 10:30:08,328 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=1027884.6666666666, ans=0.05 2023-10-12 10:30:18,731 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.62 vs. limit=15.0 2023-10-12 10:30:24,724 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1027978.0, ans=0.125 2023-10-12 10:30:41,998 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.70 vs. limit=15.0 2023-10-12 10:30:44,105 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1028024.6666666666, ans=0.1 2023-10-12 10:31:09,680 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1028164.6666666666, ans=0.125 2023-10-12 10:31:12,883 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 10:31:15,606 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1028164.6666666666, ans=0.0 2023-10-12 10:31:19,570 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1028211.3333333334, ans=0.125 2023-10-12 10:31:21,098 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=17.72 vs. limit=15.0 2023-10-12 10:31:32,150 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.460e+02 1.695e+02 1.850e+02 2.045e+02 3.272e+02, threshold=3.701e+02, percent-clipped=0.0 2023-10-12 10:32:24,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1028491.3333333334, ans=0.0 2023-10-12 10:32:37,659 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.39 vs. 
limit=15.0 2023-10-12 10:32:42,343 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1028538.0, ans=0.2 2023-10-12 10:32:50,996 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 10:33:02,440 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1028631.3333333334, ans=0.125 2023-10-12 10:33:04,503 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1028631.3333333334, ans=0.125 2023-10-12 10:33:11,582 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1028631.3333333334, ans=0.125 2023-10-12 10:33:25,100 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.66 vs. limit=12.0 2023-10-12 10:33:29,520 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.54 vs. limit=22.5 2023-10-12 10:33:29,756 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.481e+02 1.725e+02 1.881e+02 2.102e+02 2.935e+02, threshold=3.762e+02, percent-clipped=0.0 2023-10-12 10:33:57,630 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1028818.0, ans=0.125 2023-10-12 10:34:04,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1028864.6666666666, ans=0.125 2023-10-12 10:34:17,138 INFO [train.py:1031] (0/4) Epoch 17, batch 2000, loss[loss=0.1834, simple_loss=0.2503, pruned_loss=0.05824, over 12216.00 frames. ], tot_loss[loss=0.1937, simple_loss=0.2846, pruned_loss=0.05141, over 20756001.85 frames. 
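A note on the loss fields in the train.py records: each batch line carries an instantaneous loss[...] for the current batch and a frame-weighted running aggregate tot_loss[...]. Two things can be read off the numbers themselves. First, the combined loss is consistent with 0.5 * simple_loss + pruned_loss at this stage (for the batch above, 0.5 * 0.2503 + 0.05824 ≈ 0.1834, and 0.5 * 0.2846 + 0.05141 ≈ 0.1937 for the running total). Second, the tot_loss frame counts grow by less at each 500-batch interval, which points to an exponentially forgetting accumulator rather than a plain epoch sum. Both readings are inferred from the logged values, not from the code; a minimal sketch of that bookkeeping, with hypothetical names and an illustrative decay constant:

    # Sketch of the tot_loss[...] bookkeeping: a frame-weighted running
    # average with exponential forgetting. The decay value is illustrative,
    # not taken from train.py; decay=1.0 would give a plain epoch average.
    class RunningLoss:
        def __init__(self, decay: float = 0.999):
            self.decay = decay
            self.loss_sum = 0.0   # decayed sum of loss * frames
            self.frames = 0.0     # decayed sum of frames

        def update(self, loss: float, frames: float) -> None:
            self.loss_sum = self.decay * self.loss_sum + loss * frames
            self.frames = self.decay * self.frames + frames

        @property
        def average(self) -> float:
            return self.loss_sum / max(self.frames, 1.0)

    tot = RunningLoss()
    # The batch above: combined loss 0.5 * 0.2503 + 0.05824 over 12216 frames.
    tot.update(0.5 * 0.2503 + 0.05824, 12216.0)
    print(f"tot_loss={tot.average:.4f} over {tot.frames:.2f} frames")

Frame-weighting also explains why tot_loss moves so slowly here: a single ~12k-frame batch barely shifts an aggregate carrying tens of millions of frames.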
], batch size: 440, lr: 2.07e-03, grad_scale: 32.0 2023-10-12 10:34:27,078 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1028911.3333333334, ans=0.125 2023-10-12 10:35:36,695 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1029191.3333333334, ans=0.025 2023-10-12 10:35:38,361 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.349e+02 1.718e+02 1.871e+02 2.076e+02 2.686e+02, threshold=3.742e+02, percent-clipped=0.0 2023-10-12 10:36:03,076 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1029284.6666666666, ans=0.125 2023-10-12 10:36:03,096 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1029284.6666666666, ans=0.07 2023-10-12 10:36:07,517 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1029284.6666666666, ans=0.2 2023-10-12 10:36:22,143 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1029331.3333333334, ans=0.0 2023-10-12 10:36:49,938 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1029424.6666666666, ans=0.125 2023-10-12 10:37:17,601 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1029471.3333333334, ans=0.125 2023-10-12 10:37:39,769 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1029564.6666666666, ans=0.1 2023-10-12 10:37:40,005 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.02 vs. limit=22.5 2023-10-12 10:38:01,644 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 1.793e+02 1.992e+02 2.278e+02 3.646e+02, threshold=3.985e+02, percent-clipped=0.0 2023-10-12 10:38:28,778 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.48 vs. limit=15.0 2023-10-12 10:38:36,310 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1029798.0, ans=0.04949747468305833 2023-10-12 10:38:45,164 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1029844.6666666666, ans=0.125 2023-10-12 10:38:51,665 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1029844.6666666666, ans=0.0 2023-10-12 10:38:54,536 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1029891.3333333334, ans=0.0 2023-10-12 10:39:17,283 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.84 vs. 
limit=15.0 2023-10-12 10:39:49,000 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1030124.6666666666, ans=0.125 2023-10-12 10:39:52,972 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.384e+02 1.745e+02 1.889e+02 2.148e+02 3.098e+02, threshold=3.777e+02, percent-clipped=0.0 2023-10-12 10:40:02,674 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1030171.3333333334, ans=0.125 2023-10-12 10:40:19,536 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1030218.0, ans=0.0 2023-10-12 10:40:54,709 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1030404.6666666666, ans=0.125 2023-10-12 10:40:58,931 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1030404.6666666666, ans=0.125 2023-10-12 10:41:00,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1030404.6666666666, ans=0.125 2023-10-12 10:41:16,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1030498.0, ans=0.07 2023-10-12 10:41:19,390 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.09 vs. limit=15.0 2023-10-12 10:41:23,638 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1030498.0, ans=0.125 2023-10-12 10:41:40,806 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten.whitening_limit, batch_count=1030591.3333333334, ans=22.5 2023-10-12 10:41:41,116 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.768e+02 1.933e+02 2.176e+02 3.096e+02, threshold=3.866e+02, percent-clipped=0.0 2023-10-12 10:41:53,313 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.32 vs. limit=12.0 2023-10-12 10:41:54,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1030638.0, ans=0.125 2023-10-12 10:41:56,492 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1030638.0, ans=0.125 2023-10-12 10:42:05,513 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1030684.6666666666, ans=0.0 2023-10-12 10:42:17,992 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1030731.3333333334, ans=0.1 2023-10-12 10:42:25,397 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.43 vs. 
limit=12.0 2023-10-12 10:42:43,557 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1030871.3333333334, ans=0.1 2023-10-12 10:42:53,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1030918.0, ans=0.0 2023-10-12 10:43:00,256 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1030918.0, ans=0.0 2023-10-12 10:43:01,901 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_ff3.min_abs, batch_count=1030918.0, ans=0.2 2023-10-12 10:43:01,924 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1030918.0, ans=0.0 2023-10-12 10:43:26,495 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.01 vs. limit=22.5 2023-10-12 10:43:30,446 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.772e+02 1.905e+02 2.104e+02 2.979e+02, threshold=3.809e+02, percent-clipped=0.0 2023-10-12 10:43:55,636 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.61 vs. limit=22.5 2023-10-12 10:44:01,877 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1031198.0, ans=0.125 2023-10-12 10:44:04,335 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1031198.0, ans=0.1 2023-10-12 10:44:09,737 INFO [train.py:1031] (0/4) Epoch 17, batch 2500, loss[loss=0.1865, simple_loss=0.2791, pruned_loss=0.04689, over 16954.00 frames. ], tot_loss[loss=0.1941, simple_loss=0.2849, pruned_loss=0.05164, over 23439056.16 frames. 
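The optim.py Clipping_scale records are easier to follow once decoded: the five grad-norm values read like the min / 25% / 50% / 75% / max of recently observed gradient norms, the threshold consistently equals Clipping_scale times the median (2.0 * 1.905e+02 ≈ 3.809e+02 in the record above), and percent-clipped is the share of batches whose norm exceeded that threshold. That decoding is inferred from the logged numbers, not from optim.py itself; a sketch of the corresponding bookkeeping under those assumptions:

    # Sketch of the clipping diagnostics, with semantics inferred from the
    # logged values; not the actual icefall optim.py implementation.
    import torch

    def clipping_stats(grad_norms: torch.Tensor, clipping_scale: float = 2.0):
        # Five quantiles: min, 25%, median, 75%, max of recent grad norms.
        q = torch.quantile(grad_norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = clipping_scale * q[2]  # 2.0 * median, matching the log
        pct_clipped = 100.0 * (grad_norms > threshold).float().mean()
        return q, threshold, pct_clipped

    norms = torch.tensor([151.1, 177.2, 190.5, 210.4, 297.9])  # toy history
    q, thr, pct = clipping_stats(norms)
    print("quartiles", q.tolist(), "threshold", thr.item(), "percent-clipped", pct.item())

With a median-relative threshold, percent-clipped stays near 0.0 as long as the norm distribution keeps a tight spread, which matches the values logged throughout this epoch.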
], batch size: 165, lr: 2.07e-03, grad_scale: 32.0 2023-10-12 10:44:09,920 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=1031244.6666666666, ans=0.5 2023-10-12 10:44:09,960 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1031244.6666666666, ans=0.0 2023-10-12 10:44:38,368 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1031338.0, ans=0.125 2023-10-12 10:44:39,196 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1031338.0, ans=0.0 2023-10-12 10:44:59,424 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1031431.3333333334, ans=0.2 2023-10-12 10:45:14,298 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1031524.6666666666, ans=0.125 2023-10-12 10:45:14,881 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.508e+02 1.690e+02 1.830e+02 2.025e+02 2.731e+02, threshold=3.660e+02, percent-clipped=0.0 2023-10-12 10:45:27,127 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1031571.3333333334, ans=0.1 2023-10-12 10:45:55,043 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1031711.3333333334, ans=0.0 2023-10-12 10:45:55,053 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1031711.3333333334, ans=0.0 2023-10-12 10:46:06,809 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1031758.0, ans=0.125 2023-10-12 10:46:31,675 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.55 vs. 
limit=15.0 2023-10-12 10:46:57,265 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1031944.6666666666, ans=0.125 2023-10-12 10:46:57,318 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1031944.6666666666, ans=0.2 2023-10-12 10:47:04,632 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1031991.3333333334, ans=0.2 2023-10-12 10:47:05,373 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.459e+02 1.714e+02 1.937e+02 2.131e+02 2.812e+02, threshold=3.874e+02, percent-clipped=0.0 2023-10-12 10:47:32,512 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1032084.6666666666, ans=0.125 2023-10-12 10:47:43,839 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1032131.3333333334, ans=0.125 2023-10-12 10:47:46,814 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1032178.0, ans=0.0 2023-10-12 10:48:00,979 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1032224.6666666666, ans=0.125 2023-10-12 10:48:10,406 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1032271.3333333334, ans=0.1 2023-10-12 10:48:15,208 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1032271.3333333334, ans=0.0 2023-10-12 10:48:22,847 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1032318.0, ans=0.0 2023-10-12 10:48:31,002 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1032318.0, ans=0.0 2023-10-12 10:48:41,241 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1032364.6666666666, ans=0.2 2023-10-12 10:49:03,577 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.766e+02 1.884e+02 2.185e+02 3.320e+02, threshold=3.767e+02, percent-clipped=0.0 2023-10-12 10:49:07,530 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.56 vs. 
limit=12.0 2023-10-12 10:49:13,608 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1032504.6666666666, ans=0.125 2023-10-12 10:49:48,090 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1032644.6666666666, ans=0.09899494936611666 2023-10-12 10:49:50,877 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1032644.6666666666, ans=0.125 2023-10-12 10:49:55,135 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1032644.6666666666, ans=0.0 2023-10-12 10:50:00,303 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1032691.3333333334, ans=0.0 2023-10-12 10:50:02,054 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.82 vs. limit=15.0 2023-10-12 10:50:03,723 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_positive, batch_count=1032691.3333333334, ans=0.05 2023-10-12 10:50:12,720 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 10:50:38,078 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1032831.3333333334, ans=0.125 2023-10-12 10:50:58,266 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 10:50:58,316 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1032924.6666666666, ans=0.125 2023-10-12 10:50:59,970 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.704e+02 1.886e+02 2.109e+02 2.690e+02, threshold=3.773e+02, percent-clipped=0.0 2023-10-12 10:51:18,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1032971.3333333334, ans=0.125 2023-10-12 10:51:28,585 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1033018.0, ans=0.0 2023-10-12 10:51:31,349 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1033018.0, ans=0.07 2023-10-12 10:51:31,386 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1033018.0, ans=0.125 2023-10-12 10:51:57,899 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1033111.3333333334, ans=0.2 2023-10-12 10:52:09,834 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.86 vs. limit=10.0 2023-10-12 10:52:13,880 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.92 vs. 
limit=15.0 2023-10-12 10:52:32,522 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1033251.3333333334, ans=0.1 2023-10-12 10:52:32,803 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.81 vs. limit=15.0 2023-10-12 10:52:37,515 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.11 vs. limit=22.5 2023-10-12 10:52:41,032 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=1033298.0, ans=0.95 2023-10-12 10:52:45,368 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.75 vs. limit=15.0 2023-10-12 10:52:49,428 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.79 vs. limit=6.0 2023-10-12 10:52:51,308 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.85 vs. limit=22.5 2023-10-12 10:52:59,754 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.404e+02 1.761e+02 1.991e+02 2.289e+02 3.056e+02, threshold=3.982e+02, percent-clipped=0.0 2023-10-12 10:53:02,917 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1033391.3333333334, ans=0.125 2023-10-12 10:53:27,046 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1033484.6666666666, ans=0.2 2023-10-12 10:53:30,320 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1033531.3333333334, ans=0.125 2023-10-12 10:53:34,024 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1033531.3333333334, ans=0.1 2023-10-12 10:53:35,049 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1033531.3333333334, ans=0.125 2023-10-12 10:53:40,317 INFO [train.py:1031] (0/4) Epoch 17, batch 3000, loss[loss=0.1866, simple_loss=0.2813, pruned_loss=0.04597, over 16915.00 frames. ], tot_loss[loss=0.1937, simple_loss=0.2842, pruned_loss=0.05166, over 25505209.77 frames. ], batch size: 165, lr: 2.07e-03, grad_scale: 32.0 2023-10-12 10:53:43,029 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1033578.0, ans=0.125 2023-10-12 10:53:43,896 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1033578.0, ans=0.125 2023-10-12 10:53:52,980 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1033624.6666666666, ans=0.0 2023-10-12 10:54:00,721 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1033671.3333333334, ans=0.0 2023-10-12 10:54:28,514 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.10 vs. 
limit=10.0 2023-10-12 10:54:29,441 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1033764.6666666666, ans=0.0 2023-10-12 10:54:53,747 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.691e+02 1.872e+02 2.145e+02 3.353e+02, threshold=3.744e+02, percent-clipped=0.0 2023-10-12 10:54:55,167 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.26 vs. limit=15.0 2023-10-12 10:55:17,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1033951.3333333334, ans=0.2 2023-10-12 10:55:19,521 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.21 vs. limit=12.0 2023-10-12 10:55:37,188 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1034044.6666666666, ans=0.0 2023-10-12 10:55:45,049 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1034044.6666666666, ans=0.04949747468305833 2023-10-12 10:56:07,345 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1034138.0, ans=0.1 2023-10-12 10:56:11,921 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.26 vs. limit=15.0 2023-10-12 10:56:12,988 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1034184.6666666666, ans=0.07 2023-10-12 10:56:46,004 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1034324.6666666666, ans=0.125 2023-10-12 10:56:49,386 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.763e+02 1.878e+02 2.177e+02 2.913e+02, threshold=3.756e+02, percent-clipped=0.0 2023-10-12 10:57:01,741 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 10:57:03,640 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1034371.3333333334, ans=0.0 2023-10-12 10:57:24,803 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1034464.6666666666, ans=0.125 2023-10-12 10:57:26,722 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1034464.6666666666, ans=0.125 2023-10-12 10:57:30,953 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1034511.3333333334, ans=0.0 2023-10-12 10:57:31,967 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1034511.3333333334, ans=0.125 2023-10-12 10:57:33,762 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1034511.3333333334, ans=0.0 2023-10-12 10:57:34,617 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1034511.3333333334, ans=0.125 2023-10-12 10:57:41,376 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1034558.0, ans=0.0 2023-10-12 10:57:55,635 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1034604.6666666666, ans=0.1 2023-10-12 10:58:07,631 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.20 vs. limit=10.0 2023-10-12 10:58:15,349 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.56 vs. limit=15.0 2023-10-12 10:58:16,892 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.92 vs. limit=12.0 2023-10-12 10:58:34,630 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.75 vs. limit=12.0 2023-10-12 10:58:54,004 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.784e+02 2.025e+02 2.309e+02 3.954e+02, threshold=4.051e+02, percent-clipped=1.0 2023-10-12 10:58:57,922 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1034791.3333333334, ans=0.125 2023-10-12 10:59:02,879 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1034838.0, ans=0.2 2023-10-12 10:59:09,587 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1034838.0, ans=0.04949747468305833 2023-10-12 10:59:16,767 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.38 vs. limit=15.0 2023-10-12 10:59:17,967 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 10:59:51,553 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.58 vs. limit=15.0 2023-10-12 11:00:01,628 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1035071.3333333334, ans=0.125 2023-10-12 11:00:14,283 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1035118.0, ans=0.125 2023-10-12 11:00:50,057 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.510e+02 1.759e+02 1.884e+02 2.050e+02 3.041e+02, threshold=3.769e+02, percent-clipped=0.0 2023-10-12 11:00:53,952 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.72 vs. limit=12.0 2023-10-12 11:00:58,410 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1035304.6666666666, ans=0.1 2023-10-12 11:01:09,324 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.18 vs. limit=15.0 2023-10-12 11:01:09,394 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.75 vs. 
limit=22.5 2023-10-12 11:01:27,005 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.46 vs. limit=15.0 2023-10-12 11:01:42,029 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1035444.6666666666, ans=0.125 2023-10-12 11:01:46,809 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1035491.3333333334, ans=0.1 2023-10-12 11:01:55,797 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1035538.0, ans=0.1 2023-10-12 11:01:57,803 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1035538.0, ans=0.07 2023-10-12 11:01:59,305 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_na.min_abs, batch_count=1035538.0, ans=0.02 2023-10-12 11:02:21,237 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1035631.3333333334, ans=0.0 2023-10-12 11:02:23,503 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1035631.3333333334, ans=0.125 2023-10-12 11:02:29,227 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1035678.0, ans=0.125 2023-10-12 11:02:33,155 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1035678.0, ans=0.0 2023-10-12 11:02:34,531 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.09 vs. limit=22.5 2023-10-12 11:02:43,837 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.390e+02 1.708e+02 1.895e+02 2.107e+02 2.728e+02, threshold=3.790e+02, percent-clipped=0.0 2023-10-12 11:02:52,117 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1035771.3333333334, ans=0.2 2023-10-12 11:03:08,265 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1035818.0, ans=0.1 2023-10-12 11:03:08,503 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=22.5 2023-10-12 11:03:08,687 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.76 vs. limit=15.0 2023-10-12 11:03:12,343 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1035864.6666666666, ans=0.0 2023-10-12 11:03:16,861 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1035864.6666666666, ans=0.2 2023-10-12 11:03:24,196 INFO [train.py:1031] (0/4) Epoch 17, batch 3500, loss[loss=0.2001, simple_loss=0.2951, pruned_loss=0.05256, over 16948.00 frames. ], tot_loss[loss=0.1936, simple_loss=0.2839, pruned_loss=0.05163, over 27091805.00 frames. 
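The batch 3500 status record above reports loss, simple_loss and pruned_loss side by side, and the numbers are consistent with the headline loss being simple_loss_scale * simple_loss + pruned_loss, with simple_loss_scale = 0.5 taken from the config dump at the start of this log (tot_loss looks like a decayed running aggregate over the trailing frames, which the sketch leaves out). The lr field is likewise consistent with icefall's Eden schedule given base_lr = 0.045, lr_batches = 7500 and lr_epochs = 1.0 from the same dump. A minimal sketch of both pieces of arithmetic; the function names are ours, and the Eden form is our reading of icefall's optim.py:

    # Sketch only: reconstructs the arithmetic behind the status records.
    def combined_loss(simple, pruned, simple_loss_scale=0.5):
        # 0.5 * simple_loss + pruned_loss reproduces the logged values,
        # e.g. 0.5 * 0.2951 + 0.05256 = 0.2001 for the batch 3500 record.
        return simple_loss_scale * simple + pruned

    def eden_lr(batch, epoch, base_lr=0.045, lr_batches=7500.0, lr_epochs=1.0):
        # Eden schedule as we read it from icefall's optim.py (an assumption):
        # a soft power-law decay in both the batch index and the epoch count.
        return (base_lr
                * ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
                * ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25)

    assert abs(combined_loss(0.2951, 0.05256) - 0.2001) < 5e-4
    # batch_idx_train ~ 222000 here, inferred from the checkpoint-224000.pt
    # save at batch 5500 later in this log; epoch taken as 16 completed epochs.
    print(round(eden_lr(batch=222000, epoch=16.0), 5))  # ~0.00207, matching lr: 2.07e-03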
], batch size: 138, lr: 2.07e-03, grad_scale: 16.0 2023-10-12 11:03:38,051 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1035958.0, ans=0.125 2023-10-12 11:03:38,086 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1035958.0, ans=0.125 2023-10-12 11:03:51,730 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1036004.6666666666, ans=0.125 2023-10-12 11:03:54,712 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1036004.6666666666, ans=0.125 2023-10-12 11:04:25,577 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1036144.6666666666, ans=0.125 2023-10-12 11:04:34,632 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.764e+02 2.002e+02 2.247e+02 3.370e+02, threshold=4.004e+02, percent-clipped=0.0 2023-10-12 11:04:35,406 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1036191.3333333334, ans=0.0 2023-10-12 11:04:48,253 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1036238.0, ans=0.0 2023-10-12 11:04:49,713 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1036238.0, ans=0.125 2023-10-12 11:05:30,544 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1036424.6666666666, ans=0.1 2023-10-12 11:05:32,112 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1036424.6666666666, ans=0.125 2023-10-12 11:05:47,760 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1036471.3333333334, ans=0.0 2023-10-12 11:05:49,506 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1036471.3333333334, ans=0.0 2023-10-12 11:06:06,747 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1036564.6666666666, ans=0.0 2023-10-12 11:06:38,673 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1036611.3333333334, ans=0.0 2023-10-12 11:06:48,334 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.762e+02 1.985e+02 2.223e+02 3.035e+02, threshold=3.970e+02, percent-clipped=0.0 2023-10-12 11:06:58,544 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.02 vs. limit=15.0 2023-10-12 11:07:11,729 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.19 vs. 
limit=15.0 2023-10-12 11:07:12,462 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1036751.3333333334, ans=0.1 2023-10-12 11:07:14,670 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1036798.0, ans=0.125 2023-10-12 11:07:18,865 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1036798.0, ans=0.0 2023-10-12 11:07:27,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1036844.6666666666, ans=0.07 2023-10-12 11:07:36,313 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1036844.6666666666, ans=0.0 2023-10-12 11:08:15,944 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1036984.6666666666, ans=0.0 2023-10-12 11:08:34,966 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.44 vs. limit=15.0 2023-10-12 11:08:48,826 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.673e+02 1.803e+02 1.953e+02 3.169e+02, threshold=3.606e+02, percent-clipped=0.0 2023-10-12 11:08:57,535 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1037171.3333333334, ans=0.125 2023-10-12 11:09:00,876 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1037171.3333333334, ans=0.2 2023-10-12 11:09:12,318 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1037218.0, ans=0.125 2023-10-12 11:09:21,451 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1037264.6666666666, ans=0.0 2023-10-12 11:09:32,133 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1037311.3333333334, ans=0.1 2023-10-12 11:09:34,323 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.23 vs. 
limit=22.5 2023-10-12 11:09:38,964 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1037311.3333333334, ans=0.0 2023-10-12 11:09:53,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1037404.6666666666, ans=0.2 2023-10-12 11:10:05,480 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1037451.3333333334, ans=0.125 2023-10-12 11:10:06,377 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1037451.3333333334, ans=0.0 2023-10-12 11:10:10,998 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1037451.3333333334, ans=0.0 2023-10-12 11:10:11,810 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1037451.3333333334, ans=0.125 2023-10-12 11:10:20,912 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1037498.0, ans=0.125 2023-10-12 11:10:21,228 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.39 vs. limit=15.0 2023-10-12 11:10:31,243 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1037544.6666666666, ans=0.1 2023-10-12 11:10:36,479 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.58 vs. limit=15.0 2023-10-12 11:10:41,805 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.55 vs. limit=15.0 2023-10-12 11:10:43,116 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.656e+02 1.859e+02 2.018e+02 3.077e+02, threshold=3.719e+02, percent-clipped=0.0 2023-10-12 11:11:18,532 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.43 vs. limit=6.0 2023-10-12 11:11:20,897 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.23 vs. limit=6.0 2023-10-12 11:11:22,745 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.57 vs. limit=15.0 2023-10-12 11:11:29,128 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1037778.0, ans=0.0 2023-10-12 11:11:44,593 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1037871.3333333334, ans=0.125 2023-10-12 11:11:53,121 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.57 vs. 
limit=22.5 2023-10-12 11:12:03,714 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1037964.6666666666, ans=0.125 2023-10-12 11:12:05,932 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1037964.6666666666, ans=0.125 2023-10-12 11:12:08,021 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=1037964.6666666666, ans=10.0 2023-10-12 11:12:09,289 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.78 vs. limit=15.0 2023-10-12 11:12:10,949 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1037964.6666666666, ans=0.1 2023-10-12 11:12:14,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1038011.3333333334, ans=0.0 2023-10-12 11:12:31,150 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.502e+02 1.656e+02 1.834e+02 2.040e+02 2.876e+02, threshold=3.667e+02, percent-clipped=0.0 2023-10-12 11:12:36,520 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1038104.6666666666, ans=0.125 2023-10-12 11:12:52,882 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1038151.3333333334, ans=0.2 2023-10-12 11:13:03,321 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1038198.0, ans=0.0 2023-10-12 11:13:09,857 INFO [train.py:1031] (0/4) Epoch 17, batch 4000, loss[loss=0.1861, simple_loss=0.2887, pruned_loss=0.04173, over 16597.00 frames. ], tot_loss[loss=0.1935, simple_loss=0.2836, pruned_loss=0.0517, over 28358567.08 frames. ], batch size: 219, lr: 2.06e-03, grad_scale: 32.0 2023-10-12 11:13:29,807 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1038291.3333333334, ans=0.125 2023-10-12 11:13:49,268 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1038384.6666666666, ans=0.125 2023-10-12 11:13:54,035 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.00 vs. 
limit=15.0 2023-10-12 11:14:13,221 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1038478.0, ans=15.0 2023-10-12 11:14:21,837 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1038524.6666666666, ans=0.2 2023-10-12 11:14:25,437 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.041e-02 2023-10-12 11:14:26,133 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.506e+02 1.827e+02 2.015e+02 2.319e+02 3.129e+02, threshold=4.030e+02, percent-clipped=0.0 2023-10-12 11:14:34,714 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1038571.3333333334, ans=0.0 2023-10-12 11:14:40,153 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1038571.3333333334, ans=0.0 2023-10-12 11:14:45,457 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1038618.0, ans=0.125 2023-10-12 11:15:38,984 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1038804.6666666666, ans=0.125 2023-10-12 11:16:23,725 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.747e+02 1.921e+02 2.112e+02 3.895e+02, threshold=3.842e+02, percent-clipped=0.0 2023-10-12 11:16:30,607 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1039038.0, ans=0.1 2023-10-12 11:16:46,414 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1039084.6666666666, ans=0.125 2023-10-12 11:17:15,224 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 11:17:31,590 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1039224.6666666666, ans=0.125 2023-10-12 11:17:31,926 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.73 vs. 
limit=6.0 2023-10-12 11:17:38,102 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1039271.3333333334, ans=0.125 2023-10-12 11:17:42,157 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1039271.3333333334, ans=0.125 2023-10-12 11:17:47,322 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1039318.0, ans=0.125 2023-10-12 11:17:53,448 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1039318.0, ans=0.125 2023-10-12 11:17:59,048 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1039364.6666666666, ans=0.2 2023-10-12 11:18:17,781 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1039458.0, ans=0.125 2023-10-12 11:18:22,445 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1039458.0, ans=0.1 2023-10-12 11:18:24,123 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.699e+02 1.826e+02 2.016e+02 3.213e+02, threshold=3.653e+02, percent-clipped=0.0 2023-10-12 11:18:35,349 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1039504.6666666666, ans=0.125 2023-10-12 11:18:45,738 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1039551.3333333334, ans=0.125 2023-10-12 11:18:52,281 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.51 vs. limit=15.0 2023-10-12 11:18:52,904 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1039598.0, ans=0.09899494936611666 2023-10-12 11:18:59,632 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1039644.6666666666, ans=0.1 2023-10-12 11:18:59,711 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1039644.6666666666, ans=0.1 2023-10-12 11:19:05,345 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1039644.6666666666, ans=0.125 2023-10-12 11:19:16,115 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1039691.3333333334, ans=0.0 2023-10-12 11:19:30,623 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1039784.6666666666, ans=0.2 2023-10-12 11:19:45,139 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1039831.3333333334, ans=0.0 2023-10-12 11:19:45,547 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.84 vs. 
limit=12.0 2023-10-12 11:19:52,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1039878.0, ans=0.125 2023-10-12 11:20:01,180 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1039878.0, ans=0.125 2023-10-12 11:20:08,545 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 11:20:11,822 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.518e+02 1.905e+02 2.123e+02 2.459e+02 3.376e+02, threshold=4.245e+02, percent-clipped=0.0 2023-10-12 11:20:16,832 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1039971.3333333334, ans=0.125 2023-10-12 11:20:28,156 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1040018.0, ans=0.0 2023-10-12 11:20:32,250 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1040018.0, ans=0.5 2023-10-12 11:20:35,002 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1040018.0, ans=0.125 2023-10-12 11:20:50,125 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.95 vs. limit=22.5 2023-10-12 11:21:09,790 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1040158.0, ans=0.125 2023-10-12 11:21:14,420 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1040204.6666666666, ans=0.0 2023-10-12 11:21:16,270 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1040204.6666666666, ans=0.125 2023-10-12 11:21:30,819 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1040251.3333333334, ans=0.0 2023-10-12 11:21:30,911 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1040251.3333333334, ans=0.1 2023-10-12 11:21:56,276 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1040344.6666666666, ans=0.125 2023-10-12 11:22:11,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1040391.3333333334, ans=0.5 2023-10-12 11:22:11,477 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1040391.3333333334, ans=0.0 2023-10-12 11:22:11,479 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1040391.3333333334, ans=0.125 2023-10-12 11:22:16,442 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.458e+02 1.889e+02 2.051e+02 2.231e+02 3.377e+02, threshold=4.102e+02, percent-clipped=0.0 2023-10-12 11:22:19,777 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1040438.0, ans=0.0 2023-10-12 11:22:36,711 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, 
batch_count=1040484.6666666666, ans=0.125 2023-10-12 11:22:37,446 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1040484.6666666666, ans=0.0 2023-10-12 11:22:46,339 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.36 vs. limit=12.0 2023-10-12 11:22:51,829 INFO [train.py:1031] (0/4) Epoch 17, batch 4500, loss[loss=0.2014, simple_loss=0.2915, pruned_loss=0.05563, over 16986.00 frames. ], tot_loss[loss=0.1936, simple_loss=0.284, pruned_loss=0.05165, over 29334816.44 frames. ], batch size: 123, lr: 2.06e-03, grad_scale: 32.0 2023-10-12 11:23:06,146 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1040624.6666666666, ans=0.125 2023-10-12 11:23:29,900 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1040718.0, ans=0.95 2023-10-12 11:23:57,553 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1040858.0, ans=0.2 2023-10-12 11:24:01,999 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.397e+02 1.698e+02 1.862e+02 2.005e+02 3.191e+02, threshold=3.725e+02, percent-clipped=0.0 2023-10-12 11:24:02,379 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1040858.0, ans=0.09899494936611666 2023-10-12 11:24:12,563 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1040904.6666666666, ans=0.2 2023-10-12 11:24:17,144 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1040951.3333333334, ans=0.0 2023-10-12 11:24:24,163 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1040951.3333333334, ans=0.0 2023-10-12 11:24:27,117 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1040998.0, ans=0.0 2023-10-12 11:24:44,986 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1041044.6666666666, ans=0.125 2023-10-12 11:24:56,219 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1041091.3333333334, ans=0.125 2023-10-12 11:25:00,937 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1041138.0, ans=0.2 2023-10-12 11:25:01,247 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=7.72 vs. 
limit=22.5 2023-10-12 11:25:04,913 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1041138.0, ans=0.2 2023-10-12 11:25:29,881 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1041278.0, ans=0.125 2023-10-12 11:25:36,203 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1041278.0, ans=0.2 2023-10-12 11:25:37,313 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1041278.0, ans=0.125 2023-10-12 11:25:48,564 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.335e+02 1.771e+02 1.978e+02 2.267e+02 3.246e+02, threshold=3.956e+02, percent-clipped=0.0 2023-10-12 11:26:01,929 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1041371.3333333334, ans=0.125 2023-10-12 11:26:10,202 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1041418.0, ans=0.0 2023-10-12 11:26:11,812 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1041418.0, ans=0.125 2023-10-12 11:26:40,965 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1041558.0, ans=0.1 2023-10-12 11:27:00,264 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1041651.3333333334, ans=0.0 2023-10-12 11:27:08,309 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1041698.0, ans=0.0 2023-10-12 11:27:09,150 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1041698.0, ans=0.1 2023-10-12 11:27:17,523 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1041744.6666666666, ans=0.0 2023-10-12 11:27:20,763 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1041744.6666666666, ans=0.125 2023-10-12 11:27:25,833 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1041744.6666666666, ans=0.0 2023-10-12 11:27:30,578 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.26 vs. 
limit=15.0 2023-10-12 11:27:36,044 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.478e+02 1.770e+02 1.956e+02 2.280e+02 3.716e+02, threshold=3.912e+02, percent-clipped=0.0 2023-10-12 11:27:51,015 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1041884.6666666666, ans=0.2 2023-10-12 11:27:51,844 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1041884.6666666666, ans=0.125 2023-10-12 11:28:14,450 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1041978.0, ans=0.04949747468305833 2023-10-12 11:28:15,385 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1041978.0, ans=0.125 2023-10-12 11:29:02,516 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_na.min_abs, batch_count=1042164.6666666666, ans=0.02 2023-10-12 11:29:24,595 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1042258.0, ans=0.1 2023-10-12 11:29:29,487 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.488e+02 1.687e+02 1.853e+02 2.066e+02 3.511e+02, threshold=3.706e+02, percent-clipped=0.0 2023-10-12 11:29:32,651 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1042304.6666666666, ans=0.1 2023-10-12 11:29:36,859 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1042304.6666666666, ans=0.125 2023-10-12 11:30:09,585 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 11:30:42,693 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1042584.6666666666, ans=0.125 2023-10-12 11:30:53,985 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1042631.3333333334, ans=0.07 2023-10-12 11:30:55,620 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 11:30:58,503 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1042631.3333333334, ans=0.125 2023-10-12 11:31:05,798 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1042678.0, ans=0.2 2023-10-12 11:31:22,478 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.710e+02 1.844e+02 2.083e+02 2.836e+02, threshold=3.689e+02, percent-clipped=0.0 2023-10-12 11:31:58,923 INFO [train.py:1031] (0/4) Epoch 17, batch 5000, loss[loss=0.1865, simple_loss=0.2722, pruned_loss=0.05041, over 16553.00 frames. ], tot_loss[loss=0.1937, simple_loss=0.2837, pruned_loss=0.05187, over 30081600.16 frames. 
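Most records in this stretch are ScheduledFloat lines: a named hyper-parameter (skip rates, balancer probabilities, bypass scale minima) whose current value "ans" is a function of batch_count. A piecewise-linear schedule over batch_count is consistent with what is logged; below is a minimal sketch with illustrative knots, not the ones used in this run (the real schedules are defined in zipformer's scaling.py):

    # Sketch of a float hyper-parameter scheduled on batch_count.
    import bisect

    class ScheduledFloatSketch:
        def __init__(self, *knots):
            # knots: (batch_count, value) pairs, sorted by batch_count
            self.x = [k[0] for k in knots]
            self.y = [k[1] for k in knots]

        def value(self, batch_count: float) -> float:
            if batch_count <= self.x[0]:
                return self.y[0]
            if batch_count >= self.x[-1]:
                return self.y[-1]
            i = bisect.bisect_right(self.x, batch_count)
            t = (batch_count - self.x[i - 1]) / (self.x[i] - self.x[i - 1])
            return self.y[i - 1] + t * (self.y[i] - self.y[i - 1])

    # e.g. a skip rate annealed from 0.2 to 0.0 over the first 4000 batches:
    skip_rate = ScheduledFloatSketch((0.0, 0.2), (4000.0, 0.0))
    print(skip_rate.value(1042911.33))  # -> 0.0, as in the ans=0.0 records above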
], batch size: 61, lr: 2.06e-03, grad_scale: 32.0 2023-10-12 11:32:08,085 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1042911.3333333334, ans=0.125 2023-10-12 11:32:11,161 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1042958.0, ans=0.125 2023-10-12 11:32:12,387 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1042958.0, ans=0.2 2023-10-12 11:32:25,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1043004.6666666666, ans=0.1 2023-10-12 11:32:49,333 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1043098.0, ans=0.0 2023-10-12 11:32:51,109 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1043098.0, ans=0.125 2023-10-12 11:32:52,146 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1043098.0, ans=0.07 2023-10-12 11:33:00,606 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.19 vs. limit=15.0 2023-10-12 11:33:13,678 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 1.790e+02 1.935e+02 2.178e+02 3.079e+02, threshold=3.869e+02, percent-clipped=0.0 2023-10-12 11:33:17,126 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1043238.0, ans=0.125 2023-10-12 11:33:26,494 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1043284.6666666666, ans=0.125 2023-10-12 11:33:29,951 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1043284.6666666666, ans=0.125 2023-10-12 11:33:33,781 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1043284.6666666666, ans=0.07 2023-10-12 11:33:47,358 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1043331.3333333334, ans=0.125 2023-10-12 11:33:54,103 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1043331.3333333334, ans=0.0 2023-10-12 11:33:54,191 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.82 vs. limit=15.0 2023-10-12 11:34:00,257 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.01 vs. limit=15.0 2023-10-12 11:34:10,481 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1043424.6666666666, ans=0.0 2023-10-12 11:34:14,241 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.17 vs. 
limit=15.0 2023-10-12 11:34:18,788 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1043471.3333333334, ans=0.125 2023-10-12 11:34:29,164 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1043518.0, ans=0.125 2023-10-12 11:35:04,887 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1043658.0, ans=0.125 2023-10-12 11:35:11,306 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.481e+02 1.704e+02 1.934e+02 2.176e+02 3.621e+02, threshold=3.868e+02, percent-clipped=0.0 2023-10-12 11:35:15,202 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1043704.6666666666, ans=0.0 2023-10-12 11:35:36,328 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1043798.0, ans=0.0 2023-10-12 11:35:38,939 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.02 vs. limit=10.0 2023-10-12 11:36:00,139 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1043891.3333333334, ans=0.2 2023-10-12 11:36:21,508 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1043984.6666666666, ans=0.125 2023-10-12 11:36:24,180 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1043984.6666666666, ans=0.125 2023-10-12 11:36:31,635 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1044031.3333333334, ans=0.1 2023-10-12 11:36:52,482 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1044124.6666666666, ans=0.125 2023-10-12 11:37:01,457 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.727e+02 1.868e+02 2.066e+02 2.891e+02, threshold=3.736e+02, percent-clipped=0.0 2023-10-12 11:37:09,730 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1044171.3333333334, ans=0.125 2023-10-12 11:37:14,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1044218.0, ans=0.2 2023-10-12 11:37:18,642 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.17 vs. 
limit=15.0 2023-10-12 11:37:30,434 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1044264.6666666666, ans=0.0 2023-10-12 11:37:34,300 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1044264.6666666666, ans=0.0 2023-10-12 11:37:54,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1044358.0, ans=0.0 2023-10-12 11:37:56,212 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1044358.0, ans=0.125 2023-10-12 11:38:26,506 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1044498.0, ans=0.2 2023-10-12 11:38:35,624 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1044544.6666666666, ans=0.125 2023-10-12 11:38:56,152 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.399e+02 1.661e+02 1.857e+02 2.055e+02 2.845e+02, threshold=3.715e+02, percent-clipped=0.0 2023-10-12 11:39:00,178 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1044638.0, ans=0.125 2023-10-12 11:39:22,822 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1044731.3333333334, ans=0.1 2023-10-12 11:39:50,747 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.51 vs. limit=15.0 2023-10-12 11:40:15,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1044964.6666666666, ans=0.125 2023-10-12 11:40:17,265 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1044964.6666666666, ans=0.125 2023-10-12 11:40:42,478 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.383e+02 1.700e+02 1.905e+02 2.254e+02 3.297e+02, threshold=3.810e+02, percent-clipped=0.0 2023-10-12 11:40:48,599 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1045104.6666666666, ans=0.1 2023-10-12 11:40:52,727 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1045151.3333333334, ans=0.125 2023-10-12 11:41:11,976 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1045198.0, ans=0.125 2023-10-12 11:41:17,187 INFO [train.py:1031] (0/4) Epoch 17, batch 5500, loss[loss=0.1966, simple_loss=0.2907, pruned_loss=0.05129, over 16961.00 frames. ], tot_loss[loss=0.1935, simple_loss=0.2836, pruned_loss=0.05171, over 30676976.38 frames. 
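The recurring optim.py lines summarize gradient clipping: five quantile points of recently observed gradient norms, plus a threshold that in every record here equals Clipping_scale times the median (for instance 2.0 * 1.905e+02 = 3.810e+02 in the record just above), with gradients whose norm exceeds it counted in percent-clipped. A sketch of that bookkeeping, assuming a sliding window of norms; the window length and the exact clipping rule in icefall's optim.py are assumptions:

    import collections
    import torch

    norm_history = collections.deque(maxlen=128)  # window length is an assumption

    def clip_by_median(grads, clipping_scale=2.0):
        # Global norm over all gradient tensors of the model.
        norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
        norm_history.append(float(norm))
        # The five logged numbers: min, lower quartile, median, upper quartile, max.
        q = torch.quantile(torch.tensor(list(norm_history)),
                           torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = clipping_scale * float(q[2])  # 2.0 * median, as logged
        if float(norm) > threshold:               # would count toward percent-clipped
            for g in grads:
                g.mul_(threshold / norm)
        return q.tolist(), threshold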
], batch size: 82, lr: 2.06e-03, grad_scale: 16.0 2023-10-12 11:41:30,475 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1045291.3333333334, ans=0.0 2023-10-12 11:41:36,187 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-224000.pt 2023-10-12 11:41:40,996 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1045338.0, ans=0.125 2023-10-12 11:41:44,525 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1045338.0, ans=0.5 2023-10-12 11:41:45,402 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1045338.0, ans=0.125 2023-10-12 11:41:45,471 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1045338.0, ans=0.125 2023-10-12 11:41:58,790 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1045384.6666666666, ans=0.04949747468305833 2023-10-12 11:41:58,914 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1045384.6666666666, ans=0.07 2023-10-12 11:42:25,587 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1045524.6666666666, ans=0.0 2023-10-12 11:42:26,391 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1045524.6666666666, ans=0.125 2023-10-12 11:42:34,492 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.425e+02 1.668e+02 1.785e+02 1.937e+02 2.713e+02, threshold=3.570e+02, percent-clipped=0.0 2023-10-12 11:42:58,596 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 11:43:00,499 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.68 vs. limit=22.5 2023-10-12 11:43:07,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1045711.3333333334, ans=0.1 2023-10-12 11:43:21,737 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1045758.0, ans=0.0 2023-10-12 11:43:25,273 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1045758.0, ans=0.1 2023-10-12 11:43:34,905 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1045804.6666666666, ans=0.1 2023-10-12 11:43:34,976 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1045804.6666666666, ans=0.0 2023-10-12 11:43:35,776 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1045804.6666666666, ans=0.0 2023-10-12 11:43:46,756 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=18.00 vs. 
limit=22.5 2023-10-12 11:43:58,757 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1045944.6666666666, ans=10.0 2023-10-12 11:44:00,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1045944.6666666666, ans=0.2 2023-10-12 11:44:03,517 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1045944.6666666666, ans=0.125 2023-10-12 11:44:07,702 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1045944.6666666666, ans=0.2 2023-10-12 11:44:12,274 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1045991.3333333334, ans=0.07 2023-10-12 11:44:22,613 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.484e+02 1.722e+02 1.901e+02 2.122e+02 2.617e+02, threshold=3.802e+02, percent-clipped=0.0 2023-10-12 11:44:23,244 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1046038.0, ans=0.125 2023-10-12 11:44:40,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1046084.6666666666, ans=0.05 2023-10-12 11:44:41,769 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.30 vs. limit=15.0 2023-10-12 11:44:58,309 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1046178.0, ans=0.125 2023-10-12 11:45:06,393 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1046178.0, ans=0.2 2023-10-12 11:45:10,148 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1046224.6666666666, ans=0.2 2023-10-12 11:45:22,240 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.09 vs. 
limit=10.0 2023-10-12 11:45:32,614 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1046318.0, ans=0.125 2023-10-12 11:45:36,531 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1046318.0, ans=0.2 2023-10-12 11:45:56,295 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1046411.3333333334, ans=0.125 2023-10-12 11:46:04,344 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 11:46:14,657 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.467e+02 1.768e+02 1.984e+02 2.257e+02 2.986e+02, threshold=3.967e+02, percent-clipped=0.0 2023-10-12 11:46:16,110 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1046504.6666666666, ans=0.0 2023-10-12 11:46:19,036 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1046504.6666666666, ans=0.125 2023-10-12 11:46:25,901 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1046551.3333333334, ans=0.125 2023-10-12 11:46:30,751 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.54 vs. limit=15.0 2023-10-12 11:46:49,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1046644.6666666666, ans=0.125 2023-10-12 11:47:05,408 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1046691.3333333334, ans=0.09899494936611666 2023-10-12 11:47:24,471 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1046784.6666666666, ans=0.1 2023-10-12 11:47:25,875 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1046784.6666666666, ans=0.125 2023-10-12 11:47:38,123 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.86 vs. limit=15.0 2023-10-12 11:47:40,595 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1046831.3333333334, ans=0.125 2023-10-12 11:47:46,256 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1046878.0, ans=0.125 2023-10-12 11:47:48,774 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.13 vs. 
limit=15.0 2023-10-12 11:47:57,559 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1046924.6666666666, ans=0.125 2023-10-12 11:48:02,662 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1046924.6666666666, ans=0.0 2023-10-12 11:48:08,819 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.488e+02 1.810e+02 2.041e+02 2.357e+02 3.009e+02, threshold=4.081e+02, percent-clipped=0.0 2023-10-12 11:48:13,519 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1046971.3333333334, ans=0.1 2023-10-12 11:48:19,239 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.40 vs. limit=8.0 2023-10-12 11:48:57,533 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1047158.0, ans=0.1 2023-10-12 11:49:25,621 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1047251.3333333334, ans=0.0 2023-10-12 11:49:33,817 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.39 vs. limit=15.0 2023-10-12 11:49:41,831 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.85 vs. limit=15.0 2023-10-12 11:49:45,886 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1047344.6666666666, ans=0.125 2023-10-12 11:49:49,673 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1047344.6666666666, ans=0.2 2023-10-12 11:49:56,358 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1047391.3333333334, ans=0.125 2023-10-12 11:50:03,788 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1047391.3333333334, ans=0.125 2023-10-12 11:50:07,125 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.789e+02 1.936e+02 2.151e+02 3.140e+02, threshold=3.871e+02, percent-clipped=0.0 2023-10-12 11:50:07,691 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.60 vs. limit=15.0 2023-10-12 11:50:28,067 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.40 vs. limit=15.0 2023-10-12 11:50:29,129 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.31 vs. limit=15.0 2023-10-12 11:50:35,151 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1047531.3333333334, ans=0.125 2023-10-12 11:50:36,624 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1047531.3333333334, ans=0.0 2023-10-12 11:50:39,718 INFO [train.py:1031] (0/4) Epoch 17, batch 6000, loss[loss=0.2167, simple_loss=0.2958, pruned_loss=0.06882, over 15991.00 frames. 
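The Whitening lines fire when a module's activations drift away from a white (identity-like) channel covariance: the logged metric is a whiteness statistic compared against a scheduled limit. One standard statistic with the right behavior is num_channels * trace(C @ C) / trace(C) ** 2, which equals 1.0 exactly when the covariance C is a multiple of the identity and grows as the spectrum spreads out; whether scaling.py computes precisely this form is an assumption. A sketch:

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
        # x: (num_frames, num_channels); channels are split into groups,
        # mirroring the num_groups/num_channels fields in the records above.
        (_, num_channels) = x.shape
        per_group = num_channels // num_groups
        metrics = []
        for g in range(num_groups):
            xg = x[:, g * per_group:(g + 1) * per_group]
            cov = (xg.T @ xg) / xg.shape[0]  # (per_group, per_group) covariance
            # d * trace(C^2) / trace(C)^2 >= 1, with equality iff C is white.
            metrics.append(per_group * (cov @ cov).diagonal().sum()
                           / cov.diagonal().sum() ** 2)
        return torch.stack(metrics).mean().item()

    print(whitening_metric(torch.randn(1000, 384)))  # ~1.4 for near-white input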
], tot_loss[loss=0.1941, simple_loss=0.284, pruned_loss=0.05206, over 31122678.08 frames. ], batch size: 296, lr: 2.05e-03, grad_scale: 32.0 2023-10-12 11:50:59,122 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1047624.6666666666, ans=0.0 2023-10-12 11:51:15,620 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1047718.0, ans=0.125 2023-10-12 11:51:19,146 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.05 vs. limit=6.0 2023-10-12 11:51:37,917 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1047811.3333333334, ans=0.0 2023-10-12 11:51:43,930 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.56 vs. limit=22.5 2023-10-12 11:51:46,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1047811.3333333334, ans=0.1 2023-10-12 11:51:58,359 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1047858.0, ans=0.125 2023-10-12 11:51:59,283 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1047904.6666666666, ans=0.125 2023-10-12 11:51:59,331 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1047904.6666666666, ans=0.125 2023-10-12 11:52:00,707 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.445e+02 1.730e+02 1.930e+02 2.198e+02 2.972e+02, threshold=3.861e+02, percent-clipped=0.0 2023-10-12 11:52:07,501 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 11:52:27,037 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1047998.0, ans=0.125 2023-10-12 11:52:56,266 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1048138.0, ans=0.125 2023-10-12 11:53:03,811 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1048184.6666666666, ans=0.125 2023-10-12 11:53:35,898 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1048324.6666666666, ans=0.1 2023-10-12 11:53:45,451 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1048324.6666666666, ans=0.125 2023-10-12 11:53:48,752 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 1.745e+02 1.902e+02 2.152e+02 3.327e+02, threshold=3.805e+02, percent-clipped=0.0 2023-10-12 11:53:59,284 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.65 vs. 
limit=15.0 2023-10-12 11:54:04,507 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1048418.0, ans=0.0 2023-10-12 11:54:07,937 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-12 11:54:26,406 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.82 vs. limit=15.0 2023-10-12 11:54:32,263 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1048558.0, ans=0.2 2023-10-12 11:54:55,366 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1048651.3333333333, ans=0.125 2023-10-12 11:54:56,577 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1048651.3333333333, ans=0.1 2023-10-12 11:55:18,959 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.71 vs. limit=12.0 2023-10-12 11:55:20,497 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.68 vs. limit=15.0 2023-10-12 11:55:21,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1048744.6666666667, ans=0.0 2023-10-12 11:55:23,409 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1048744.6666666667, ans=0.1 2023-10-12 11:55:31,262 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1048791.3333333333, ans=0.07 2023-10-12 11:55:34,428 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1048791.3333333333, ans=0.125 2023-10-12 11:55:34,508 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1048791.3333333333, ans=0.0 2023-10-12 11:55:40,230 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.512e+02 1.786e+02 1.934e+02 2.131e+02 2.678e+02, threshold=3.869e+02, percent-clipped=0.0 2023-10-12 11:55:40,863 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.04 vs. 
limit=6.0 2023-10-12 11:55:45,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1048838.0, ans=0.125 2023-10-12 11:56:16,366 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1048978.0, ans=0.125 2023-10-12 11:56:26,398 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1049024.6666666667, ans=0.125 2023-10-12 11:57:01,240 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1049164.6666666667, ans=0.125 2023-10-12 11:57:31,431 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1049258.0, ans=0.0 2023-10-12 11:57:34,084 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1049258.0, ans=0.125 2023-10-12 11:57:35,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1049258.0, ans=0.125 2023-10-12 11:57:40,147 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1049304.6666666667, ans=0.125 2023-10-12 11:57:42,131 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.763e+02 1.986e+02 2.211e+02 3.011e+02, threshold=3.972e+02, percent-clipped=0.0 2023-10-12 11:57:51,210 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1049304.6666666667, ans=0.125 2023-10-12 11:58:13,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1049398.0, ans=0.125 2023-10-12 11:58:15,895 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1049444.6666666667, ans=0.0 2023-10-12 11:58:16,820 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 11:58:20,035 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1049444.6666666667, ans=0.125 2023-10-12 11:58:22,908 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1049444.6666666667, ans=0.1 2023-10-12 11:58:23,687 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1049444.6666666667, ans=0.125 2023-10-12 11:58:25,571 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1049491.3333333333, ans=0.125 2023-10-12 11:58:32,503 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.52 vs. limit=15.0 2023-10-12 11:59:08,295 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.53 vs. 
limit=15.0 2023-10-12 11:59:14,463 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1049678.0, ans=0.0 2023-10-12 11:59:14,750 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.51 vs. limit=15.0 2023-10-12 11:59:16,440 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1049678.0, ans=0.125 2023-10-12 11:59:30,054 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1049771.3333333333, ans=0.1 2023-10-12 11:59:30,772 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.371e+02 1.662e+02 1.895e+02 2.171e+02 3.662e+02, threshold=3.791e+02, percent-clipped=0.0 2023-10-12 11:59:37,817 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.40 vs. limit=15.0 2023-10-12 11:59:43,789 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.28 vs. limit=22.5 2023-10-12 11:59:52,629 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.44 vs. limit=22.5 2023-10-12 11:59:58,589 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1049864.6666666667, ans=0.125 2023-10-12 12:00:02,989 INFO [train.py:1031] (0/4) Epoch 17, batch 6500, loss[loss=0.201, simple_loss=0.2961, pruned_loss=0.05295, over 16961.00 frames. ], tot_loss[loss=0.1944, simple_loss=0.2847, pruned_loss=0.05211, over 31524012.60 frames. ], batch size: 123, lr: 2.05e-03, grad_scale: 16.0 2023-10-12 12:00:19,554 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1049958.0, ans=0.0 2023-10-12 12:00:23,676 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1049958.0, ans=0.07 2023-10-12 12:01:10,949 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.09 vs. limit=15.0 2023-10-12 12:01:20,964 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1050144.6666666667, ans=0.09899494936611666 2023-10-12 12:01:29,813 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1050191.3333333333, ans=0.0 2023-10-12 12:01:36,619 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.751e+02 1.889e+02 2.086e+02 2.685e+02, threshold=3.778e+02, percent-clipped=0.0 2023-10-12 12:01:45,517 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.37 vs. limit=10.0 2023-10-12 12:01:52,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1050284.6666666667, ans=0.125 2023-10-12 12:02:02,721 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=6.86 vs. 
limit=15.0 2023-10-12 12:02:32,010 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.90 vs. limit=15.0 2023-10-12 12:02:36,812 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1050471.3333333333, ans=0.125 2023-10-12 12:02:51,276 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1050564.6666666667, ans=0.07 2023-10-12 12:02:54,452 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1050564.6666666667, ans=0.1 2023-10-12 12:03:04,920 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1050611.3333333333, ans=0.125 2023-10-12 12:03:14,174 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1050658.0, ans=0.2 2023-10-12 12:03:25,215 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1050704.6666666667, ans=0.0 2023-10-12 12:03:25,696 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.782e+02 1.944e+02 2.207e+02 3.492e+02, threshold=3.888e+02, percent-clipped=0.0 2023-10-12 12:03:53,581 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1050798.0, ans=0.125 2023-10-12 12:04:03,481 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1050844.6666666667, ans=0.125 2023-10-12 12:04:14,375 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1050891.3333333333, ans=15.0 2023-10-12 12:04:16,652 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1050938.0, ans=0.07 2023-10-12 12:04:17,689 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1050938.0, ans=0.125 2023-10-12 12:04:32,967 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1050984.6666666667, ans=0.1 2023-10-12 12:05:07,396 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.04 vs. 
limit=12.0 2023-10-12 12:05:14,980 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1051171.3333333333, ans=0.0 2023-10-12 12:05:15,920 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1051171.3333333333, ans=0.125 2023-10-12 12:05:17,933 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.618e+02 1.873e+02 2.123e+02 3.198e+02, threshold=3.747e+02, percent-clipped=0.0 2023-10-12 12:05:20,213 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1051171.3333333333, ans=0.0 2023-10-12 12:05:21,041 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1051171.3333333333, ans=0.125 2023-10-12 12:05:34,436 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1051218.0, ans=0.2 2023-10-12 12:05:39,287 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.17 vs. limit=22.5 2023-10-12 12:05:41,347 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1051264.6666666667, ans=0.1 2023-10-12 12:05:49,620 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1051311.3333333333, ans=0.125 2023-10-12 12:06:01,159 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.56 vs. limit=6.0 2023-10-12 12:06:18,644 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1051358.0, ans=0.125 2023-10-12 12:06:18,747 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1051358.0, ans=0.125 2023-10-12 12:06:22,834 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1051404.6666666667, ans=0.1 2023-10-12 12:06:25,751 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 12:07:18,670 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1051591.3333333333, ans=0.1 2023-10-12 12:07:35,978 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.55 vs. 
limit=6.0 2023-10-12 12:07:36,188 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.476e+02 1.673e+02 1.829e+02 2.080e+02 3.657e+02, threshold=3.658e+02, percent-clipped=0.0 2023-10-12 12:07:39,484 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1051638.0, ans=0.2 2023-10-12 12:07:42,211 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1051638.0, ans=0.1 2023-10-12 12:07:42,270 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1051638.0, ans=0.125 2023-10-12 12:07:49,381 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1051684.6666666667, ans=0.125 2023-10-12 12:08:26,217 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.36 vs. limit=6.0 2023-10-12 12:08:29,105 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1051871.3333333333, ans=0.125 2023-10-12 12:08:34,776 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1051871.3333333333, ans=0.125 2023-10-12 12:08:49,805 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1051918.0, ans=0.2 2023-10-12 12:08:52,073 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.55 vs. limit=22.5 2023-10-12 12:08:58,791 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1051964.6666666667, ans=0.125 2023-10-12 12:09:06,889 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1052011.3333333333, ans=0.125 2023-10-12 12:09:17,995 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.max_positive, batch_count=1052058.0, ans=0.95 2023-10-12 12:09:18,750 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 12:09:28,659 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 1.872e+02 2.247e+02 2.696e+02 3.848e+02, threshold=4.494e+02, percent-clipped=1.0 2023-10-12 12:09:28,824 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1052104.6666666667, ans=0.05 2023-10-12 12:09:31,607 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1052104.6666666667, ans=0.2 2023-10-12 12:09:31,632 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1052104.6666666667, ans=0.125 2023-10-12 12:09:36,140 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1052151.3333333333, ans=0.2 2023-10-12 12:09:44,278 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1052151.3333333333, ans=0.125 2023-10-12 12:09:48,628 INFO [scaling.py:979] (0/4) Whitening: 
name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.57 vs. limit=15.0 2023-10-12 12:09:56,562 INFO [train.py:1031] (0/4) Epoch 17, batch 7000, loss[loss=0.2191, simple_loss=0.3016, pruned_loss=0.06829, over 16571.00 frames. ], tot_loss[loss=0.1945, simple_loss=0.285, pruned_loss=0.05195, over 31817593.24 frames. ], batch size: 61, lr: 2.05e-03, grad_scale: 16.0 2023-10-12 12:10:17,253 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1052291.3333333333, ans=0.125 2023-10-12 12:10:26,335 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1052338.0, ans=0.0 2023-10-12 12:10:34,695 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.65 vs. limit=15.0 2023-10-12 12:10:45,147 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1052431.3333333333, ans=0.95 2023-10-12 12:10:56,814 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1052478.0, ans=0.125 2023-10-12 12:11:02,787 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1052478.0, ans=0.0 2023-10-12 12:11:04,771 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1052524.6666666667, ans=0.125 2023-10-12 12:11:19,370 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.704e+02 1.901e+02 2.069e+02 3.367e+02, threshold=3.801e+02, percent-clipped=0.0 2023-10-12 12:11:38,775 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1052664.6666666667, ans=0.125 2023-10-12 12:11:48,542 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1052711.3333333333, ans=0.0 2023-10-12 12:12:08,606 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.96 vs. 
limit=15.0 2023-10-12 12:12:10,281 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1052804.6666666667, ans=0.125 2023-10-12 12:12:18,874 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1052804.6666666667, ans=0.1 2023-10-12 12:12:21,946 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1052851.3333333333, ans=0.0 2023-10-12 12:12:26,297 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1052851.3333333333, ans=0.015 2023-10-12 12:12:34,437 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1052898.0, ans=0.015 2023-10-12 12:13:10,805 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.447e+02 1.833e+02 1.987e+02 2.292e+02 3.128e+02, threshold=3.974e+02, percent-clipped=0.0 2023-10-12 12:13:29,827 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1053131.3333333333, ans=0.0 2023-10-12 12:13:30,837 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 12:13:35,542 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1053131.3333333333, ans=0.05 2023-10-12 12:13:55,269 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.41 vs. limit=15.0 2023-10-12 12:14:03,267 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.33 vs. 
limit=15.0 2023-10-12 12:14:06,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1053224.6666666667, ans=0.0 2023-10-12 12:14:20,996 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 12:14:57,334 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1053411.3333333333, ans=0.125 2023-10-12 12:15:10,354 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1053458.0, ans=0.09899494936611666 2023-10-12 12:15:17,982 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.378e+02 1.704e+02 1.817e+02 1.965e+02 2.679e+02, threshold=3.633e+02, percent-clipped=0.0 2023-10-12 12:15:18,320 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1053504.6666666667, ans=0.125 2023-10-12 12:15:30,597 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1053551.3333333333, ans=0.0 2023-10-12 12:15:45,787 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1053598.0, ans=0.125 2023-10-12 12:15:46,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1053644.6666666667, ans=0.1 2023-10-12 12:15:58,018 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1053691.3333333333, ans=0.0 2023-10-12 12:16:05,383 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.85 vs. limit=15.0 2023-10-12 12:16:09,781 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 12:16:11,055 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1053738.0, ans=0.125 2023-10-12 12:16:15,048 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1053738.0, ans=0.125 2023-10-12 12:16:26,639 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1053784.6666666667, ans=0.125 2023-10-12 12:16:38,076 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.07 vs. limit=10.0 2023-10-12 12:16:52,511 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1053878.0, ans=0.2 2023-10-12 12:16:53,599 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1053878.0, ans=0.125 2023-10-12 12:17:07,845 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.50 vs. 
limit=22.5 2023-10-12 12:17:12,022 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.756e+02 1.893e+02 2.102e+02 2.919e+02, threshold=3.787e+02, percent-clipped=0.0 2023-10-12 12:17:26,350 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1054018.0, ans=0.2 2023-10-12 12:17:32,961 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1054064.6666666667, ans=0.0 2023-10-12 12:17:38,659 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1054064.6666666667, ans=0.125 2023-10-12 12:17:51,494 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1054158.0, ans=0.125 2023-10-12 12:17:58,357 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.27 vs. limit=15.0 2023-10-12 12:18:04,515 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1054204.6666666667, ans=0.0 2023-10-12 12:18:39,871 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1054344.6666666667, ans=0.0 2023-10-12 12:18:43,450 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1054344.6666666667, ans=0.2 2023-10-12 12:18:52,886 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1054391.3333333333, ans=0.0 2023-10-12 12:18:55,993 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.11 vs. limit=10.0 2023-10-12 12:18:59,586 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1054438.0, ans=0.0 2023-10-12 12:19:00,936 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten.whitening_limit, batch_count=1054438.0, ans=15.0 2023-10-12 12:19:01,227 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.762e+02 1.907e+02 2.113e+02 3.239e+02, threshold=3.814e+02, percent-clipped=0.0 2023-10-12 12:19:30,839 INFO [train.py:1031] (0/4) Epoch 17, batch 7500, loss[loss=0.1801, simple_loss=0.2754, pruned_loss=0.04237, over 16856.00 frames. ], tot_loss[loss=0.1944, simple_loss=0.2849, pruned_loss=0.05193, over 32030827.65 frames. 
], batch size: 104, lr: 2.05e-03, grad_scale: 16.0 2023-10-12 12:19:36,221 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1054578.0, ans=0.1 2023-10-12 12:19:41,768 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1054624.6666666667, ans=0.125 2023-10-12 12:20:28,376 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1054811.3333333333, ans=0.125 2023-10-12 12:20:28,386 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1054811.3333333333, ans=0.2 2023-10-12 12:20:39,612 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1054858.0, ans=0.125 2023-10-12 12:20:46,438 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.35 vs. limit=15.0 2023-10-12 12:20:50,758 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.45 vs. limit=22.5 2023-10-12 12:20:55,313 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.720e+02 1.911e+02 2.056e+02 2.961e+02, threshold=3.822e+02, percent-clipped=0.0 2023-10-12 12:21:11,362 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1054951.3333333333, ans=0.0 2023-10-12 12:21:16,142 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=1054998.0, ans=0.05 2023-10-12 12:21:34,617 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1055091.3333333333, ans=0.1 2023-10-12 12:21:35,067 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.43 vs. limit=15.0 2023-10-12 12:21:41,998 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1055091.3333333333, ans=0.125 2023-10-12 12:21:43,183 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. 
limit=6.0 2023-10-12 12:22:48,633 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1055324.6666666667, ans=0.0 2023-10-12 12:22:56,863 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.361e+02 1.746e+02 1.908e+02 2.213e+02 2.822e+02, threshold=3.815e+02, percent-clipped=0.0 2023-10-12 12:23:02,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=1055418.0, ans=10.0 2023-10-12 12:23:04,813 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1055418.0, ans=0.125 2023-10-12 12:23:04,846 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1055418.0, ans=0.125 2023-10-12 12:23:16,174 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1055464.6666666667, ans=0.1 2023-10-12 12:23:20,363 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1055464.6666666667, ans=0.1 2023-10-12 12:23:26,736 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1055511.3333333333, ans=0.1 2023-10-12 12:23:27,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1055511.3333333333, ans=0.125 2023-10-12 12:23:27,801 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1055511.3333333333, ans=0.125 2023-10-12 12:23:35,402 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1055511.3333333333, ans=0.125 2023-10-12 12:23:35,547 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1055511.3333333333, ans=0.2 2023-10-12 12:23:36,767 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.95 vs. limit=10.0 2023-10-12 12:23:48,127 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.13 vs. limit=12.0 2023-10-12 12:23:59,684 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.77 vs. 
limit=22.5 2023-10-12 12:24:23,133 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1055744.6666666667, ans=0.1 2023-10-12 12:24:33,731 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 12:24:38,126 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1055791.3333333333, ans=0.125 2023-10-12 12:24:38,960 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1055791.3333333333, ans=0.125 2023-10-12 12:24:41,875 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1055838.0, ans=0.0 2023-10-12 12:24:45,388 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1055838.0, ans=0.125 2023-10-12 12:24:45,555 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.13 vs. limit=15.0 2023-10-12 12:24:48,453 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.649e+02 1.833e+02 2.010e+02 2.977e+02, threshold=3.666e+02, percent-clipped=0.0 2023-10-12 12:24:59,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1055884.6666666667, ans=0.2 2023-10-12 12:25:07,971 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1055931.3333333333, ans=0.125 2023-10-12 12:25:12,817 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1055931.3333333333, ans=0.0 2023-10-12 12:25:38,693 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1056024.6666666667, ans=0.125 2023-10-12 12:25:46,346 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.32 vs. limit=10.0 2023-10-12 12:25:49,455 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1056071.3333333333, ans=0.0 2023-10-12 12:25:50,573 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1056071.3333333333, ans=0.125 2023-10-12 12:25:59,927 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.03 vs. limit=15.0 2023-10-12 12:26:07,776 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1056164.6666666667, ans=0.125 2023-10-12 12:26:15,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=1056164.6666666667, ans=0.02 2023-10-12 12:26:39,813 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.25 vs. 
limit=22.5 2023-10-12 12:26:45,409 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.480e+02 1.766e+02 1.924e+02 2.128e+02 2.966e+02, threshold=3.848e+02, percent-clipped=0.0 2023-10-12 12:27:30,476 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1056491.3333333333, ans=0.0 2023-10-12 12:27:31,583 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1056491.3333333333, ans=0.1 2023-10-12 12:27:36,388 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1056538.0, ans=0.2 2023-10-12 12:27:38,895 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1056538.0, ans=0.0 2023-10-12 12:27:48,831 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.33 vs. limit=22.5 2023-10-12 12:27:58,576 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1056584.6666666667, ans=0.0 2023-10-12 12:28:02,777 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.41 vs. limit=15.0 2023-10-12 12:28:28,659 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1056724.6666666667, ans=0.2 2023-10-12 12:28:35,454 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1056724.6666666667, ans=0.0 2023-10-12 12:28:43,604 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.338e+02 1.618e+02 1.746e+02 1.903e+02 2.676e+02, threshold=3.492e+02, percent-clipped=0.0 2023-10-12 12:29:02,249 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1056864.6666666667, ans=0.125 2023-10-12 12:29:05,092 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.13 vs. limit=15.0 2023-10-12 12:29:11,118 INFO [train.py:1031] (0/4) Epoch 17, batch 8000, loss[loss=0.1704, simple_loss=0.2752, pruned_loss=0.03285, over 16878.00 frames. ], tot_loss[loss=0.1935, simple_loss=0.2843, pruned_loss=0.05133, over 32213686.62 frames. ], batch size: 104, lr: 2.04e-03, grad_scale: 32.0 2023-10-12 12:29:11,486 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1056911.3333333333, ans=0.125 2023-10-12 12:29:19,461 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1056911.3333333333, ans=0.07 2023-10-12 12:29:25,873 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1056958.0, ans=0.125 2023-10-12 12:29:31,446 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1056958.0, ans=0.2 2023-10-12 12:29:33,924 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1057004.6666666667, ans=0.0 2023-10-12 12:29:35,460 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.89 vs. 
limit=15.0 2023-10-12 12:29:37,832 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1057004.6666666667, ans=0.0 2023-10-12 12:29:52,688 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1057051.3333333333, ans=0.125 2023-10-12 12:30:07,479 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 12:30:22,408 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1057191.3333333333, ans=0.125 2023-10-12 12:30:32,528 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.366e+02 1.608e+02 1.770e+02 1.954e+02 2.497e+02, threshold=3.541e+02, percent-clipped=0.0 2023-10-12 12:30:34,392 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1057238.0, ans=0.125 2023-10-12 12:30:41,507 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.48 vs. limit=8.0 2023-10-12 12:30:46,647 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1057284.6666666667, ans=0.1 2023-10-12 12:30:50,544 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.62 vs. limit=15.0 2023-10-12 12:30:53,053 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.16 vs. limit=15.0 2023-10-12 12:30:54,535 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1057331.3333333333, ans=0.0 2023-10-12 12:30:57,295 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1057331.3333333333, ans=0.2 2023-10-12 12:30:57,570 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1057331.3333333333, ans=15.0 2023-10-12 12:31:04,518 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1057378.0, ans=0.125 2023-10-12 12:31:35,893 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.01 vs. limit=15.0 2023-10-12 12:31:42,887 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1057564.6666666667, ans=0.0 2023-10-12 12:32:02,314 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1057611.3333333333, ans=0.125 2023-10-12 12:32:15,011 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1057658.0, ans=0.04949747468305833 2023-10-12 12:32:22,414 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1057658.0, ans=0.125 2023-10-12 12:32:22,724 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.08 vs. 
limit=22.5 2023-10-12 12:32:29,256 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.64 vs. limit=15.0 2023-10-12 12:32:36,717 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.479e+02 1.811e+02 1.950e+02 2.285e+02 3.064e+02, threshold=3.899e+02, percent-clipped=0.0 2023-10-12 12:32:38,429 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1057704.6666666667, ans=0.2 2023-10-12 12:32:39,735 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1057704.6666666667, ans=0.125 2023-10-12 12:32:48,636 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1057751.3333333333, ans=0.1 2023-10-12 12:32:52,238 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1057751.3333333333, ans=0.1 2023-10-12 12:32:53,163 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1057751.3333333333, ans=0.05 2023-10-12 12:32:56,250 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.25 vs. limit=15.0 2023-10-12 12:33:01,751 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.85 vs. limit=12.0 2023-10-12 12:33:20,729 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1057891.3333333333, ans=0.07 2023-10-12 12:33:32,193 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1057938.0, ans=0.125 2023-10-12 12:33:50,192 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1057984.6666666667, ans=0.0 2023-10-12 12:33:52,949 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1058031.3333333333, ans=0.1 2023-10-12 12:34:02,831 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.min_positive, batch_count=1058031.3333333333, ans=0.025 2023-10-12 12:34:22,747 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1058124.6666666667, ans=0.125 2023-10-12 12:34:33,238 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.767e+02 1.935e+02 2.120e+02 2.620e+02, threshold=3.870e+02, percent-clipped=0.0 2023-10-12 12:35:08,391 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1058311.3333333333, ans=0.2 2023-10-12 12:35:18,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1058358.0, ans=0.1 2023-10-12 12:35:22,558 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1058358.0, ans=0.0 2023-10-12 12:35:22,601 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1058358.0, ans=0.0 2023-10-12 
12:35:39,511 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.32 vs. limit=22.5 2023-10-12 12:35:40,562 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.49 vs. limit=22.5 2023-10-12 12:35:41,109 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1058451.3333333333, ans=0.125 2023-10-12 12:35:53,277 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1058498.0, ans=0.0 2023-10-12 12:35:58,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1058544.6666666667, ans=0.125 2023-10-12 12:36:05,975 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1058544.6666666667, ans=0.125 2023-10-12 12:36:09,597 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1058591.3333333333, ans=0.125 2023-10-12 12:36:12,072 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1058591.3333333333, ans=0.0 2023-10-12 12:36:17,025 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1058591.3333333333, ans=0.125 2023-10-12 12:36:25,756 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1058638.0, ans=0.125 2023-10-12 12:36:25,804 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1058638.0, ans=0.125 2023-10-12 12:36:27,749 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.755e+02 1.937e+02 2.086e+02 2.976e+02, threshold=3.874e+02, percent-clipped=0.0 2023-10-12 12:36:34,073 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1058684.6666666667, ans=0.125 2023-10-12 12:36:46,941 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1058731.3333333333, ans=0.0 2023-10-12 12:37:25,788 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1058871.3333333333, ans=0.0 2023-10-12 12:37:31,799 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1058918.0, ans=0.2 2023-10-12 12:37:32,727 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1058918.0, ans=0.0 2023-10-12 12:37:47,945 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=5.961e-02 2023-10-12 12:37:50,085 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1058964.6666666667, ans=0.125 2023-10-12 12:37:53,272 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1059011.3333333333, ans=0.125 2023-10-12 12:37:56,668 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1059011.3333333333, ans=0.2 2023-10-12 12:38:00,135 INFO 
[scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1059011.3333333333, ans=0.125 2023-10-12 12:38:08,606 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.98 vs. limit=15.0 2023-10-12 12:38:24,138 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.393e+02 1.742e+02 1.967e+02 2.236e+02 2.982e+02, threshold=3.934e+02, percent-clipped=0.0 2023-10-12 12:38:24,443 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1059104.6666666667, ans=0.2 2023-10-12 12:38:48,820 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.22 vs. limit=15.0 2023-10-12 12:38:54,323 INFO [train.py:1031] (0/4) Epoch 17, batch 8500, loss[loss=0.1978, simple_loss=0.285, pruned_loss=0.05529, over 16720.00 frames. ], tot_loss[loss=0.1934, simple_loss=0.2845, pruned_loss=0.05121, over 32341119.62 frames. ], batch size: 202, lr: 2.04e-03, grad_scale: 16.0 2023-10-12 12:38:56,446 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1059244.6666666667, ans=0.125 2023-10-12 12:38:56,724 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.96 vs. limit=15.0 2023-10-12 12:39:04,566 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.74 vs. limit=10.0 2023-10-12 12:39:15,454 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.00 vs. limit=22.5 2023-10-12 12:39:32,515 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.18 vs. limit=22.5 2023-10-12 12:39:45,764 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1059431.3333333333, ans=0.1 2023-10-12 12:39:47,607 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1059431.3333333333, ans=0.1 2023-10-12 12:40:23,040 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.834e+02 2.054e+02 2.316e+02 3.239e+02, threshold=4.109e+02, percent-clipped=0.0 2023-10-12 12:40:24,509 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1059618.0, ans=0.125 2023-10-12 12:40:45,092 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.95 vs. 
limit=15.0
2023-10-12 12:40:46,473 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-12 12:40:54,032 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1059711.3333333333, ans=0.125
2023-10-12 12:41:20,446 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1059804.6666666667, ans=0.125
2023-10-12 12:41:24,239 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1059804.6666666667, ans=0.125
2023-10-12 12:41:24,588 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.79 vs. limit=15.0
2023-10-12 12:41:33,665 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1059851.3333333333, ans=0.0
2023-10-12 12:41:44,585 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. limit=6.0
2023-10-12 12:41:45,944 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1059898.0, ans=0.0
2023-10-12 12:42:01,096 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1059944.6666666667, ans=0.125
2023-10-12 12:42:14,612 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1059991.3333333333, ans=0.2
2023-10-12 12:42:28,329 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.351e+02 1.669e+02 1.819e+02 1.997e+02 2.629e+02, threshold=3.638e+02, percent-clipped=0.0
2023-10-12 12:42:33,415 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1060084.6666666667, ans=0.0
2023-10-12 12:42:44,678 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=6.468e-02
2023-10-12 12:42:49,243 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1060131.3333333333, ans=0.125
2023-10-12 12:42:59,897 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1060178.0, ans=0.125
2023-10-12 12:43:05,004 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1060224.6666666667, ans=0.125
2023-10-12 12:43:18,732 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.89 vs. limit=22.5
2023-10-12 12:44:19,900 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1060504.6666666667, ans=0.0
2023-10-12 12:44:26,318 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1060504.6666666667, ans=0.125
2023-10-12 12:44:26,369 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1060504.6666666667, ans=0.125
2023-10-12 12:44:26,951 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.628e+02 1.744e+02 1.938e+02 2.517e+02, threshold=3.487e+02, percent-clipped=0.0
2023-10-12 12:44:27,420 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1060504.6666666667, ans=0.5
2023-10-12 12:44:43,870 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1060598.0, ans=0.125
2023-10-12 12:44:59,786 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1060644.6666666667, ans=0.1
2023-10-12 12:45:25,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1060784.6666666667, ans=0.125
2023-10-12 12:45:44,117 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1060878.0, ans=0.0
2023-10-12 12:45:55,630 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.50 vs. limit=15.0
2023-10-12 12:46:07,906 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1060971.3333333333, ans=0.0
2023-10-12 12:46:13,230 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1060971.3333333333, ans=0.125
2023-10-12 12:46:14,334 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.250e+02 1.695e+02 1.807e+02 1.988e+02 3.231e+02, threshold=3.613e+02, percent-clipped=0.0
2023-10-12 12:46:41,597 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1061111.3333333333, ans=0.125
2023-10-12 12:46:41,806 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.55 vs. limit=22.5
2023-10-12 12:46:44,654 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1061111.3333333333, ans=0.2
2023-10-12 12:46:51,394 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1061158.0, ans=0.125
2023-10-12 12:46:54,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1061158.0, ans=0.125
2023-10-12 12:46:55,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1061158.0, ans=0.09899494936611666
2023-10-12 12:47:07,633 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1061204.6666666667, ans=0.125
2023-10-12 12:47:37,747 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1061344.6666666667, ans=0.1
2023-10-12 12:48:03,493 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.18 vs. limit=15.0
2023-10-12 12:48:04,814 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.721e+02 1.907e+02 2.136e+02 3.255e+02, threshold=3.815e+02, percent-clipped=0.0
2023-10-12 12:48:10,747 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1061484.6666666667, ans=0.0
2023-10-12 12:48:23,170 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.52 vs. limit=6.0
2023-10-12 12:48:26,731 INFO [train.py:1031] (0/4) Epoch 17, batch 9000, loss[loss=0.2172, simple_loss=0.3156, pruned_loss=0.05936, over 16645.00 frames. ], tot_loss[loss=0.1931, simple_loss=0.284, pruned_loss=0.05106, over 32442444.82 frames. ], batch size: 202, lr: 2.04e-03, grad_scale: 8.0
2023-10-12 12:48:54,519 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1061671.3333333333, ans=0.04949747468305833
2023-10-12 12:48:55,602 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1061671.3333333333, ans=0.1
2023-10-12 12:49:08,147 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1061764.6666666667, ans=0.0
2023-10-12 12:49:13,469 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1061764.6666666667, ans=0.0
2023-10-12 12:49:17,167 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1061764.6666666667, ans=0.0
2023-10-12 12:49:43,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1061904.6666666667, ans=0.0
2023-10-12 12:49:47,208 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.385e+02 1.676e+02 1.876e+02 2.157e+02 3.505e+02, threshold=3.753e+02, percent-clipped=0.0
2023-10-12 12:49:55,072 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1061951.3333333333, ans=0.125
2023-10-12 12:50:00,607 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1061998.0, ans=0.125
2023-10-12 12:50:16,349 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1062044.6666666667, ans=0.125
2023-10-12 12:50:37,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1062138.0, ans=0.125
2023-10-12 12:50:48,898 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1062184.6666666667, ans=0.1
2023-10-12 12:51:29,377 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1062371.3333333333, ans=0.5
2023-10-12 12:51:30,120 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1062371.3333333333, ans=0.1
2023-10-12 12:51:37,805 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.774e+02 1.870e+02 2.130e+02 3.113e+02, threshold=3.739e+02, percent-clipped=0.0
2023-10-12 12:51:40,858 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1062418.0, ans=0.125
2023-10-12 12:51:53,216 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1062464.6666666667, ans=0.125
2023-10-12 12:52:03,732 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1062511.3333333333, ans=0.125
2023-10-12 12:52:05,549 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1062511.3333333333, ans=0.1
2023-10-12 12:52:09,555 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=6.53 vs. limit=15.0
2023-10-12 12:52:20,307 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1062604.6666666667, ans=0.125
2023-10-12 12:52:22,683 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1062604.6666666667, ans=0.0
2023-10-12 12:52:45,872 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1062698.0, ans=0.125
2023-10-12 12:52:57,278 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1062744.6666666667, ans=0.05
2023-10-12 12:53:01,414 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-12 12:53:02,465 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1062791.3333333333, ans=0.0
2023-10-12 12:53:05,751 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1062791.3333333333, ans=0.125
2023-10-12 12:53:19,973 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1062838.0, ans=0.1
2023-10-12 12:53:22,494 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.486e+02 1.715e+02 1.910e+02 2.133e+02 3.286e+02, threshold=3.820e+02, percent-clipped=0.0
2023-10-12 12:53:26,952 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.04 vs. limit=15.0
2023-10-12 12:53:38,109 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1062931.3333333333, ans=0.0
2023-10-12 12:53:48,933 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1062978.0, ans=0.04949747468305833
2023-10-12 12:53:52,316 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1062978.0, ans=0.09899494936611666
2023-10-12 12:53:52,355 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1062978.0, ans=0.1
2023-10-12 12:54:00,655 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1063024.6666666667, ans=0.1
2023-10-12 12:54:05,694 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1063024.6666666667, ans=0.0
2023-10-12 12:54:29,049 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1063118.0, ans=0.125
2023-10-12 12:54:36,060 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1063164.6666666667, ans=0.125
2023-10-12 12:54:37,329 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.18 vs. limit=10.0
2023-10-12 12:54:50,360 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1063211.3333333333, ans=0.1
2023-10-12 12:55:05,503 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1063258.0, ans=0.2
2023-10-12 12:55:23,151 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.490e+02 1.816e+02 1.959e+02 2.108e+02 2.932e+02, threshold=3.919e+02, percent-clipped=0.0
2023-10-12 12:55:23,549 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1063351.3333333333, ans=0.05
2023-10-12 12:55:35,333 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1063398.0, ans=0.1
2023-10-12 12:55:39,452 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=1063398.0, ans=22.5
2023-10-12 12:55:39,970 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1063398.0, ans=0.0
2023-10-12 12:55:41,946 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1063398.0, ans=0.1
2023-10-12 12:55:59,178 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1063491.3333333333, ans=0.125
2023-10-12 12:56:20,265 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-12 12:56:38,234 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1063631.3333333333, ans=0.125
2023-10-12 12:56:55,393 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1063724.6666666667, ans=0.04949747468305833
2023-10-12 12:56:55,468 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.49 vs. limit=15.0
2023-10-12 12:56:58,107 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1063724.6666666667, ans=0.125
2023-10-12 12:57:03,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1063724.6666666667, ans=0.125
2023-10-12 12:57:15,185 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.75 vs. limit=10.0
2023-10-12 12:57:19,136 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.530e+02 1.839e+02 2.048e+02 2.293e+02 3.164e+02, threshold=4.096e+02, percent-clipped=0.0
2023-10-12 12:57:41,889 INFO [train.py:1031] (0/4) Epoch 17, batch 9500, loss[loss=0.2056, simple_loss=0.2988, pruned_loss=0.05623, over 16896.00 frames. ], tot_loss[loss=0.1935, simple_loss=0.2845, pruned_loss=0.05125, over 32503059.14 frames. ], batch size: 165, lr: 2.04e-03, grad_scale: 8.0
2023-10-12 12:58:00,986 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1063958.0, ans=0.0
2023-10-12 12:58:01,109 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1063958.0, ans=0.1
2023-10-12 12:58:06,998 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1064004.6666666667, ans=0.125
2023-10-12 12:58:15,648 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1064051.3333333333, ans=0.0
2023-10-12 12:58:38,690 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.12 vs. limit=15.0
2023-10-12 12:58:57,190 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1064191.3333333333, ans=0.0
2023-10-12 12:59:04,294 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1064238.0, ans=0.0
2023-10-12 12:59:04,351 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-12 12:59:10,499 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.356e+02 1.704e+02 1.873e+02 2.130e+02 2.900e+02, threshold=3.745e+02, percent-clipped=0.0
2023-10-12 12:59:15,517 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1064284.6666666667, ans=0.0
2023-10-12 12:59:22,721 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-12 12:59:57,996 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1064471.3333333333, ans=0.1
2023-10-12 13:00:13,808 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1064518.0, ans=0.015
2023-10-12 13:00:18,436 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1064564.6666666667, ans=0.1
2023-10-12 13:00:36,177 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1064611.3333333333, ans=0.0
2023-10-12 13:00:38,296 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.08 vs. limit=22.5
2023-10-12 13:00:44,573 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1064658.0, ans=0.125
2023-10-12 13:00:47,333 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1064658.0, ans=0.125
2023-10-12 13:00:52,696 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.62 vs. limit=15.0
2023-10-12 13:01:06,488 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.371e+02 1.691e+02 1.938e+02 2.154e+02 3.488e+02, threshold=3.875e+02, percent-clipped=0.0
2023-10-12 13:01:15,415 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1064798.0, ans=0.2
2023-10-12 13:01:16,435 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1064798.0, ans=0.125
2023-10-12 13:01:49,376 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1064938.0, ans=0.2
2023-10-12 13:02:04,575 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1064984.6666666667, ans=0.125
2023-10-12 13:02:18,475 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1065031.3333333333, ans=0.125
2023-10-12 13:02:31,521 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.47 vs. limit=22.5
2023-10-12 13:02:41,344 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1065124.6666666667, ans=0.125
2023-10-12 13:02:47,798 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.49 vs. limit=15.0
2023-10-12 13:02:48,693 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1065171.3333333333, ans=0.125
2023-10-12 13:02:56,794 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1065218.0, ans=6.0
2023-10-12 13:02:58,381 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.447e+02 1.804e+02 1.966e+02 2.153e+02 2.862e+02, threshold=3.932e+02, percent-clipped=0.0
2023-10-12 13:03:06,653 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1065218.0, ans=0.0
2023-10-12 13:03:07,555 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1065218.0, ans=0.125
2023-10-12 13:03:12,556 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1065264.6666666667, ans=0.125
2023-10-12 13:03:13,496 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1065264.6666666667, ans=0.1
2023-10-12 13:03:13,591 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1065264.6666666667, ans=0.125
2023-10-12 13:03:36,091 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1065358.0, ans=0.0
2023-10-12 13:03:42,735 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1065404.6666666667, ans=0.125
2023-10-12 13:03:58,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1065451.3333333333, ans=0.2
2023-10-12 13:04:03,675 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1065498.0, ans=0.125
2023-10-12 13:04:04,743 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1065498.0, ans=0.1
2023-10-12 13:04:12,773 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.13 vs. limit=15.0
2023-10-12 13:04:29,221 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.62 vs. limit=15.0
2023-10-12 13:04:39,210 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.18 vs. limit=10.0
2023-10-12 13:04:41,339 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.20 vs. limit=15.0
2023-10-12 13:04:47,683 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.458e+02 1.679e+02 1.810e+02 2.151e+02 3.239e+02, threshold=3.621e+02, percent-clipped=0.0
2023-10-12 13:05:05,801 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1065731.3333333333, ans=0.0
2023-10-12 13:05:17,183 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.02 vs. limit=22.5
2023-10-12 13:05:22,412 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1065824.6666666667, ans=0.0
2023-10-12 13:05:25,903 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.06 vs. limit=15.0
2023-10-12 13:05:27,091 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1065824.6666666667, ans=0.2
2023-10-12 13:05:41,761 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.99 vs. limit=6.0
2023-10-12 13:06:18,415 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.39 vs. limit=15.0
2023-10-12 13:06:24,897 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.79 vs. limit=15.0
2023-10-12 13:06:33,818 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.705e+02 1.860e+02 2.121e+02 3.213e+02, threshold=3.721e+02, percent-clipped=0.0
2023-10-12 13:06:42,390 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1066198.0, ans=0.125
2023-10-12 13:06:46,183 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.29 vs. limit=6.0
2023-10-12 13:06:53,083 INFO [train.py:1031] (0/4) Epoch 17, batch 10000, loss[loss=0.1881, simple_loss=0.2487, pruned_loss=0.06379, over 12526.00 frames. ], tot_loss[loss=0.1928, simple_loss=0.2837, pruned_loss=0.05096, over 32571522.89 frames. ], batch size: 440, lr: 2.04e-03, grad_scale: 16.0
2023-10-12 13:06:53,287 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1066244.6666666667, ans=0.125
2023-10-12 13:07:18,635 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1066338.0, ans=0.125
2023-10-12 13:07:28,907 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.24 vs. limit=15.0
2023-10-12 13:07:40,271 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1066431.3333333333, ans=0.125
2023-10-12 13:07:52,646 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1066478.0, ans=0.125
2023-10-12 13:07:59,944 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.33 vs. limit=15.0
2023-10-12 13:08:01,766 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1066524.6666666667, ans=0.035
2023-10-12 13:08:22,007 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.390e+02 1.734e+02 1.873e+02 2.050e+02 3.057e+02, threshold=3.746e+02, percent-clipped=0.0
2023-10-12 13:08:28,148 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1066618.0, ans=0.07
2023-10-12 13:08:30,819 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1066618.0, ans=0.125
2023-10-12 13:08:46,992 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1066711.3333333333, ans=0.2
2023-10-12 13:08:48,154 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1066711.3333333333, ans=0.125
2023-10-12 13:09:01,546 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1066758.0, ans=0.2
2023-10-12 13:09:05,205 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.66 vs. limit=15.0
2023-10-12 13:09:05,211 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.69 vs. limit=10.0
2023-10-12 13:09:08,841 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1066804.6666666667, ans=0.0
2023-10-12 13:09:10,973 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1066804.6666666667, ans=0.125
2023-10-12 13:09:18,674 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1066851.3333333333, ans=0.125
2023-10-12 13:09:26,823 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1066851.3333333333, ans=0.2
2023-10-12 13:09:31,825 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.34 vs. limit=6.0
2023-10-12 13:09:55,907 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1066991.3333333333, ans=0.0
2023-10-12 13:09:56,658 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=1066991.3333333333, ans=10.0
2023-10-12 13:10:11,873 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.747e+02 1.915e+02 2.150e+02 3.032e+02, threshold=3.830e+02, percent-clipped=0.0
2023-10-12 13:10:13,570 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.35 vs. limit=22.5
2023-10-12 13:10:18,063 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1067084.6666666667, ans=0.0
2023-10-12 13:10:34,114 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1067178.0, ans=0.1
2023-10-12 13:11:16,480 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1067318.0, ans=0.125
2023-10-12 13:11:25,546 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1067364.6666666667, ans=0.125
2023-10-12 13:11:32,536 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1067411.3333333333, ans=0.125
2023-10-12 13:11:33,696 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.98 vs. limit=15.0
2023-10-12 13:12:09,256 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.329e+02 1.779e+02 1.929e+02 2.141e+02 2.754e+02, threshold=3.858e+02, percent-clipped=0.0
2023-10-12 13:12:09,526 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1067551.3333333333, ans=0.1
2023-10-12 13:12:22,921 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1067598.0, ans=0.0
2023-10-12 13:12:28,927 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.58 vs. limit=15.0
2023-10-12 13:13:16,031 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1067831.3333333333, ans=0.2
2023-10-12 13:13:45,130 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1067924.6666666667, ans=0.0
2023-10-12 13:13:52,129 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1067971.3333333333, ans=0.125
2023-10-12 13:13:54,592 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1067971.3333333333, ans=0.05
2023-10-12 13:13:54,875 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.64 vs. limit=15.0
2023-10-12 13:14:01,639 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1068018.0, ans=0.0
2023-10-12 13:14:02,375 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.397e+02 1.713e+02 1.858e+02 2.086e+02 2.743e+02, threshold=3.716e+02, percent-clipped=0.0
2023-10-12 13:14:08,127 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=12.0
2023-10-12 13:14:12,001 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1068064.6666666667, ans=0.2
2023-10-12 13:14:13,507 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.98 vs. limit=22.5
2023-10-12 13:14:14,193 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1068064.6666666667, ans=0.0
2023-10-12 13:14:17,064 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.85 vs. limit=15.0
2023-10-12 13:14:22,713 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1068064.6666666667, ans=0.0
2023-10-12 13:14:29,836 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=1068111.3333333333, ans=22.5
2023-10-12 13:14:52,383 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.60 vs. limit=15.0
2023-10-12 13:14:57,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1068251.3333333333, ans=0.125
2023-10-12 13:15:25,386 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1068344.6666666667, ans=0.125
2023-10-12 13:15:53,892 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.763e+02 2.004e+02 2.248e+02 3.077e+02, threshold=4.008e+02, percent-clipped=0.0
2023-10-12 13:15:57,839 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1068484.6666666667, ans=0.125
2023-10-12 13:16:01,999 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1068484.6666666667, ans=0.0
2023-10-12 13:16:06,941 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1068531.3333333333, ans=0.125
2023-10-12 13:16:09,071 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.84 vs. limit=15.0
2023-10-12 13:16:10,118 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.46 vs. limit=15.0
2023-10-12 13:16:13,554 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1068578.0, ans=0.125
2023-10-12 13:16:14,255 INFO [train.py:1031] (0/4) Epoch 17, batch 10500, loss[loss=0.1673, simple_loss=0.2652, pruned_loss=0.03474, over 16812.00 frames. ], tot_loss[loss=0.1936, simple_loss=0.2846, pruned_loss=0.0513, over 32629077.01 frames. ], batch size: 87, lr: 2.03e-03, grad_scale: 32.0
2023-10-12 13:16:14,794 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.06 vs. limit=15.0
2023-10-12 13:16:16,189 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1068578.0, ans=0.125
2023-10-12 13:16:26,270 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1068624.6666666667, ans=0.1
2023-10-12 13:16:28,058 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1068624.6666666667, ans=0.125
2023-10-12 13:16:29,313 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.53 vs. limit=12.0
2023-10-12 13:16:30,439 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1068624.6666666667, ans=0.0
2023-10-12 13:16:32,376 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1068624.6666666667, ans=0.0
2023-10-12 13:16:42,891 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1068671.3333333333, ans=0.125
2023-10-12 13:16:43,771 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1068718.0, ans=0.0
2023-10-12 13:17:48,685 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 1.757e+02 1.856e+02 2.039e+02 2.751e+02, threshold=3.712e+02, percent-clipped=0.0
2023-10-12 13:17:52,868 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1068951.3333333333, ans=0.125
2023-10-12 13:18:06,226 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1068998.0, ans=0.125
2023-10-12 13:18:07,276 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1068998.0, ans=0.125
2023-10-12 13:18:16,518 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-12 13:18:20,321 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1069044.6666666667, ans=0.0
2023-10-12 13:18:28,822 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1069091.3333333333, ans=0.2
2023-10-12 13:18:55,512 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-12 13:19:11,556 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1069278.0, ans=0.1
2023-10-12 13:19:18,135 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1069324.6666666667, ans=0.1
2023-10-12 13:19:18,882 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1069324.6666666667, ans=0.05
2023-10-12 13:19:42,693 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.462e+02 1.765e+02 1.930e+02 2.122e+02 3.086e+02, threshold=3.859e+02, percent-clipped=0.0
2023-10-12 13:19:48,872 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.65 vs. limit=15.0
2023-10-12 13:20:06,061 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1069511.3333333333, ans=0.2
2023-10-12 13:20:11,108 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-12 13:20:19,528 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1069558.0, ans=0.0
2023-10-12 13:20:26,162 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1069604.6666666667, ans=0.04949747468305833
2023-10-12 13:20:26,182 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1069604.6666666667, ans=0.2
2023-10-12 13:20:29,021 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-12 13:20:56,553 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.83 vs. limit=22.5
2023-10-12 13:21:14,445 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1069791.3333333333, ans=0.125
2023-10-12 13:21:30,508 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1069884.6666666667, ans=0.05
2023-10-12 13:21:31,381 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1069884.6666666667, ans=0.035
2023-10-12 13:21:32,994 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.875e+02 2.089e+02 2.329e+02 3.653e+02, threshold=4.177e+02, percent-clipped=0.0
2023-10-12 13:21:35,152 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1069884.6666666667, ans=0.0
2023-10-12 13:21:49,563 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.48 vs. limit=10.0
2023-10-12 13:22:29,553 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1070118.0, ans=0.2
2023-10-12 13:22:29,688 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.73 vs. limit=15.0
2023-10-12 13:22:31,166 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1070118.0, ans=0.05
2023-10-12 13:22:55,796 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.85 vs. limit=15.0
2023-10-12 13:22:59,790 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1070258.0, ans=0.1
2023-10-12 13:23:23,985 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.415e+02 1.702e+02 1.853e+02 2.050e+02 3.321e+02, threshold=3.706e+02, percent-clipped=0.0
2023-10-12 13:23:25,385 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1070351.3333333333, ans=0.1
2023-10-12 13:23:29,804 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.84 vs. limit=15.0
2023-10-12 13:23:35,185 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1070398.0, ans=0.0
2023-10-12 13:23:50,640 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1070444.6666666667, ans=0.0
2023-10-12 13:23:52,617 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-10-12 13:24:01,196 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1070491.3333333333, ans=0.125
2023-10-12 13:24:04,320 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.23 vs. limit=15.0
2023-10-12 13:24:18,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1070584.6666666667, ans=0.125
2023-10-12 13:24:34,748 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1070631.3333333333, ans=0.125
2023-10-12 13:24:42,049 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1070678.0, ans=0.125
2023-10-12 13:24:49,889 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1070724.6666666667, ans=0.1
2023-10-12 13:25:14,067 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.526e+02 1.790e+02 1.989e+02 2.238e+02 3.576e+02, threshold=3.979e+02, percent-clipped=0.0
2023-10-12 13:25:16,415 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1070818.0, ans=0.125
2023-10-12 13:25:21,445 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1070864.6666666667, ans=0.035
2023-10-12 13:25:25,239 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1070864.6666666667, ans=0.125
2023-10-12 13:25:32,143 INFO [train.py:1031] (0/4) Epoch 17, batch 11000, loss[loss=0.2008, simple_loss=0.2895, pruned_loss=0.05601, over 15460.00 frames. ], tot_loss[loss=0.1938, simple_loss=0.2845, pruned_loss=0.05155, over 32620107.98 frames. ], batch size: 35, lr: 2.03e-03, grad_scale: 16.0
2023-10-12 13:26:00,868 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1071004.6666666667, ans=0.1
2023-10-12 13:26:01,792 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1071051.3333333333, ans=0.1
2023-10-12 13:26:06,094 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1071051.3333333333, ans=0.125
2023-10-12 13:26:09,690 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1071051.3333333333, ans=0.0
2023-10-12 13:26:12,760 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1071098.0, ans=0.125
2023-10-12 13:26:12,828 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1071098.0, ans=0.0
2023-10-12 13:26:25,629 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1071144.6666666667, ans=0.125
2023-10-12 13:26:33,113 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1071144.6666666667, ans=0.125
2023-10-12 13:26:43,214 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1071191.3333333333, ans=0.0
2023-10-12 13:27:03,934 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.735e+02 1.926e+02 2.173e+02 3.469e+02, threshold=3.852e+02, percent-clipped=0.0
2023-10-12 13:27:04,302 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1071284.6666666667, ans=0.0
2023-10-12 13:27:06,622 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=9.78 vs. limit=22.5
2023-10-12 13:27:07,798 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1071284.6666666667, ans=15.0
2023-10-12 13:27:36,421 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1071378.0, ans=0.125
2023-10-12 13:27:38,375 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.41 vs. limit=22.5
2023-10-12 13:27:42,955 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1071424.6666666667, ans=0.0
2023-10-12 13:27:42,982 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1071424.6666666667, ans=0.125
2023-10-12 13:27:51,277 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1071471.3333333333, ans=0.125
2023-10-12 13:28:11,281 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1071518.0, ans=0.1
2023-10-12 13:28:29,455 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1071611.3333333333, ans=0.07
2023-10-12 13:28:38,788 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1071658.0, ans=0.0
2023-10-12 13:29:03,154 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.340e+02 1.618e+02 1.812e+02 1.995e+02 3.412e+02, threshold=3.624e+02, percent-clipped=0.0
2023-10-12 13:29:32,637 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1071891.3333333333, ans=0.0
2023-10-12 13:29:48,185 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.46 vs. limit=22.5
2023-10-12 13:29:48,798 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1071938.0, ans=0.125
2023-10-12 13:30:11,290 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1072031.3333333333, ans=15.0
2023-10-12 13:30:14,639 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1072078.0, ans=0.125
2023-10-12 13:30:23,467 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.21 vs. limit=10.0
2023-10-12 13:30:27,100 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1072124.6666666667, ans=0.125
2023-10-12 13:30:38,964 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1072171.3333333333, ans=0.125
2023-10-12 13:30:55,150 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.468e+02 1.707e+02 1.882e+02 2.064e+02 2.663e+02, threshold=3.763e+02, percent-clipped=0.0
2023-10-12 13:31:21,942 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1072311.3333333333, ans=0.1
2023-10-12 13:31:29,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1072358.0, ans=0.1
2023-10-12 13:31:30,843 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.96 vs. limit=22.5
2023-10-12 13:31:57,089 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1072451.3333333333, ans=0.1
2023-10-12 13:32:04,824 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1072498.0, ans=0.0
2023-10-12 13:32:15,776 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.11 vs. limit=15.0
2023-10-12 13:32:23,222 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1072591.3333333333, ans=0.0
2023-10-12 13:32:47,295 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1072684.6666666667, ans=0.125
2023-10-12 13:32:48,008 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.527e+02 1.792e+02 1.979e+02 2.254e+02 3.040e+02, threshold=3.957e+02, percent-clipped=0.0
2023-10-12 13:32:48,719 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.45 vs. limit=15.0
2023-10-12 13:32:59,563 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.41 vs. limit=15.0
2023-10-12 13:33:23,990 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1072824.6666666667, ans=0.0
2023-10-12 13:33:47,414 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1072918.0, ans=0.125
2023-10-12 13:34:05,898 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1073011.3333333333, ans=0.0
2023-10-12 13:34:35,387 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1073104.6666666667, ans=0.2
2023-10-12 13:34:45,911 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.541e+02 1.824e+02 2.018e+02 2.300e+02 3.205e+02, threshold=4.037e+02, percent-clipped=0.0
2023-10-12 13:34:46,221 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1073151.3333333333, ans=0.035
2023-10-12 13:34:46,271 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1073151.3333333333, ans=0.0
2023-10-12 13:34:52,088 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1073198.0, ans=0.2
2023-10-12 13:35:06,600 INFO [train.py:1031] (0/4) Epoch 17, batch 11500, loss[loss=0.196, simple_loss=0.2873, pruned_loss=0.05234, over 16582.00 frames. ], tot_loss[loss=0.1934, simple_loss=0.2842, pruned_loss=0.05127, over 32683374.16 frames. ], batch size: 266, lr: 2.03e-03, grad_scale: 32.0
2023-10-12 13:35:21,811 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1073291.3333333333, ans=0.2
2023-10-12 13:35:22,845 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1073291.3333333333, ans=0.125
2023-10-12 13:35:29,984 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1073338.0, ans=0.5
2023-10-12 13:35:56,556 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1073431.3333333333, ans=0.09899494936611666
2023-10-12 13:36:00,964 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn1.whiten.whitening_limit, batch_count=1073431.3333333333, ans=22.5
2023-10-12 13:36:02,570 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1073431.3333333333, ans=0.125
2023-10-12 13:36:09,731 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1073478.0, ans=0.125
2023-10-12 13:36:27,648 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-12 13:36:31,937 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1073571.3333333333, ans=0.1
2023-10-12 13:36:42,642 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.697e+02 1.844e+02 2.086e+02 2.731e+02, threshold=3.689e+02, percent-clipped=0.0
2023-10-12 13:36:50,536 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1073618.0, ans=0.125
2023-10-12 13:36:58,569 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.19 vs. limit=15.0
2023-10-12 13:36:59,938 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.34 vs. limit=15.0
2023-10-12 13:37:01,966 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1073664.6666666667, ans=0.0
2023-10-12 13:37:02,329 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.46 vs. limit=15.0
2023-10-12 13:37:29,289 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1073804.6666666667, ans=0.125
2023-10-12 13:37:32,401 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=1073804.6666666667, ans=0.05
2023-10-12 13:37:42,192 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2023-10-12 13:37:44,014 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1073851.3333333333, ans=0.125
2023-10-12 13:37:48,614 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.07 vs. limit=15.0
2023-10-12 13:37:59,255 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1073898.0, ans=0.1
2023-10-12 13:38:03,235 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.01 vs. limit=22.5
2023-10-12 13:38:31,450 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1074038.0, ans=0.0
2023-10-12 13:38:43,327 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.642e+02 1.835e+02 2.085e+02 2.917e+02, threshold=3.670e+02, percent-clipped=0.0
2023-10-12 13:39:05,289 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1074178.0, ans=0.2
2023-10-12 13:39:36,155 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.62 vs. limit=6.0
2023-10-12 13:39:45,630 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1074364.6666666667, ans=0.2
2023-10-12 13:40:49,617 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.410e+02 1.746e+02 1.944e+02 2.115e+02 3.203e+02, threshold=3.887e+02, percent-clipped=0.0
2023-10-12 13:40:52,633 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1074551.3333333333, ans=0.0
2023-10-12 13:41:12,730 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.40 vs. limit=15.0
2023-10-12 13:41:34,763 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1074738.0, ans=0.125
2023-10-12 13:41:38,280 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1074738.0, ans=0.0
2023-10-12 13:42:17,969 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1074924.6666666667, ans=0.125
2023-10-12 13:42:38,797 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1074971.3333333333, ans=0.125
2023-10-12 13:42:47,974 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.406e+02 1.715e+02 1.909e+02 2.108e+02 2.697e+02, threshold=3.817e+02, percent-clipped=0.0
2023-10-12 13:42:59,146 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1075064.6666666667, ans=0.125
2023-10-12 13:43:00,079 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1075064.6666666667, ans=0.125
2023-10-12 13:43:12,177 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.55 vs. limit=12.0
2023-10-12 13:43:14,502 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.82 vs. limit=15.0
2023-10-12 13:43:31,106 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1075204.6666666667, ans=0.125
2023-10-12 13:43:44,409 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1075251.3333333333, ans=0.125
2023-10-12 13:43:51,239 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1075298.0, ans=0.125
2023-10-12 13:43:54,491 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1075298.0, ans=0.125
2023-10-12 13:44:06,338 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1075344.6666666667, ans=0.1
2023-10-12 13:44:26,136 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1075438.0, ans=0.0
2023-10-12 13:44:28,772 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.00 vs. limit=15.0
2023-10-12 13:44:30,640 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.92 vs. limit=10.0
2023-10-12 13:44:34,381 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1075438.0, ans=0.125
2023-10-12 13:44:42,750 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.393e+02 1.733e+02 1.913e+02 2.164e+02 3.183e+02, threshold=3.826e+02, percent-clipped=0.0
2023-10-12 13:44:59,405 INFO [train.py:1031] (0/4) Epoch 17, batch 12000, loss[loss=0.1873, simple_loss=0.2772, pruned_loss=0.0487, over 16341.00 frames. ], tot_loss[loss=0.1933, simple_loss=0.2843, pruned_loss=0.05113, over 32703200.49 frames. ], batch size: 50, lr: 2.03e-03, grad_scale: 32.0
2023-10-12 13:45:00,437 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.48 vs. limit=15.0
2023-10-12 13:45:09,995 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1075578.0, ans=0.0
2023-10-12 13:45:18,846 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=5.44 vs. limit=12.0
2023-10-12 13:45:24,438 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1075671.3333333333, ans=0.035
2023-10-12 13:45:41,720 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1075718.0, ans=0.0
2023-10-12 13:45:43,886 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1075718.0, ans=0.1
2023-10-12 13:45:43,889 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1075718.0, ans=0.1
2023-10-12 13:45:47,592 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1075764.6666666667, ans=0.1
2023-10-12 13:45:58,473 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=1075811.3333333333, ans=0.95
2023-10-12 13:46:04,178 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.37 vs. limit=22.5
2023-10-12 13:46:08,522 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1075811.3333333333, ans=0.0
2023-10-12 13:46:12,183 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1075858.0, ans=0.95
2023-10-12 13:46:19,529 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_abs, batch_count=1075858.0, ans=0.5
2023-10-12 13:46:31,971 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.52 vs. limit=10.0
2023-10-12 13:46:39,999 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.667e+02 1.885e+02 2.136e+02 3.356e+02, threshold=3.771e+02, percent-clipped=0.0
2023-10-12 13:46:43,929 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.56 vs. limit=15.0
2023-10-12 13:46:52,381 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1075998.0, ans=0.0
2023-10-12 13:46:52,495 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1075998.0, ans=0.125
2023-10-12 13:47:03,536 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=1.027e-02
2023-10-12 13:48:00,409 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.23 vs. limit=15.0
2023-10-12 13:48:05,431 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1076324.6666666667, ans=0.125
2023-10-12 13:48:05,587 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.36 vs. limit=22.5
2023-10-12 13:48:20,527 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.31 vs.
limit=10.0 2023-10-12 13:48:32,121 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.477e+02 1.726e+02 1.841e+02 2.196e+02 3.997e+02, threshold=3.683e+02, percent-clipped=1.0 2023-10-12 13:49:08,454 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=5.75 vs. limit=15.0 2023-10-12 13:49:10,964 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1076604.6666666667, ans=0.125 2023-10-12 13:49:20,441 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1076651.3333333333, ans=0.125 2023-10-12 13:49:25,718 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.14 vs. limit=10.0 2023-10-12 13:49:40,282 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1076744.6666666667, ans=0.07 2023-10-12 13:49:46,111 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1076744.6666666667, ans=0.125 2023-10-12 13:49:54,648 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1076791.3333333333, ans=0.0 2023-10-12 13:50:09,398 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1076838.0, ans=0.125 2023-10-12 13:50:22,134 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.415e+02 1.697e+02 1.860e+02 2.117e+02 2.878e+02, threshold=3.719e+02, percent-clipped=0.0 2023-10-12 13:50:22,796 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.09 vs. limit=22.5 2023-10-12 13:50:37,851 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1076978.0, ans=0.0 2023-10-12 13:50:44,311 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1076978.0, ans=0.0 2023-10-12 13:50:46,444 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.88 vs. 
limit=22.5 2023-10-12 13:52:18,620 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1077351.3333333333, ans=0.125 2023-10-12 13:52:25,797 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.486e+02 1.778e+02 1.970e+02 2.151e+02 3.385e+02, threshold=3.939e+02, percent-clipped=0.0 2023-10-12 13:52:53,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1077491.3333333333, ans=0.1 2023-10-12 13:53:06,824 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1077538.0, ans=0.125 2023-10-12 13:53:10,197 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1077538.0, ans=0.125 2023-10-12 13:53:18,343 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1077584.6666666667, ans=0.1 2023-10-12 13:53:56,046 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1077724.6666666667, ans=0.0 2023-10-12 13:54:06,495 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.59 vs. limit=10.0 2023-10-12 13:54:09,311 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1077771.3333333333, ans=0.125 2023-10-12 13:54:16,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1077818.0, ans=0.125 2023-10-12 13:54:22,790 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.732e+02 1.891e+02 2.114e+02 4.029e+02, threshold=3.783e+02, percent-clipped=1.0 2023-10-12 13:54:32,476 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1077864.6666666667, ans=0.2 2023-10-12 13:54:34,722 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.98 vs. limit=15.0 2023-10-12 13:54:35,371 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1077864.6666666667, ans=0.125 2023-10-12 13:54:37,611 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1077911.3333333333, ans=0.0 2023-10-12 13:54:38,231 INFO [train.py:1031] (0/4) Epoch 17, batch 12500, loss[loss=0.1703, simple_loss=0.2692, pruned_loss=0.03573, over 16876.00 frames. ], tot_loss[loss=0.1932, simple_loss=0.2841, pruned_loss=0.05113, over 32715113.16 frames. ], batch size: 93, lr: 2.02e-03, grad_scale: 16.0 2023-10-12 13:54:40,373 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.91 vs. 
limit=22.5 2023-10-12 13:54:58,271 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1077958.0, ans=0.05 2023-10-12 13:54:59,655 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1077958.0, ans=0.125 2023-10-12 13:55:09,337 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1078004.6666666667, ans=0.125 2023-10-12 13:55:13,807 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1078051.3333333333, ans=0.125 2023-10-12 13:55:16,850 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1078051.3333333333, ans=0.0 2023-10-12 13:55:17,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1078051.3333333333, ans=0.2 2023-10-12 13:55:18,998 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1078051.3333333333, ans=0.125 2023-10-12 13:55:32,673 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1078144.6666666667, ans=0.0 2023-10-12 13:55:46,077 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.41 vs. limit=12.0 2023-10-12 13:55:52,436 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.90 vs. limit=15.0 2023-10-12 13:56:14,923 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1078284.6666666667, ans=0.2 2023-10-12 13:56:15,561 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.371e+02 1.795e+02 2.000e+02 2.282e+02 3.214e+02, threshold=4.001e+02, percent-clipped=0.0 2023-10-12 13:56:26,659 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1078331.3333333333, ans=0.0 2023-10-12 13:56:41,775 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1078424.6666666667, ans=0.125 2023-10-12 13:56:45,555 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1078424.6666666667, ans=0.125 2023-10-12 13:57:08,236 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1078518.0, ans=0.125 2023-10-12 13:57:18,706 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1078564.6666666667, ans=0.2 2023-10-12 13:57:36,470 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1078611.3333333333, ans=0.0 2023-10-12 13:57:44,593 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1078658.0, ans=0.125 2023-10-12 13:57:58,653 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1078704.6666666667, ans=0.1 2023-10-12 13:58:11,368 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 
1.491e+02 1.744e+02 1.905e+02 2.200e+02 3.736e+02, threshold=3.809e+02, percent-clipped=0.0 2023-10-12 13:58:17,114 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1078798.0, ans=0.1 2023-10-12 13:58:36,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1078891.3333333333, ans=0.125 2023-10-12 13:58:40,234 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1078891.3333333333, ans=0.2 2023-10-12 13:58:46,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1078938.0, ans=0.125 2023-10-12 13:59:04,487 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.93 vs. limit=15.0 2023-10-12 13:59:17,570 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1079078.0, ans=0.125 2023-10-12 13:59:45,109 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1079171.3333333333, ans=0.0 2023-10-12 14:00:01,422 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1079218.0, ans=15.0 2023-10-12 14:00:01,658 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.417e+02 1.703e+02 1.890e+02 2.203e+02 3.691e+02, threshold=3.781e+02, percent-clipped=0.0 2023-10-12 14:00:16,239 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1079311.3333333333, ans=0.125 2023-10-12 14:00:24,360 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1079311.3333333333, ans=0.125 2023-10-12 14:00:25,269 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1079311.3333333333, ans=0.125 2023-10-12 14:00:46,327 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1079404.6666666667, ans=0.125 2023-10-12 14:00:53,995 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1079451.3333333333, ans=0.125 2023-10-12 14:01:05,315 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.38 vs. limit=15.0 2023-10-12 14:01:07,244 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.94 vs. limit=15.0 2023-10-12 14:01:32,749 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.52 vs. 
limit=12.0 2023-10-12 14:01:56,409 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1079684.6666666667, ans=0.0 2023-10-12 14:01:59,706 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 1.722e+02 1.835e+02 2.053e+02 4.548e+02, threshold=3.669e+02, percent-clipped=1.0 2023-10-12 14:02:00,970 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1079684.6666666667, ans=0.5 2023-10-12 14:02:15,941 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1079778.0, ans=0.0 2023-10-12 14:02:32,350 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1079824.6666666667, ans=0.125 2023-10-12 14:02:43,816 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1079871.3333333333, ans=0.125 2023-10-12 14:02:55,748 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1079918.0, ans=0.125 2023-10-12 14:03:11,529 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1080011.3333333333, ans=0.125 2023-10-12 14:03:12,740 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.16 vs. limit=15.0 2023-10-12 14:03:19,520 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1080011.3333333333, ans=0.125 2023-10-12 14:03:20,491 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1080058.0, ans=0.2 2023-10-12 14:03:27,166 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.56 vs. limit=15.0 2023-10-12 14:03:53,607 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.522e+02 1.726e+02 1.851e+02 2.039e+02 2.782e+02, threshold=3.701e+02, percent-clipped=0.0 2023-10-12 14:04:07,856 INFO [train.py:1031] (0/4) Epoch 17, batch 13000, loss[loss=0.2022, simple_loss=0.2883, pruned_loss=0.05809, over 16911.00 frames. ], tot_loss[loss=0.1935, simple_loss=0.2846, pruned_loss=0.05121, over 32744480.96 frames. ], batch size: 82, lr: 2.02e-03, grad_scale: 16.0 2023-10-12 14:04:11,318 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1080244.6666666667, ans=0.125 2023-10-12 14:04:22,840 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.96 vs. 
limit=5.0 2023-10-12 14:04:31,500 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1080338.0, ans=0.0 2023-10-12 14:04:33,597 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1080338.0, ans=0.125 2023-10-12 14:04:41,383 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1080338.0, ans=0.0 2023-10-12 14:05:00,072 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_ff2.min_abs, batch_count=1080431.3333333333, ans=0.1 2023-10-12 14:05:03,956 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1080431.3333333333, ans=0.0 2023-10-12 14:05:16,456 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1080478.0, ans=0.125 2023-10-12 14:05:16,833 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.56 vs. limit=15.0 2023-10-12 14:05:31,127 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1080524.6666666667, ans=0.0 2023-10-12 14:05:57,531 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.784e+02 1.973e+02 2.241e+02 3.500e+02, threshold=3.946e+02, percent-clipped=0.0 2023-10-12 14:05:58,882 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.75 vs. limit=12.0 2023-10-12 14:06:31,762 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1080758.0, ans=0.125 2023-10-12 14:06:51,941 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1080851.3333333333, ans=0.125 2023-10-12 14:07:21,138 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1080991.3333333333, ans=0.125 2023-10-12 14:07:56,686 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1081084.6666666667, ans=0.1 2023-10-12 14:07:57,195 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.378e+02 1.638e+02 1.827e+02 2.008e+02 2.937e+02, threshold=3.654e+02, percent-clipped=0.0 2023-10-12 14:08:06,674 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1081131.3333333333, ans=0.125 2023-10-12 14:08:08,540 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1081131.3333333333, ans=0.0 2023-10-12 14:08:20,056 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1081178.0, ans=0.2 2023-10-12 14:08:24,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1081224.6666666667, ans=6.0 2023-10-12 14:08:39,095 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1081271.3333333333, ans=0.125 2023-10-12 14:08:43,705 INFO 
[scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 14:08:51,741 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1081318.0, ans=0.0 2023-10-12 14:08:55,833 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1081318.0, ans=0.125 2023-10-12 14:09:01,575 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.41 vs. limit=10.0 2023-10-12 14:09:02,211 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1081364.6666666667, ans=0.1 2023-10-12 14:09:05,392 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.42 vs. limit=22.5 2023-10-12 14:09:25,751 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.63 vs. limit=15.0 2023-10-12 14:09:38,045 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1081504.6666666667, ans=0.125 2023-10-12 14:09:40,744 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 14:09:43,366 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1081551.3333333333, ans=0.1 2023-10-12 14:09:52,431 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.744e+02 1.858e+02 2.053e+02 2.712e+02, threshold=3.717e+02, percent-clipped=0.0 2023-10-12 14:10:09,601 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1081644.6666666667, ans=0.0 2023-10-12 14:10:10,760 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1081644.6666666667, ans=15.0 2023-10-12 14:10:18,523 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1081691.3333333333, ans=0.0 2023-10-12 14:10:23,887 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.39 vs. 
limit=12.0 2023-10-12 14:10:28,080 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1081738.0, ans=0.125 2023-10-12 14:10:36,905 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1081738.0, ans=0.125 2023-10-12 14:10:51,017 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=1081784.6666666667, ans=0.05 2023-10-12 14:10:55,803 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1081831.3333333333, ans=0.125 2023-10-12 14:11:06,153 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1081878.0, ans=0.0 2023-10-12 14:11:10,480 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1081878.0, ans=0.125 2023-10-12 14:11:11,311 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1081878.0, ans=0.0 2023-10-12 14:11:12,672 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.88 vs. limit=22.5 2023-10-12 14:11:13,205 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1081924.6666666667, ans=0.125 2023-10-12 14:11:41,237 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1082018.0, ans=0.0 2023-10-12 14:11:42,122 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1082018.0, ans=0.1 2023-10-12 14:11:47,577 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.280e+02 1.777e+02 1.922e+02 2.099e+02 2.846e+02, threshold=3.843e+02, percent-clipped=0.0 2023-10-12 14:11:55,242 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1082064.6666666667, ans=0.04949747468305833 2023-10-12 14:12:21,662 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1082158.0, ans=0.125 2023-10-12 14:12:26,443 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1082204.6666666667, ans=0.125 2023-10-12 14:12:42,819 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1082251.3333333333, ans=0.2 2023-10-12 14:12:54,925 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=4.03 vs. limit=12.0 2023-10-12 14:13:12,715 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1082391.3333333333, ans=0.0 2023-10-12 14:13:17,335 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.80 vs. 
limit=15.0 2023-10-12 14:13:31,830 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1082484.6666666667, ans=0.125 2023-10-12 14:13:40,257 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 1.709e+02 1.975e+02 2.183e+02 3.152e+02, threshold=3.950e+02, percent-clipped=0.0 2023-10-12 14:13:48,903 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1082531.3333333333, ans=0.125 2023-10-12 14:13:55,987 INFO [train.py:1031] (0/4) Epoch 17, batch 13500, loss[loss=0.2124, simple_loss=0.2962, pruned_loss=0.06426, over 16097.00 frames. ], tot_loss[loss=0.193, simple_loss=0.284, pruned_loss=0.05097, over 32780367.78 frames. ], batch size: 296, lr: 2.02e-03, grad_scale: 16.0 2023-10-12 14:14:14,382 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-232000.pt 2023-10-12 14:14:35,225 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1082718.0, ans=0.125 2023-10-12 14:14:39,176 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1082718.0, ans=0.0 2023-10-12 14:14:54,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1082811.3333333333, ans=0.125 2023-10-12 14:15:08,929 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1082858.0, ans=0.0 2023-10-12 14:15:21,844 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1082904.6666666667, ans=0.2 2023-10-12 14:15:36,502 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1082951.3333333333, ans=0.125 2023-10-12 14:15:38,099 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.549e+02 1.756e+02 2.008e+02 2.458e+02 4.260e+02, threshold=4.016e+02, percent-clipped=1.0 2023-10-12 14:15:41,119 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1082998.0, ans=0.125 2023-10-12 14:15:52,331 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1083044.6666666667, ans=0.125 2023-10-12 14:16:05,329 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1083091.3333333333, ans=0.1 2023-10-12 14:16:22,800 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1083184.6666666667, ans=0.125 2023-10-12 14:16:27,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1083184.6666666667, ans=0.1 2023-10-12 14:16:27,243 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1083184.6666666667, ans=0.125 2023-10-12 14:16:33,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1083231.3333333333, ans=0.125 2023-10-12 14:16:33,970 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1083231.3333333333, 
ans=0.125 2023-10-12 14:16:45,032 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/epoch-17.pt 2023-10-12 14:17:22,112 INFO [train.py:1031] (0/4) Epoch 18, batch 0, loss[loss=0.1651, simple_loss=0.2603, pruned_loss=0.03495, over 16865.00 frames. ], tot_loss[loss=0.1651, simple_loss=0.2603, pruned_loss=0.03495, over 16865.00 frames. ], batch size: 110, lr: 1.96e-03, grad_scale: 32.0 2023-10-12 14:17:22,112 INFO [train.py:1054] (0/4) Computing validation loss 2023-10-12 14:17:29,949 INFO [train.py:1063] (0/4) Epoch 18, validation: loss=0.2151, simple_loss=0.3024, pruned_loss=0.06384, over 1020973.00 frames. 2023-10-12 14:17:29,950 INFO [train.py:1064] (0/4) Maximum memory allocated so far is 17165MB 2023-10-12 14:17:33,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1083301.3333333333, ans=0.125 2023-10-12 14:17:34,522 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1083301.3333333333, ans=0.0 2023-10-12 14:18:11,935 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.679e+02 1.904e+02 2.260e+02 3.526e+02, threshold=3.807e+02, percent-clipped=0.0 2023-10-12 14:18:23,996 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1083488.0, ans=0.1 2023-10-12 14:18:25,227 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1083488.0, ans=0.125 2023-10-12 14:18:37,951 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=12.15 vs. limit=12.0 2023-10-12 14:18:55,694 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1083628.0, ans=0.125 2023-10-12 14:19:02,507 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1083628.0, ans=0.125 2023-10-12 14:19:05,279 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.57 vs. 
limit=15.0 2023-10-12 14:19:08,579 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1083674.6666666667, ans=0.0 2023-10-12 14:19:11,575 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=1083674.6666666667, ans=10.0 2023-10-12 14:19:23,653 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1083721.3333333333, ans=0.0 2023-10-12 14:19:33,859 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1083768.0, ans=0.0 2023-10-12 14:20:03,513 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 14:20:06,569 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.339e+02 1.651e+02 1.848e+02 2.026e+02 2.543e+02, threshold=3.697e+02, percent-clipped=0.0 2023-10-12 14:20:44,977 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1084048.0, ans=0.04949747468305833 2023-10-12 14:20:48,919 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1084048.0, ans=0.0 2023-10-12 14:21:05,568 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1084141.3333333333, ans=0.125 2023-10-12 14:21:11,094 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1084141.3333333333, ans=10.0 2023-10-12 14:21:18,551 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.71 vs. limit=15.0 2023-10-12 14:21:30,221 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.57 vs. 
limit=6.0 2023-10-12 14:21:40,349 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1084281.3333333333, ans=0.1 2023-10-12 14:22:01,893 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.471e+02 1.745e+02 1.941e+02 2.216e+02 3.290e+02, threshold=3.881e+02, percent-clipped=0.0 2023-10-12 14:22:04,299 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1084374.6666666667, ans=0.125 2023-10-12 14:22:15,173 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1084421.3333333333, ans=0.125 2023-10-12 14:22:16,353 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1084421.3333333333, ans=0.0 2023-10-12 14:22:41,427 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1084514.6666666667, ans=0.125 2023-10-12 14:22:44,215 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 14:22:44,310 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1084514.6666666667, ans=0.95 2023-10-12 14:22:49,189 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1084561.3333333333, ans=0.1 2023-10-12 14:23:00,396 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1084608.0, ans=0.0 2023-10-12 14:23:06,003 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1084608.0, ans=0.1 2023-10-12 14:23:16,341 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.20 vs. limit=22.5 2023-10-12 14:23:16,783 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1084654.6666666667, ans=0.0 2023-10-12 14:23:28,449 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1084701.3333333333, ans=0.0 2023-10-12 14:23:28,658 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.35 vs. limit=15.0 2023-10-12 14:23:46,776 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1084794.6666666667, ans=0.125 2023-10-12 14:23:52,633 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=4.93 vs. limit=15.0 2023-10-12 14:23:58,012 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.383e+02 1.717e+02 1.944e+02 2.111e+02 2.641e+02, threshold=3.888e+02, percent-clipped=0.0 2023-10-12 14:24:02,054 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.61 vs. limit=22.5 2023-10-12 14:24:05,998 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=27.52 vs. 
limit=22.5 2023-10-12 14:24:11,354 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1084888.0, ans=0.125 2023-10-12 14:24:12,848 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1084888.0, ans=0.125 2023-10-12 14:24:25,965 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.77 vs. limit=15.0 2023-10-12 14:24:26,447 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1084981.3333333333, ans=0.125 2023-10-12 14:24:35,942 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.54 vs. limit=15.0 2023-10-12 14:25:06,138 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 14:25:08,026 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1085121.3333333333, ans=0.125 2023-10-12 14:25:19,855 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1085168.0, ans=0.2 2023-10-12 14:25:37,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1085261.3333333333, ans=0.0 2023-10-12 14:25:50,516 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.380e+02 1.816e+02 1.984e+02 2.210e+02 3.172e+02, threshold=3.969e+02, percent-clipped=0.0 2023-10-12 14:26:15,120 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1085401.3333333333, ans=0.0 2023-10-12 14:26:15,222 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1085401.3333333333, ans=0.0 2023-10-12 14:26:27,652 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1085448.0, ans=0.125 2023-10-12 14:26:42,986 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.88 vs. limit=22.5 2023-10-12 14:26:47,300 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=4.70 vs. limit=15.0 2023-10-12 14:26:54,810 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1085541.3333333333, ans=10.0 2023-10-12 14:27:02,096 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.13 vs. limit=10.0 2023-10-12 14:27:06,834 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.51 vs. limit=12.0 2023-10-12 14:27:10,618 INFO [train.py:1031] (0/4) Epoch 18, batch 500, loss[loss=0.1816, simple_loss=0.275, pruned_loss=0.04415, over 16808.00 frames. ], tot_loss[loss=0.1926, simple_loss=0.2835, pruned_loss=0.05084, over 7277977.39 frames. 
], batch size: 146, lr: 1.96e-03, grad_scale: 32.0 2023-10-12 14:27:10,852 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1085634.6666666667, ans=0.0 2023-10-12 14:27:31,032 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1085681.3333333333, ans=0.125 2023-10-12 14:27:37,234 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.10 vs. limit=10.0 2023-10-12 14:27:45,449 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1085774.6666666667, ans=0.0 2023-10-12 14:27:48,618 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.425e+02 1.763e+02 1.930e+02 2.167e+02 2.962e+02, threshold=3.859e+02, percent-clipped=0.0 2023-10-12 14:27:50,007 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1085774.6666666667, ans=0.05 2023-10-12 14:28:01,829 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1085821.3333333333, ans=0.1 2023-10-12 14:28:18,787 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1085914.6666666667, ans=0.125 2023-10-12 14:28:22,827 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1085914.6666666667, ans=0.125 2023-10-12 14:28:38,341 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 14:28:49,526 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1086008.0, ans=0.2 2023-10-12 14:28:56,369 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1086054.6666666667, ans=0.125 2023-10-12 14:28:57,361 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1086054.6666666667, ans=0.125 2023-10-12 14:29:01,343 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1086054.6666666667, ans=0.125 2023-10-12 14:29:18,754 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1086148.0, ans=0.1 2023-10-12 14:29:28,656 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1086194.6666666667, ans=0.0 2023-10-12 14:29:40,210 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.481e+02 1.805e+02 1.989e+02 2.229e+02 3.113e+02, threshold=3.977e+02, percent-clipped=0.0 2023-10-12 14:29:57,122 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.96 vs. 
limit=15.0 2023-10-12 14:30:01,640 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1086334.6666666667, ans=0.0 2023-10-12 14:30:04,741 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1086334.6666666667, ans=0.125 2023-10-12 14:30:05,153 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=5.02 vs. limit=15.0 2023-10-12 14:30:19,496 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.60 vs. limit=15.0 2023-10-12 14:30:26,802 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1086428.0, ans=0.1 2023-10-12 14:30:31,215 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.59 vs. limit=15.0 2023-10-12 14:30:54,052 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1086568.0, ans=0.125 2023-10-12 14:31:09,731 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1086614.6666666667, ans=0.125 2023-10-12 14:31:25,944 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1086708.0, ans=0.125 2023-10-12 14:31:26,592 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.752e+02 1.928e+02 2.178e+02 3.659e+02, threshold=3.855e+02, percent-clipped=0.0 2023-10-12 14:31:28,466 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1086708.0, ans=0.125 2023-10-12 14:31:34,625 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.36 vs. 
limit=12.0 2023-10-12 14:31:44,273 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1086754.6666666667, ans=0.125 2023-10-12 14:31:50,287 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1086801.3333333333, ans=0.125 2023-10-12 14:31:56,248 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1086801.3333333333, ans=0.1 2023-10-12 14:32:19,898 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 14:32:26,601 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1086941.3333333333, ans=0.1 2023-10-12 14:32:54,573 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1087081.3333333333, ans=0.0 2023-10-12 14:32:57,583 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1087081.3333333333, ans=0.125 2023-10-12 14:32:59,369 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1087081.3333333333, ans=0.1 2023-10-12 14:33:15,066 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1087128.0, ans=0.125 2023-10-12 14:33:16,618 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1087128.0, ans=0.1 2023-10-12 14:33:29,553 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.777e+02 1.924e+02 2.195e+02 3.258e+02, threshold=3.848e+02, percent-clipped=0.0 2023-10-12 14:33:44,688 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=12.96 vs. limit=15.0 2023-10-12 14:33:45,352 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1087268.0, ans=0.1 2023-10-12 14:33:46,580 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.77 vs. limit=22.5 2023-10-12 14:33:53,582 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.11 vs. limit=15.0 2023-10-12 14:34:15,636 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1087361.3333333333, ans=0.0 2023-10-12 14:34:15,696 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1087361.3333333333, ans=0.0 2023-10-12 14:34:17,181 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.14 vs. limit=15.0 2023-10-12 14:34:29,916 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.05 vs. 
limit=22.5 2023-10-12 14:34:51,664 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1087548.0, ans=0.1 2023-10-12 14:35:06,081 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1087594.6666666667, ans=0.0 2023-10-12 14:35:06,367 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.63 vs. limit=15.0 2023-10-12 14:35:08,146 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1087594.6666666667, ans=0.04949747468305833 2023-10-12 14:35:18,089 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.711e+02 1.880e+02 2.052e+02 2.877e+02, threshold=3.760e+02, percent-clipped=0.0 2023-10-12 14:35:21,389 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1087688.0, ans=0.1 2023-10-12 14:35:46,298 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=1087734.6666666667, ans=0.02 2023-10-12 14:35:46,602 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.31 vs. limit=15.0 2023-10-12 14:36:05,449 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1087828.0, ans=0.1 2023-10-12 14:36:11,440 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.53 vs. limit=8.0 2023-10-12 14:36:16,309 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1087874.6666666667, ans=0.125 2023-10-12 14:36:28,093 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.71 vs. limit=12.0 2023-10-12 14:36:30,741 INFO [train.py:1031] (0/4) Epoch 18, batch 1000, loss[loss=0.1899, simple_loss=0.2856, pruned_loss=0.04711, over 16878.00 frames. ], tot_loss[loss=0.1936, simple_loss=0.2848, pruned_loss=0.05114, over 12962875.00 frames. ], batch size: 175, lr: 1.96e-03, grad_scale: 16.0 2023-10-12 14:36:32,370 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.51 vs. limit=5.0 2023-10-12 14:36:45,353 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1088014.6666666667, ans=0.125 2023-10-12 14:36:46,905 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.81 vs. 
2023-10-12 14:36:46,905 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.81 vs. limit=22.5 2023-10-12 14:36:59,778 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1088061.3333333333, ans=0.125 2023-10-12 14:37:04,014 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1088108.0, ans=0.0 2023-10-12 14:37:06,525 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 14:37:09,001 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.670e+02 1.852e+02 2.080e+02 2.940e+02, threshold=3.705e+02, percent-clipped=0.0 2023-10-12 14:37:09,269 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1088108.0, ans=0.0 2023-10-12 14:37:20,222 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1088154.6666666667, ans=0.125 2023-10-12 14:37:37,581 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1088248.0, ans=0.0 2023-10-12 14:37:54,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1088341.3333333333, ans=0.0 2023-10-12 14:37:54,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1088341.3333333333, ans=0.125 2023-10-12 14:38:00,525 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1088341.3333333333, ans=0.05 2023-10-12 14:38:00,617 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1088341.3333333333, ans=0.2 2023-10-12 14:38:07,350 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1088388.0, ans=0.5 2023-10-12 14:38:34,151 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1088481.3333333333, ans=0.125 2023-10-12 14:38:57,247 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.08 vs. limit=12.0 2023-10-12 14:38:59,204 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.379e+02 1.774e+02 1.916e+02 2.114e+02 3.248e+02, threshold=3.832e+02, percent-clipped=0.0 2023-10-12 14:38:59,576 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1088574.6666666667, ans=0.125 2023-10-12 14:39:07,326 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1088621.3333333333, ans=0.125 2023-10-12 14:39:31,822 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.32 vs. limit=15.0 2023-10-12 14:40:35,023 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1088948.0, ans=0.1
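
Each optim.py:471 line reports the minimum, 25%, median, 75%, and maximum of recently observed gradient norms, and the logged threshold is consistently Clipping_scale times the median (just above, 2.0 * 1.916e+02 = 3.832e+02); percent-clipped then says how often the current norm exceeded that threshold. A sketch of this bookkeeping follows; the window size and the exact clipping rule are assumptions, and only the threshold = scale * median relation is read off the log.

    import torch

    def clip_grad_with_stats(params, history, clipping_scale=2.0, window=512):
        # history: a plain list of recent gradient norms, kept by the caller.
        params = [p for p in params if p.grad is not None]
        norm = torch.sqrt(sum((p.grad ** 2).sum() for p in params))
        history.append(norm.item())
        del history[:-window]  # sliding window of recent norms
        q = torch.quantile(torch.tensor(history),
                           torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = clipping_scale * q[2].item()  # 2.0 * median, as logged
        clipped = norm.item() > threshold
        if clipped:
            for p in params:
                p.grad.mul_(threshold / norm.item())
        return q.tolist(), threshold, clipped

    # Usage: keep `history = []` across steps and call once per backward pass;
    # the returned quartiles/threshold are what the log line prints.
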
2023-10-12 14:40:47,545 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.09 vs. limit=6.0 2023-10-12 14:40:48,578 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=9.93 vs. limit=22.5 2023-10-12 14:40:58,863 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.399e+02 1.687e+02 1.838e+02 2.024e+02 2.738e+02, threshold=3.677e+02, percent-clipped=0.0 2023-10-12 14:41:17,192 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1089134.6666666667, ans=0.0 2023-10-12 14:41:20,362 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1089134.6666666667, ans=0.2 2023-10-12 14:41:42,303 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1089228.0, ans=0.0 2023-10-12 14:41:45,838 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1089274.6666666667, ans=0.125 2023-10-12 14:41:45,846 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1089274.6666666667, ans=0.125 2023-10-12 14:41:46,810 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1089274.6666666667, ans=0.0 2023-10-12 14:41:59,106 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1089321.3333333333, ans=0.0 2023-10-12 14:42:19,699 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1089414.6666666667, ans=0.0
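
The scaling.py:979 Whitening lines compare a measured whiteness statistic of a module's activations (per group of channels) against that module's configured limit; when the metric exceeds the limit a corrective penalty is applied in the backward pass, which is why most logged metrics hover near or below their limits. One plausible form of such a metric, stated here as an assumption about the exact formula: with C the d-by-d covariance of a group's features, metric = d * tr(C @ C) / tr(C)**2, which equals 1.0 when C is a multiple of the identity and grows when the variance concentrates in a few directions.

    import torch

    def whitening_metric(x, num_groups):
        # x: (num_frames, num_channels); the metric is averaged over groups.
        num_frames, num_channels = x.shape
        d = num_channels // num_groups
        x = x.reshape(num_frames, num_groups, d)
        x = x - x.mean(dim=0, keepdim=True)  # center each channel
        total = 0.0
        for g in range(num_groups):
            xg = x[:, g, :]
            cov = (xg.t() @ xg) / num_frames  # (d, d) covariance estimate
            total += (d * torch.trace(cov @ cov) / torch.trace(cov) ** 2).item()
        return total / num_groups

    # Near-white features score close to the ideal value of 1.0:
    assert abs(whitening_metric(torch.randn(8192, 192), num_groups=1) - 1.0) < 0.1
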
2023-10-12 14:42:22,364 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.24 vs. limit=22.5 2023-10-12 14:42:26,070 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1089414.6666666667, ans=0.5 2023-10-12 14:42:27,357 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 14:42:50,638 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.708e+02 1.868e+02 1.993e+02 2.854e+02, threshold=3.735e+02, percent-clipped=0.0 2023-10-12 14:43:18,300 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1089648.0, ans=0.0 2023-10-12 14:43:24,278 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1089694.6666666667, ans=0.025 2023-10-12 14:43:41,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1089741.3333333333, ans=0.125 2023-10-12 14:43:43,962 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1089741.3333333333, ans=0.125 2023-10-12 14:43:49,337 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1089788.0, ans=0.1 2023-10-12 14:44:40,783 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.359e+02 1.697e+02 1.832e+02 2.032e+02 3.557e+02, threshold=3.665e+02, percent-clipped=0.0 2023-10-12 14:44:54,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1090021.3333333333, ans=0.1 2023-10-12 14:45:04,315 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1090068.0, ans=0.125 2023-10-12 14:45:04,413 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 14:45:05,550 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1090114.6666666667, ans=0.125 2023-10-12 14:45:05,815 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.00 vs. limit=15.0 2023-10-12 14:45:34,785 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1090208.0, ans=0.2 2023-10-12 14:45:36,243 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1090208.0, ans=0.2 2023-10-12 14:45:46,957 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1090254.6666666667, ans=0.1 2023-10-12 14:45:52,538 INFO [train.py:1031] (0/4) Epoch 18, batch 1500, loss[loss=0.2425, simple_loss=0.3063, pruned_loss=0.08939, over 15569.00 frames. ], tot_loss[loss=0.1919, simple_loss=0.283, pruned_loss=0.05041, over 17354130.91 frames. ], batch size: 350, lr: 1.95e-03, grad_scale: 16.0
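
The grad_scale field in the summary just above is the fp16 loss-scaling factor, not a model quantity. Within this section it doubles from 16.0 to 32.0 by batch 2000, is back at 16.0 by batch 2500, and reaches 32.0 again by batch 4000; that up-and-down pattern is characteristic of dynamic loss scaling, which grows the scale after a run of overflow-free steps and halves it when an inf/nan gradient is detected. A sketch with PyTorch's GradScaler (the recipe may wrap this differently, and the constructor arguments shown are PyTorch defaults, not values read from this run):

    import torch

    scaler = torch.cuda.amp.GradScaler(growth_factor=2.0, backoff_factor=0.5,
                                       growth_interval=2000)

    def train_step(model, optimizer, features, loss_fn):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = loss_fn(model(features))
        scaler.scale(loss).backward()  # backward through the scaled loss
        scaler.step(optimizer)         # unscales; skips the step on inf/nan
        scaler.update()                # doubles or halves the scale
        return loss.item(), scaler.get_scale()  # the logged grad_scale
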
2023-10-12 14:45:57,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1090301.3333333333, ans=0.1 2023-10-12 14:46:01,193 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.90 vs. limit=12.0 2023-10-12 14:46:05,753 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.min_positive, batch_count=1090348.0, ans=0.025 2023-10-12 14:46:06,673 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1090348.0, ans=0.125 2023-10-12 14:46:33,318 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.486e+02 1.715e+02 1.909e+02 2.124e+02 2.625e+02, threshold=3.819e+02, percent-clipped=0.0 2023-10-12 14:46:35,574 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1090488.0, ans=0.125 2023-10-12 14:46:40,541 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1090488.0, ans=0.0 2023-10-12 14:46:41,285 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1090488.0, ans=0.125 2023-10-12 14:47:00,034 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1090581.3333333333, ans=0.125
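
The scaling.py:1069 WithLoss lines report the summed value of a small auxiliary penalty attached to a layer's attention weights; loss-sum=0.000e+00 means the penalty was inactive over the logging window, while one later entry in this section shows it firing with loss-sum=2.089e-02. A sketch of such a penalty as an extra loss term, with the threshold, scale, and functional form all assumptions:

    import torch

    def attention_penalty(attn_weights, limit=1000.0, scale=1.0e-04):
        # Penalize only the part of each weight above the limit; in-range
        # weights contribute exactly 0, matching loss-sum=0.000e+00.
        excess = (attn_weights.abs() - limit).clamp(min=0.0)
        return scale * excess.sum()

    attn = torch.randn(4, 8, 100, 100, requires_grad=True)
    aux = attention_penalty(attn)
    print(f"loss-sum={aux.item():.3e}")  # 0.000e+00 here
    # total = main_loss + aux  # gradients flow only through the excess part
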
2023-10-12 14:47:16,558 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.99 vs. limit=15.0 2023-10-12 14:47:20,708 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1090628.0, ans=0.1 2023-10-12 14:47:33,901 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1090721.3333333333, ans=0.0 2023-10-12 14:47:40,137 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 14:47:46,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1090768.0, ans=0.125 2023-10-12 14:47:57,652 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1090814.6666666667, ans=0.125 2023-10-12 14:47:59,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1090814.6666666667, ans=0.0 2023-10-12 14:48:02,810 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1090814.6666666667, ans=0.2 2023-10-12 14:48:05,316 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1090814.6666666667, ans=0.0 2023-10-12 14:48:06,120 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1090814.6666666667, ans=0.125 2023-10-12 14:48:26,975 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.501e+02 1.774e+02 1.886e+02 2.156e+02 3.073e+02, threshold=3.772e+02, percent-clipped=0.0 2023-10-12 14:48:34,737 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.14 vs. limit=15.0 2023-10-12 14:48:47,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1091001.3333333333, ans=0.0 2023-10-12 14:48:50,330 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1091001.3333333333, ans=0.125 2023-10-12 14:48:58,607 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1091048.0, ans=0.125 2023-10-12 14:49:05,405 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1091048.0, ans=0.125 2023-10-12 14:49:13,420 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1091094.6666666667, ans=0.125 2023-10-12 14:49:14,357 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1091094.6666666667, ans=0.07 2023-10-12 14:49:16,448 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1091094.6666666667, ans=0.125 2023-10-12 14:49:23,879 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.19 vs.
limit=15.0 2023-10-12 14:49:26,404 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1091141.3333333333, ans=0.125 2023-10-12 14:49:27,140 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 14:49:32,281 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1091188.0, ans=0.125 2023-10-12 14:49:38,086 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=1091188.0, ans=0.02 2023-10-12 14:49:50,651 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1091281.3333333333, ans=0.125 2023-10-12 14:50:11,763 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1091374.6666666667, ans=0.2 2023-10-12 14:50:19,511 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.793e+02 2.008e+02 2.390e+02 3.613e+02, threshold=4.016e+02, percent-clipped=0.0 2023-10-12 14:50:24,515 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=4.89 vs. limit=15.0 2023-10-12 14:50:33,740 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1091468.0, ans=0.125 2023-10-12 14:50:58,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1091561.3333333333, ans=0.0 2023-10-12 14:51:00,470 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1091561.3333333333, ans=0.125 2023-10-12 14:51:08,449 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.51 vs. limit=12.0 2023-10-12 14:51:17,593 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1091608.0, ans=0.1 2023-10-12 14:51:22,980 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1091654.6666666667, ans=0.125 2023-10-12 14:51:33,479 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1091701.3333333333, ans=0.125 2023-10-12 14:51:37,291 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1091701.3333333333, ans=0.125 2023-10-12 14:51:53,754 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.92 vs. limit=22.5 2023-10-12 14:51:55,560 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.55 vs. 
limit=15.0 2023-10-12 14:51:56,370 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1091794.6666666667, ans=0.0 2023-10-12 14:52:01,056 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1091794.6666666667, ans=0.95 2023-10-12 14:52:03,698 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1091841.3333333333, ans=0.035 2023-10-12 14:52:12,840 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.722e+02 1.895e+02 2.159e+02 2.986e+02, threshold=3.789e+02, percent-clipped=0.0 2023-10-12 14:52:15,058 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1091888.0, ans=0.0 2023-10-12 14:52:18,590 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.089e-02 2023-10-12 14:52:45,768 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1091981.3333333333, ans=0.2 2023-10-12 14:52:50,592 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1092028.0, ans=0.1 2023-10-12 14:52:53,998 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1092028.0, ans=0.0 2023-10-12 14:52:57,256 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1092028.0, ans=0.1 2023-10-12 14:52:58,297 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1092074.6666666667, ans=0.0 2023-10-12 14:53:02,535 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1092074.6666666667, ans=0.0 2023-10-12 14:53:17,479 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1092121.3333333333, ans=0.07 2023-10-12 14:53:38,126 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.10 vs. limit=10.0 2023-10-12 14:54:07,161 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.422e+02 1.753e+02 1.939e+02 2.143e+02 3.142e+02, threshold=3.878e+02, percent-clipped=0.0 2023-10-12 14:54:16,168 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.43 vs. limit=15.0 2023-10-12 14:54:20,306 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.66 vs. 
limit=15.0 2023-10-12 14:54:26,171 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1092401.3333333333, ans=10.0 2023-10-12 14:54:39,351 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1092448.0, ans=0.2 2023-10-12 14:54:43,825 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1092448.0, ans=0.09899494936611666 2023-10-12 14:55:05,538 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1092541.3333333333, ans=0.0 2023-10-12 14:55:10,884 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1092541.3333333333, ans=0.125 2023-10-12 14:55:23,845 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.32 vs. limit=15.0 2023-10-12 14:55:24,837 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1092588.0, ans=0.0 2023-10-12 14:55:28,246 INFO [train.py:1031] (0/4) Epoch 18, batch 2000, loss[loss=0.1956, simple_loss=0.2892, pruned_loss=0.05101, over 16540.00 frames. ], tot_loss[loss=0.1925, simple_loss=0.2834, pruned_loss=0.05078, over 20729653.18 frames. ], batch size: 266, lr: 1.95e-03, grad_scale: 32.0 2023-10-12 14:55:39,382 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1092681.3333333333, ans=0.125 2023-10-12 14:56:05,591 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, batch_count=1092728.0, ans=0.05 2023-10-12 14:56:19,460 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.464e+02 1.736e+02 1.889e+02 2.096e+02 2.651e+02, threshold=3.779e+02, percent-clipped=0.0 2023-10-12 14:56:23,531 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1092821.3333333333, ans=0.125 2023-10-12 14:56:29,153 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=5.81 vs. limit=15.0 2023-10-12 14:57:06,036 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.13 vs. limit=10.0 2023-10-12 14:57:06,594 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 14:57:14,336 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.86 vs. 
limit=22.5 2023-10-12 14:57:52,150 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1093148.0, ans=0.125 2023-10-12 14:58:33,587 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1093241.3333333333, ans=0.0 2023-10-12 14:58:36,764 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.716e+02 1.851e+02 2.138e+02 3.189e+02, threshold=3.703e+02, percent-clipped=0.0 2023-10-12 14:59:07,319 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1093381.3333333333, ans=0.5 2023-10-12 14:59:19,373 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1093428.0, ans=0.0 2023-10-12 14:59:20,625 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.09 vs. limit=15.0 2023-10-12 14:59:23,241 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1093428.0, ans=0.0 2023-10-12 14:59:27,975 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1093474.6666666667, ans=0.0 2023-10-12 14:59:37,432 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1093474.6666666667, ans=0.125 2023-10-12 14:59:50,119 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.67 vs. limit=6.0 2023-10-12 14:59:57,356 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1093568.0, ans=0.95 2023-10-12 14:59:57,462 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1093568.0, ans=0.125 2023-10-12 14:59:57,505 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1093568.0, ans=0.1 2023-10-12 14:59:58,451 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1093568.0, ans=0.125 2023-10-12 15:00:01,503 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1093614.6666666667, ans=0.125 2023-10-12 15:00:02,729 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.38 vs. 
limit=12.0 2023-10-12 15:00:10,246 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1093614.6666666667, ans=0.0 2023-10-12 15:00:27,358 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1093708.0, ans=0.125 2023-10-12 15:00:30,337 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1093708.0, ans=0.1 2023-10-12 15:00:31,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=1093708.0, ans=0.1 2023-10-12 15:00:33,392 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 1.770e+02 1.918e+02 2.258e+02 2.909e+02, threshold=3.835e+02, percent-clipped=0.0 2023-10-12 15:00:33,750 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1093708.0, ans=0.1 2023-10-12 15:00:34,439 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1093754.6666666667, ans=0.125 2023-10-12 15:00:43,854 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1093754.6666666667, ans=0.125 2023-10-12 15:00:50,316 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1093801.3333333333, ans=15.0 2023-10-12 15:00:54,472 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.80 vs. limit=22.5 2023-10-12 15:00:56,255 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1093848.0, ans=0.0 2023-10-12 15:01:03,213 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1093848.0, ans=0.125 2023-10-12 15:01:04,165 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1093848.0, ans=0.05 2023-10-12 15:01:24,848 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.72 vs. limit=8.0 2023-10-12 15:01:40,292 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.83 vs. limit=15.0 2023-10-12 15:01:42,529 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.29 vs. limit=15.0 2023-10-12 15:02:11,650 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1094174.6666666667, ans=0.125 2023-10-12 15:02:15,590 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.20 vs. 
limit=15.0 2023-10-12 15:02:21,119 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 1.758e+02 1.953e+02 2.167e+02 4.263e+02, threshold=3.905e+02, percent-clipped=1.0 2023-10-12 15:02:21,482 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1094174.6666666667, ans=0.125 2023-10-12 15:02:42,710 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1094268.0, ans=0.2 2023-10-12 15:02:46,484 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.72 vs. limit=15.0 2023-10-12 15:03:06,264 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1094361.3333333333, ans=0.0 2023-10-12 15:03:24,416 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.62 vs. limit=15.0 2023-10-12 15:03:35,819 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1094501.3333333333, ans=0.125 2023-10-12 15:03:43,750 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1094548.0, ans=0.125 2023-10-12 15:03:46,288 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1094548.0, ans=0.125 2023-10-12 15:04:09,213 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1094641.3333333333, ans=6.0 2023-10-12 15:04:10,958 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.468e+02 1.781e+02 1.923e+02 2.146e+02 3.420e+02, threshold=3.845e+02, percent-clipped=0.0 2023-10-12 15:04:21,684 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1094688.0, ans=0.125 2023-10-12 15:04:23,560 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1094734.6666666667, ans=0.125 2023-10-12 15:04:23,693 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 15:04:35,834 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.53 vs. limit=22.5 2023-10-12 15:04:45,966 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1094828.0, ans=0.125 2023-10-12 15:04:51,373 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1094828.0, ans=0.5 2023-10-12 15:04:54,363 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1094874.6666666667, ans=0.125 2023-10-12 15:05:15,043 INFO [train.py:1031] (0/4) Epoch 18, batch 2500, loss[loss=0.1955, simple_loss=0.287, pruned_loss=0.05201, over 16751.00 frames. ], tot_loss[loss=0.1927, simple_loss=0.2836, pruned_loss=0.05094, over 23396160.41 frames. 
], batch size: 202, lr: 1.95e-03, grad_scale: 16.0 2023-10-12 15:05:21,499 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.92 vs. limit=15.0 2023-10-12 15:05:22,236 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1094968.0, ans=0.125 2023-10-12 15:05:31,219 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1095014.6666666667, ans=0.0 2023-10-12 15:05:46,737 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1095108.0, ans=0.1 2023-10-12 15:05:57,250 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.326e+02 1.823e+02 1.958e+02 2.161e+02 2.761e+02, threshold=3.915e+02, percent-clipped=0.0 2023-10-12 15:06:02,552 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1095154.6666666667, ans=0.0 2023-10-12 15:06:06,747 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1095154.6666666667, ans=0.1 2023-10-12 15:06:10,549 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.58 vs. limit=22.5 2023-10-12 15:06:14,975 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1095201.3333333333, ans=0.125 2023-10-12 15:06:21,216 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1095248.0, ans=0.0 2023-10-12 15:06:22,989 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1095248.0, ans=0.125 2023-10-12 15:06:23,106 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1095248.0, ans=0.125 2023-10-12 15:06:34,986 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1095294.6666666667, ans=0.125 2023-10-12 15:06:36,070 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.60 vs. limit=15.0 2023-10-12 15:06:57,207 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1095388.0, ans=0.125 2023-10-12 15:06:57,339 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1095388.0, ans=0.125 2023-10-12 15:07:28,681 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.65 vs. limit=15.0 2023-10-12 15:07:33,088 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1095528.0, ans=0.0 2023-10-12 15:07:33,120 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1095528.0, ans=0.125 2023-10-12 15:07:43,387 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.79 vs. 
limit=15.0 2023-10-12 15:07:46,379 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.76 vs. limit=15.0 2023-10-12 15:07:47,178 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.484e+02 1.787e+02 1.977e+02 2.203e+02 3.425e+02, threshold=3.954e+02, percent-clipped=0.0 2023-10-12 15:08:03,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1095668.0, ans=0.125 2023-10-12 15:08:14,614 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1095714.6666666667, ans=0.125 2023-10-12 15:08:16,808 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 15:08:29,346 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1095808.0, ans=0.0 2023-10-12 15:08:34,326 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1095808.0, ans=0.2 2023-10-12 15:08:40,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_ff3.min_abs, batch_count=1095854.6666666667, ans=0.2 2023-10-12 15:08:43,641 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.36 vs. limit=15.0 2023-10-12 15:08:46,446 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1095854.6666666667, ans=0.125 2023-10-12 15:08:59,412 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.09 vs. limit=15.0 2023-10-12 15:09:05,828 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.38 vs. limit=15.0 2023-10-12 15:09:10,478 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1095948.0, ans=0.0 2023-10-12 15:09:22,747 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1095994.6666666667, ans=0.125 2023-10-12 15:09:35,230 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1096041.3333333333, ans=0.0 2023-10-12 15:09:39,304 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.720e+02 1.848e+02 2.077e+02 3.247e+02, threshold=3.696e+02, percent-clipped=0.0 2023-10-12 15:09:51,911 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.23 vs. 
limit=10.0 2023-10-12 15:09:54,423 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1096134.6666666667, ans=0.1 2023-10-12 15:11:27,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1096508.0, ans=0.125 2023-10-12 15:11:40,796 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.692e+02 1.857e+02 2.098e+02 2.936e+02, threshold=3.715e+02, percent-clipped=0.0 2023-10-12 15:11:43,925 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1096554.6666666667, ans=0.125 2023-10-12 15:11:48,160 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1096554.6666666667, ans=0.2 2023-10-12 15:11:53,192 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1096601.3333333333, ans=0.0 2023-10-12 15:12:16,824 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1096648.0, ans=15.0 2023-10-12 15:12:23,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1096694.6666666667, ans=0.0 2023-10-12 15:13:18,826 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1096928.0, ans=0.1 2023-10-12 15:13:31,797 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1096974.6666666667, ans=0.05 2023-10-12 15:13:32,860 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.22 vs. limit=15.0 2023-10-12 15:13:39,796 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1096974.6666666667, ans=0.0 2023-10-12 15:13:42,280 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.388e+02 1.685e+02 1.929e+02 2.097e+02 2.599e+02, threshold=3.858e+02, percent-clipped=0.0 2023-10-12 15:14:03,120 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1097114.6666666667, ans=0.0 2023-10-12 15:14:06,854 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1097114.6666666667, ans=0.125 2023-10-12 15:14:11,801 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1097114.6666666667, ans=0.0 2023-10-12 15:14:41,569 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.61 vs. limit=22.5 2023-10-12 15:14:44,988 INFO [train.py:1031] (0/4) Epoch 18, batch 3000, loss[loss=0.2108, simple_loss=0.2974, pruned_loss=0.06208, over 16932.00 frames. ], tot_loss[loss=0.1921, simple_loss=0.2829, pruned_loss=0.05067, over 25526122.62 frames. 
], batch size: 110, lr: 1.95e-03, grad_scale: 16.0 2023-10-12 15:14:54,660 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1097301.3333333333, ans=0.5 2023-10-12 15:15:05,477 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.76 vs. limit=15.0 2023-10-12 15:15:26,209 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1097441.3333333333, ans=0.125 2023-10-12 15:15:27,755 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.456e+02 1.735e+02 1.860e+02 2.086e+02 2.719e+02, threshold=3.719e+02, percent-clipped=0.0 2023-10-12 15:15:32,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1097488.0, ans=0.2 2023-10-12 15:15:45,094 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1097534.6666666667, ans=0.125 2023-10-12 15:15:48,041 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1097534.6666666667, ans=0.0 2023-10-12 15:15:48,125 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1097534.6666666667, ans=0.2 2023-10-12 15:15:58,092 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1097581.3333333333, ans=0.125 2023-10-12 15:16:01,189 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=6.53 vs. limit=15.0 2023-10-12 15:16:04,599 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.97 vs. limit=12.0 2023-10-12 15:16:14,404 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1097674.6666666667, ans=0.125 2023-10-12 15:16:33,384 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1097721.3333333333, ans=0.0 2023-10-12 15:16:44,624 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1097768.0, ans=0.0 2023-10-12 15:17:27,326 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.374e+02 1.718e+02 1.888e+02 2.090e+02 3.235e+02, threshold=3.777e+02, percent-clipped=0.0 2023-10-12 15:17:48,263 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1098048.0, ans=0.125 2023-10-12 15:18:01,717 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.24 vs. limit=15.0 2023-10-12 15:18:08,364 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.67 vs. limit=22.5 2023-10-12 15:18:09,753 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1098141.3333333333, ans=0.0 2023-10-12 15:18:12,020 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.56 vs. 
limit=22.5 2023-10-12 15:18:21,443 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.07 vs. limit=6.0 2023-10-12 15:18:31,209 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.67 vs. limit=22.5 2023-10-12 15:18:31,679 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1098234.6666666667, ans=0.0 2023-10-12 15:18:33,651 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.24 vs. limit=10.0 2023-10-12 15:18:39,500 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.33 vs. limit=22.5 2023-10-12 15:19:23,035 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.459e+02 1.754e+02 1.912e+02 2.118e+02 2.880e+02, threshold=3.825e+02, percent-clipped=0.0 2023-10-12 15:19:27,671 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1098421.3333333333, ans=0.0 2023-10-12 15:19:32,203 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.86 vs. limit=15.0 2023-10-12 15:19:57,875 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 15:20:00,430 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.57 vs. limit=22.5 2023-10-12 15:20:38,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1098701.3333333333, ans=0.0 2023-10-12 15:20:44,065 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1098701.3333333333, ans=0.125 2023-10-12 15:21:04,587 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1098794.6666666667, ans=0.1 2023-10-12 15:21:05,723 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.81 vs. 
limit=15.0 2023-10-12 15:21:08,411 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1098841.3333333333, ans=0.035 2023-10-12 15:21:09,448 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1098841.3333333333, ans=0.0 2023-10-12 15:21:12,290 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1098841.3333333333, ans=0.1 2023-10-12 15:21:13,260 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1098841.3333333333, ans=0.125 2023-10-12 15:21:16,886 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1098888.0, ans=0.125 2023-10-12 15:21:17,465 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.805e+02 2.027e+02 2.270e+02 3.294e+02, threshold=4.055e+02, percent-clipped=0.0 2023-10-12 15:21:28,263 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.08 vs. limit=15.0 2023-10-12 15:21:31,778 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1098934.6666666667, ans=0.125 2023-10-12 15:21:35,539 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1098934.6666666667, ans=0.0 2023-10-12 15:21:50,119 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1099028.0, ans=0.09899494936611666 2023-10-12 15:22:28,043 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 15:22:29,677 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 15:22:30,746 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1099168.0, ans=0.0 2023-10-12 15:22:32,552 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1099168.0, ans=0.125 2023-10-12 15:22:40,658 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1099214.6666666667, ans=0.125 2023-10-12 15:22:56,693 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1099261.3333333333, ans=0.1 2023-10-12 15:22:57,757 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.34 vs. limit=15.0 2023-10-12 15:23:01,296 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.09 vs. 
limit=22.5 2023-10-12 15:23:11,831 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.458e+02 1.736e+02 1.877e+02 2.090e+02 3.044e+02, threshold=3.754e+02, percent-clipped=0.0 2023-10-12 15:23:23,632 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1099401.3333333333, ans=0.0 2023-10-12 15:23:23,772 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1099401.3333333333, ans=0.125 2023-10-12 15:23:31,491 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1099448.0, ans=0.0 2023-10-12 15:23:57,441 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.54 vs. limit=22.5 2023-10-12 15:24:02,682 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.60 vs. limit=15.0 2023-10-12 15:24:12,860 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1099588.0, ans=0.0 2023-10-12 15:24:15,567 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=1099634.6666666667, ans=0.05 2023-10-12 15:24:16,838 INFO [train.py:1031] (0/4) Epoch 18, batch 3500, loss[loss=0.1764, simple_loss=0.2745, pruned_loss=0.0391, over 16860.00 frames. ], tot_loss[loss=0.192, simple_loss=0.2826, pruned_loss=0.05067, over 27125891.09 frames. ], batch size: 104, lr: 1.94e-03, grad_scale: 16.0 2023-10-12 15:24:43,912 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1099728.0, ans=0.125 2023-10-12 15:24:50,197 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=11.87 vs. limit=22.5 2023-10-12 15:24:56,297 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1099774.6666666667, ans=0.0 2023-10-12 15:24:59,910 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.741e+02 1.896e+02 2.170e+02 3.188e+02, threshold=3.792e+02, percent-clipped=0.0 2023-10-12 15:25:34,455 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1099961.3333333333, ans=0.2 2023-10-12 15:25:49,561 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1099961.3333333333, ans=0.125 2023-10-12 15:25:53,912 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1100008.0, ans=0.125 2023-10-12 15:26:07,461 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1100054.6666666667, ans=0.09899494936611666 2023-10-12 15:26:10,292 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1100054.6666666667, ans=0.07 2023-10-12 15:26:15,417 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=7.77 vs. 
limit=22.5 2023-10-12 15:26:21,411 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1100101.3333333333, ans=0.125 2023-10-12 15:26:24,165 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1100148.0, ans=0.2 2023-10-12 15:26:25,268 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1100148.0, ans=0.09899494936611666 2023-10-12 15:26:27,857 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1100148.0, ans=0.0 2023-10-12 15:26:29,008 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1100148.0, ans=0.125 2023-10-12 15:26:59,451 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.724e+02 1.916e+02 2.174e+02 3.079e+02, threshold=3.832e+02, percent-clipped=0.0 2023-10-12 15:27:19,138 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1100334.6666666667, ans=0.1 2023-10-12 15:27:38,237 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1100428.0, ans=0.125 2023-10-12 15:27:52,545 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1100521.3333333333, ans=0.125 2023-10-12 15:27:55,151 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1100521.3333333333, ans=0.0 2023-10-12 15:27:58,462 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1100521.3333333333, ans=0.04949747468305833 2023-10-12 15:28:23,181 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1100614.6666666667, ans=0.125 2023-10-12 15:28:36,619 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.71 vs. limit=6.0 2023-10-12 15:28:49,662 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.54 vs. 
limit=12.0 2023-10-12 15:28:56,975 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.740e+02 1.882e+02 2.049e+02 2.993e+02, threshold=3.765e+02, percent-clipped=0.0 2023-10-12 15:29:14,238 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1100801.3333333333, ans=0.125 2023-10-12 15:29:38,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1100894.6666666667, ans=0.1 2023-10-12 15:29:54,862 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1100988.0, ans=0.1 2023-10-12 15:29:57,791 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1100988.0, ans=0.0 2023-10-12 15:30:24,627 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1101081.3333333333, ans=0.125 2023-10-12 15:30:26,647 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1.whitening_limit, batch_count=1101128.0, ans=10.0 2023-10-12 15:30:37,904 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1101174.6666666667, ans=0.0 2023-10-12 15:30:48,552 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.404e+02 1.723e+02 1.873e+02 2.043e+02 2.937e+02, threshold=3.747e+02, percent-clipped=0.0 2023-10-12 15:30:59,419 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1101268.0, ans=0.125 2023-10-12 15:31:21,916 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1101361.3333333333, ans=0.1 2023-10-12 15:31:29,055 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1101361.3333333333, ans=0.2 2023-10-12 15:31:30,902 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1101408.0, ans=0.0 2023-10-12 15:31:37,138 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1101408.0, ans=0.125 2023-10-12 15:31:40,942 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1101408.0, ans=0.0 2023-10-12 15:32:02,197 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1101501.3333333333, ans=0.125 2023-10-12 15:32:14,049 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1101548.0, ans=0.0 2023-10-12 15:32:28,111 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1101641.3333333333, ans=0.125 2023-10-12 15:32:29,547 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.31 vs. 
limit=22.5 2023-10-12 15:32:33,995 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1101641.3333333333, ans=0.125 2023-10-12 15:32:35,642 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1101641.3333333333, ans=0.2 2023-10-12 15:32:36,852 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1101688.0, ans=0.07 2023-10-12 15:32:38,339 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.288e+02 1.687e+02 1.851e+02 2.056e+02 3.533e+02, threshold=3.703e+02, percent-clipped=0.0 2023-10-12 15:32:38,968 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.99 vs. limit=15.0 2023-10-12 15:32:42,859 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.77 vs. limit=8.0 2023-10-12 15:32:54,710 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_ff3.min_abs, batch_count=1101734.6666666667, ans=0.2 2023-10-12 15:33:05,351 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1101781.3333333333, ans=0.125 2023-10-12 15:33:06,844 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.04 vs. limit=15.0 2023-10-12 15:33:44,709 INFO [train.py:1031] (0/4) Epoch 18, batch 4000, loss[loss=0.2004, simple_loss=0.2895, pruned_loss=0.05564, over 17031.00 frames. ], tot_loss[loss=0.1919, simple_loss=0.2823, pruned_loss=0.05074, over 28384284.05 frames. ], batch size: 117, lr: 1.94e-03, grad_scale: 32.0 2023-10-12 15:34:15,520 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1102061.3333333333, ans=0.125 2023-10-12 15:34:15,534 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1102061.3333333333, ans=0.125 2023-10-12 15:34:20,435 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1102108.0, ans=0.125 2023-10-12 15:34:20,681 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.94 vs. 
limit=6.0 2023-10-12 15:34:27,866 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1102108.0, ans=0.125 2023-10-12 15:34:32,956 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.729e+02 1.841e+02 2.139e+02 3.085e+02, threshold=3.682e+02, percent-clipped=0.0 2023-10-12 15:34:45,769 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-12 15:35:11,707 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1102294.6666666667, ans=0.125 2023-10-12 15:35:23,864 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1102341.3333333333, ans=0.0 2023-10-12 15:35:33,688 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1102388.0, ans=0.2 2023-10-12 15:35:49,704 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1102481.3333333333, ans=0.1 2023-10-12 15:36:00,150 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1102528.0, ans=0.125 2023-10-12 15:36:25,653 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.820e+02 1.969e+02 2.213e+02 3.303e+02, threshold=3.938e+02, percent-clipped=0.0 2023-10-12 15:36:35,302 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1102668.0, ans=0.125 2023-10-12 15:36:40,571 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1102668.0, ans=0.125 2023-10-12 15:37:19,394 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-12 15:37:21,069 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1102808.0, ans=0.09899494936611666 2023-10-12 15:37:29,473 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1102808.0, ans=0.0 2023-10-12 15:37:44,181 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1102901.3333333333, ans=0.125 2023-10-12 15:37:55,429 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1102948.0, ans=0.1 2023-10-12 15:38:05,711 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1102994.6666666667, ans=0.2 2023-10-12 15:38:24,451 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1103041.3333333333, ans=0.125 2023-10-12 15:38:30,844 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.362e+02 1.717e+02 1.855e+02 2.118e+02 2.850e+02, threshold=3.710e+02, percent-clipped=0.0 2023-10-12 15:38:55,041 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1103181.3333333333, ans=0.125 2023-10-12 15:39:02,953 INFO [scaling.py:199] (0/4) ScheduledFloat: 
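name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1103228.0, ans=0.1

The ScheduledFloat record completed just above, like the hundreds of other [scaling.py:199] entries in this log, reports the current value (ans) of a schedule-controlled hyper-parameter (dropout probabilities, skip rates, balancer probabilities, whitening limits) at the given global batch_count. A minimal sketch, assuming the value is a piecewise-linear function of batch_count between (batch_count, value) breakpoints; the breakpoints below are invented for illustration and are not taken from the recipe:

    class ScheduledFloat:
        """A float hyper-parameter that is piecewise-linear in the global batch count."""
        def __init__(self, *points):
            # points: (batch_count, value) pairs, sorted by batch_count
            self.points = sorted(points)

        def value(self, batch_count: float) -> float:
            x0, y0 = self.points[0]
            if batch_count <= x0:
                return y0
            for x1, y1 in self.points[1:]:
                if batch_count <= x1:
                    # linear interpolation between the two enclosing breakpoints
                    t = (batch_count - x0) / (x1 - x0)
                    return y0 + t * (y1 - y0)
                x0, y0 = x1, y1
            return y0    # past the last breakpoint: hold the final value

    # e.g. a skip-rate decaying 0.2 -> 0.0 over the first 4000 batches; at the
    # batch counts in this log it would report ans=0.0, like the *_skip_rate entries:
    skip_rate = ScheduledFloat((0.0, 0.2), (4000.0, 0.0))
    print(skip_rate.value(1103228.0))   # 0.0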
2023-10-12 15:39:05,060 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.85 vs. limit=15.0 2023-10-12 15:39:06,143 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.08 vs. limit=15.0 2023-10-12 15:39:07,607 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1103228.0, ans=0.125 2023-10-12 15:39:25,658 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.24 vs. limit=10.0 2023-10-12 15:39:27,532 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.43 vs. limit=15.0 2023-10-12 15:39:36,091 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1103368.0, ans=0.05 2023-10-12 15:39:38,814 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1103368.0, ans=0.125 2023-10-12 15:39:55,529 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.78 vs. limit=22.5 2023-10-12 15:40:17,520 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.436e+02 1.727e+02 1.887e+02 2.055e+02 2.803e+02, threshold=3.775e+02, percent-clipped=0.0 2023-10-12 15:40:30,876 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1103601.3333333333, ans=0.0 2023-10-12 15:40:31,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1103601.3333333333, ans=0.125 2023-10-12 15:40:48,004 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1103694.6666666667, ans=0.1 2023-10-12 15:40:51,767 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1103694.6666666667, ans=0.125 2023-10-12 15:41:07,146 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1103741.3333333333, ans=0.125 2023-10-12 15:41:49,083 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1103928.0, ans=0.0 2023-10-12 15:41:49,090 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1103928.0, ans=0.125 2023-10-12 15:41:53,810 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.94 vs. 
limit=15.0 2023-10-12 15:42:17,201 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.799e+02 1.961e+02 2.144e+02 2.702e+02, threshold=3.923e+02, percent-clipped=0.0 2023-10-12 15:42:33,785 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1104068.0, ans=0.1 2023-10-12 15:43:25,093 INFO [train.py:1031] (0/4) Epoch 18, batch 4500, loss[loss=0.1682, simple_loss=0.258, pruned_loss=0.03923, over 16439.00 frames. ], tot_loss[loss=0.1919, simple_loss=0.2827, pruned_loss=0.05051, over 29384811.55 frames. ], batch size: 50, lr: 1.94e-03, grad_scale: 32.0 2023-10-12 15:43:33,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1104301.3333333333, ans=0.1 2023-10-12 15:43:39,701 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.84 vs. limit=6.0 2023-10-12 15:43:41,314 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1104348.0, ans=0.1 2023-10-12 15:43:44,870 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1104348.0, ans=0.0 2023-10-12 15:43:52,505 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1104394.6666666667, ans=0.0 2023-10-12 15:43:55,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1104394.6666666667, ans=0.1 2023-10-12 15:43:57,449 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1104441.3333333333, ans=0.5 2023-10-12 15:44:13,152 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.410e+02 1.752e+02 1.856e+02 2.073e+02 2.905e+02, threshold=3.711e+02, percent-clipped=0.0 2023-10-12 15:44:22,905 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.95 vs. limit=15.0 2023-10-12 15:44:23,724 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.76 vs. limit=15.0 2023-10-12 15:44:30,997 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1104581.3333333333, ans=0.0 2023-10-12 15:44:38,844 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.78 vs. 
limit=6.0 2023-10-12 15:44:47,323 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1104628.0, ans=0.0 2023-10-12 15:44:57,948 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1104674.6666666667, ans=0.125 2023-10-12 15:45:07,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1104721.3333333333, ans=0.05 2023-10-12 15:45:19,202 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1104768.0, ans=0.07 2023-10-12 15:45:26,949 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1104814.6666666667, ans=0.0 2023-10-12 15:45:27,096 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1104814.6666666667, ans=0.0 2023-10-12 15:45:28,952 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1104814.6666666667, ans=0.0 2023-10-12 15:45:43,201 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1104908.0, ans=0.1 2023-10-12 15:45:52,143 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.47 vs. limit=15.0 2023-10-12 15:45:52,922 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1104954.6666666667, ans=0.0 2023-10-12 15:45:57,106 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.805e+02 2.037e+02 2.265e+02 3.346e+02, threshold=4.074e+02, percent-clipped=0.0 2023-10-12 15:46:08,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1105001.3333333333, ans=0.125 2023-10-12 15:46:09,314 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.16 vs. limit=15.0 2023-10-12 15:46:17,292 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1105048.0, ans=0.1 2023-10-12 15:46:38,417 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.46 vs. 
limit=15.0 2023-10-12 15:46:52,377 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 15:46:58,238 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1105188.0, ans=0.125 2023-10-12 15:47:13,666 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.whiten.whitening_limit, batch_count=1105281.3333333333, ans=12.0 2023-10-12 15:47:44,590 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.445e+02 1.736e+02 1.986e+02 2.156e+02 3.327e+02, threshold=3.971e+02, percent-clipped=0.0 2023-10-12 15:47:49,889 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1105421.3333333333, ans=0.125 2023-10-12 15:47:51,762 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1105468.0, ans=0.2 2023-10-12 15:48:01,910 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1105514.6666666667, ans=0.1 2023-10-12 15:48:10,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1105514.6666666667, ans=0.2 2023-10-12 15:48:15,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1105561.3333333333, ans=0.125 2023-10-12 15:48:34,794 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1105654.6666666667, ans=0.0 2023-10-12 15:48:38,795 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1105654.6666666667, ans=0.1 2023-10-12 15:48:46,825 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1105701.3333333333, ans=0.125 2023-10-12 15:48:48,124 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.73 vs. 
limit=15.0 2023-10-12 15:48:50,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1105701.3333333333, ans=0.0 2023-10-12 15:49:37,411 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.739e+02 1.981e+02 2.234e+02 3.210e+02, threshold=3.962e+02, percent-clipped=0.0 2023-10-12 15:49:43,038 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1105934.6666666667, ans=0.1 2023-10-12 15:50:01,326 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1105981.3333333333, ans=0.2 2023-10-12 15:50:05,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1106028.0, ans=0.125 2023-10-12 15:50:15,321 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1106074.6666666667, ans=0.2 2023-10-12 15:50:17,525 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1106074.6666666667, ans=0.125 2023-10-12 15:50:29,343 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1106121.3333333333, ans=0.125 2023-10-12 15:50:31,012 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1106121.3333333333, ans=0.2 2023-10-12 15:50:36,605 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1106121.3333333333, ans=0.0 2023-10-12 15:50:56,203 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1106214.6666666667, ans=0.2 2023-10-12 15:51:13,853 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1106261.3333333333, ans=0.09899494936611666 2023-10-12 15:51:25,979 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1106308.0, ans=0.125 2023-10-12 15:51:33,417 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.758e+02 1.897e+02 2.168e+02 3.565e+02, threshold=3.793e+02, percent-clipped=0.0 2023-10-12 15:51:53,922 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1106448.0, ans=0.2 2023-10-12 15:51:57,757 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1106448.0, ans=0.07 2023-10-12 15:52:32,283 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1106634.6666666667, ans=0.1 2023-10-12 15:52:33,413 INFO [train.py:1031] (0/4) Epoch 18, batch 5000, loss[loss=0.2046, simple_loss=0.2903, pruned_loss=0.05948, over 16925.00 frames. ], tot_loss[loss=0.1921, simple_loss=0.2826, pruned_loss=0.05079, over 30119548.83 frames. 
], batch size: 138, lr: 1.94e-03, grad_scale: 16.0 2023-10-12 15:52:50,120 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1106681.3333333333, ans=0.2 2023-10-12 15:52:59,038 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1106728.0, ans=0.1 2023-10-12 15:53:04,757 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1106728.0, ans=0.125 2023-10-12 15:53:22,038 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.402e+02 1.721e+02 1.945e+02 2.208e+02 3.554e+02, threshold=3.890e+02, percent-clipped=0.0 2023-10-12 15:53:32,324 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1106868.0, ans=0.0 2023-10-12 15:53:46,964 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1106914.6666666667, ans=0.2 2023-10-12 15:53:47,273 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.27 vs. limit=22.5 2023-10-12 15:53:55,184 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.45 vs. limit=15.0 2023-10-12 15:54:03,674 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1107008.0, ans=0.2 2023-10-12 15:54:33,540 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1107101.3333333333, ans=0.125 2023-10-12 15:54:39,374 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1107148.0, ans=0.2 2023-10-12 15:54:45,388 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1107194.6666666667, ans=0.2 2023-10-12 15:54:50,528 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1107194.6666666667, ans=0.1 2023-10-12 15:55:05,634 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.61 vs. limit=15.0 2023-10-12 15:55:06,304 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1107288.0, ans=0.125 2023-10-12 15:55:11,797 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.741e+02 1.903e+02 2.102e+02 3.010e+02, threshold=3.805e+02, percent-clipped=0.0 2023-10-12 15:55:14,066 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1107288.0, ans=0.0 2023-10-12 15:55:26,845 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1107334.6666666667, ans=0.0 2023-10-12 15:55:33,254 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1107381.3333333333, ans=0.125 2023-10-12 15:55:44,615 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.64 vs. 
limit=15.0 2023-10-12 15:55:48,874 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1107474.6666666667, ans=0.125 2023-10-12 15:55:49,058 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.56 vs. limit=15.0 2023-10-12 15:56:02,241 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.65 vs. limit=12.0 2023-10-12 15:56:26,131 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1107614.6666666667, ans=0.0 2023-10-12 15:56:31,849 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1107661.3333333333, ans=0.1 2023-10-12 15:56:45,278 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1107708.0, ans=0.07 2023-10-12 15:56:53,329 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.92 vs. limit=15.0 2023-10-12 15:56:56,729 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.830e+02 2.006e+02 2.312e+02 2.939e+02, threshold=4.012e+02, percent-clipped=0.0 2023-10-12 15:56:57,972 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.min_positive, batch_count=1107754.6666666667, ans=0.025 2023-10-12 15:57:01,224 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1107754.6666666667, ans=0.0 2023-10-12 15:57:05,049 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1107801.3333333333, ans=15.0 2023-10-12 15:57:06,192 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1107801.3333333333, ans=0.025 2023-10-12 15:57:16,448 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1107848.0, ans=0.0 2023-10-12 15:57:29,615 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1107894.6666666667, ans=0.2 2023-10-12 15:57:55,853 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1107988.0, ans=0.125 2023-10-12 15:58:12,669 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1108081.3333333333, ans=0.125 2023-10-12 15:58:23,170 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.91 vs. 
limit=15.0 2023-10-12 15:58:25,483 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1108128.0, ans=0.125 2023-10-12 15:58:26,253 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1108128.0, ans=0.125 2023-10-12 15:58:26,436 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1108128.0, ans=0.2 2023-10-12 15:58:45,819 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1108174.6666666667, ans=0.125 2023-10-12 15:58:52,310 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-12 15:58:52,327 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1108221.3333333333, ans=0.125 2023-10-12 15:58:55,217 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.376e+02 1.654e+02 1.822e+02 2.013e+02 3.284e+02, threshold=3.644e+02, percent-clipped=0.0 2023-10-12 15:58:58,306 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1108221.3333333333, ans=0.0 2023-10-12 15:59:04,071 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1108268.0, ans=0.0 2023-10-12 15:59:05,034 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1108268.0, ans=0.0 2023-10-12 15:59:15,969 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1108314.6666666667, ans=0.125 2023-10-12 15:59:22,822 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.62 vs. 
limit=22.5 2023-10-12 15:59:36,406 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1108408.0, ans=0.125 2023-10-12 15:59:36,496 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1108408.0, ans=0.125 2023-10-12 15:59:51,665 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1108454.6666666667, ans=0.125 2023-10-12 16:00:01,043 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1108501.3333333333, ans=0.0 2023-10-12 16:00:23,593 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1108641.3333333333, ans=0.2 2023-10-12 16:00:28,642 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1108641.3333333333, ans=0.0 2023-10-12 16:00:40,604 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.464e+02 1.663e+02 1.798e+02 1.970e+02 2.793e+02, threshold=3.596e+02, percent-clipped=0.0 2023-10-12 16:00:50,285 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1108734.6666666667, ans=0.125 2023-10-12 16:00:52,798 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1108734.6666666667, ans=0.125 2023-10-12 16:00:52,822 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1108734.6666666667, ans=0.125 2023-10-12 16:01:09,298 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.max_abs, batch_count=1108828.0, ans=10.0 2023-10-12 16:01:18,668 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1108874.6666666667, ans=0.125 2023-10-12 16:01:28,734 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1108921.3333333333, ans=0.125 2023-10-12 16:01:39,437 INFO [train.py:1031] (0/4) Epoch 18, batch 5500, loss[loss=0.2027, simple_loss=0.2638, pruned_loss=0.07084, over 12631.00 frames. ], tot_loss[loss=0.1921, simple_loss=0.2826, pruned_loss=0.0508, over 30714548.03 frames. ], batch size: 440, lr: 1.94e-03, grad_scale: 16.0 2023-10-12 16:01:54,649 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.53 vs. limit=12.0 2023-10-12 16:02:12,940 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.max_positive, batch_count=1109108.0, ans=0.95 2023-10-12 16:02:25,226 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.484e+02 1.790e+02 1.968e+02 2.182e+02 3.087e+02, threshold=3.936e+02, percent-clipped=0.0 2023-10-12 16:02:30,266 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1109201.3333333333, ans=0.125 2023-10-12 16:02:36,193 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.66 vs. 
limit=15.0 2023-10-12 16:02:54,603 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1109294.6666666667, ans=0.125 2023-10-12 16:03:01,276 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1109294.6666666667, ans=0.125 2023-10-12 16:03:08,723 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1109341.3333333333, ans=0.125 2023-10-12 16:03:10,687 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-12 16:03:21,711 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1109388.0, ans=0.125 2023-10-12 16:03:23,256 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.12 vs. limit=15.0 2023-10-12 16:03:46,807 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1109481.3333333333, ans=0.125 2023-10-12 16:03:47,072 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.68 vs. limit=10.0 2023-10-12 16:03:59,217 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.46 vs. limit=15.0 2023-10-12 16:04:07,988 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1109574.6666666667, ans=0.0 2023-10-12 16:04:10,989 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1109621.3333333333, ans=0.125 2023-10-12 16:04:12,107 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1109621.3333333333, ans=0.125 2023-10-12 16:04:15,988 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1109621.3333333333, ans=0.0 2023-10-12 16:04:16,714 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.730e+02 1.930e+02 2.156e+02 3.440e+02, threshold=3.860e+02, percent-clipped=0.0 2023-10-12 16:04:20,856 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1109621.3333333333, ans=0.025 2023-10-12 16:04:24,933 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1109668.0, ans=0.2 2023-10-12 16:04:35,463 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1109714.6666666667, ans=0.0 2023-10-12 16:04:55,835 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1109808.0, ans=0.2 2023-10-12 16:05:14,241 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1109854.6666666667, ans=0.2 2023-10-12 16:05:15,688 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 16:05:21,345 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1109901.3333333333, ans=10.0 2023-10-12 16:05:44,322 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.39 vs. limit=22.5 2023-10-12 16:06:08,876 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.502e+02 1.782e+02 1.937e+02 2.101e+02 2.805e+02, threshold=3.874e+02, percent-clipped=0.0 2023-10-12 16:06:10,238 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1110088.0, ans=0.125 2023-10-12 16:06:13,460 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1110134.6666666667, ans=0.1 2023-10-12 16:06:22,080 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.30 vs. limit=6.0 2023-10-12 16:06:27,927 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1110181.3333333333, ans=0.125 2023-10-12 16:06:28,912 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1110181.3333333333, ans=0.125 2023-10-12 16:06:29,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1110181.3333333333, ans=0.125 2023-10-12 16:06:53,149 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1110274.6666666667, ans=0.125 2023-10-12 16:07:10,803 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1110368.0, ans=0.1 2023-10-12 16:07:12,663 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1110368.0, ans=0.125 2023-10-12 16:07:25,709 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1110414.6666666667, ans=0.125 2023-10-12 16:07:44,000 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1110461.3333333333, ans=0.2 2023-10-12 16:08:04,471 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.488e+02 1.755e+02 1.935e+02 2.257e+02 2.969e+02, threshold=3.869e+02, percent-clipped=0.0 2023-10-12 16:08:32,759 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.39 vs. 
limit=22.5 2023-10-12 16:08:48,243 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1110694.6666666667, ans=0.125 2023-10-12 16:08:51,485 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1110741.3333333333, ans=0.1 2023-10-12 16:08:53,490 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1110741.3333333333, ans=0.125 2023-10-12 16:09:01,555 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1110788.0, ans=0.09899494936611666 2023-10-12 16:09:09,315 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1110788.0, ans=0.0 2023-10-12 16:09:30,804 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1110881.3333333333, ans=0.125 2023-10-12 16:09:35,555 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1110881.3333333333, ans=0.2 2023-10-12 16:09:55,284 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1110974.6666666667, ans=0.125 2023-10-12 16:10:10,667 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.682e+02 1.873e+02 2.115e+02 2.878e+02, threshold=3.746e+02, percent-clipped=0.0 2023-10-12 16:10:11,126 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1111021.3333333333, ans=0.125 2023-10-12 16:10:28,538 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1111114.6666666667, ans=0.1 2023-10-12 16:10:28,671 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1111114.6666666667, ans=0.95 2023-10-12 16:10:33,068 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1111114.6666666667, ans=0.0 2023-10-12 16:10:40,725 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.71 vs. limit=15.0 2023-10-12 16:11:02,403 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.60 vs. limit=22.5 2023-10-12 16:11:09,092 INFO [train.py:1031] (0/4) Epoch 18, batch 6000, loss[loss=0.2583, simple_loss=0.3191, pruned_loss=0.09877, over 15510.00 frames. ], tot_loss[loss=0.1928, simple_loss=0.2832, pruned_loss=0.05119, over 31184235.72 frames. 
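], batch size: 350, lr: 1.93e-03, grad_scale: 32.0

A consistency check on the batch summary completed just above: throughout this section the logged loss equals half the simple_loss plus the pruned_loss, so the two pruned-transducer terms are evidently combined as loss = 0.5 * simple_loss + pruned_loss (the 0.5 weight is inferred from the logged numbers themselves, and the growing frame count in tot_loss[...] indicates a running average over the epoch rather than a single batch):

    # tot_loss at batch 6000 above: loss=0.1928, simple_loss=0.2832, pruned_loss=0.05119
    simple_loss, pruned_loss = 0.2832, 0.05119
    print(0.5 * simple_loss + pruned_loss)   # 0.19279 -> logged as loss=0.1928
    # the per-batch figures check out too: 0.5 * 0.3191 + 0.09877 = 0.25832 ~ 0.2583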
2023-10-12 16:11:21,432 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1111348.0, ans=0.0 2023-10-12 16:11:21,449 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 16:11:32,368 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1111394.6666666667, ans=0.07 2023-10-12 16:11:45,826 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1111441.3333333333, ans=0.125 2023-10-12 16:11:54,386 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1111441.3333333333, ans=0.125 2023-10-12 16:12:00,908 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.82 vs. limit=22.5 2023-10-12 16:12:06,250 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1111488.0, ans=0.0 2023-10-12 16:12:06,856 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.502e+02 1.733e+02 1.952e+02 2.091e+02 4.505e+02, threshold=3.904e+02, percent-clipped=1.0 2023-10-12 16:12:34,356 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.24 vs. limit=15.0 2023-10-12 16:12:53,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1111721.3333333333, ans=0.1 2023-10-12 16:13:19,964 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.34 vs. limit=15.0 2023-10-12 16:13:22,181 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.65 vs. 
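limit=15.0

The [optim.py:471] entry a few records above is the first in this section with a nonzero percent-clipped: the five quartile figures are the min / 25% / median / 75% / max of recent per-batch gradient norms, the threshold is Clipping_scale times the median (2.0 x 1.952e+02 = 3.904e+02, and the same 2x-median relation holds for every other Clipping_scale=2.0 entry here), and the 4.505e+02 maximum exceeded that threshold, so a small percentage of recent batches was actually clipped. A minimal sketch of this bookkeeping, assuming a plain ring buffer of norms; the names and buffer size are illustrative, not icefall's actual ScaledAdam implementation:

    import torch
    from collections import deque

    class GradNormClipper:
        """Clip gradients to clipping_scale * median of recent grad norms."""
        def __init__(self, clipping_scale=2.0, history=128):
            self.clipping_scale = clipping_scale
            self.norms = deque(maxlen=history)    # recent per-batch grad norms

        def __call__(self, params):
            params = [p for p in params if p.grad is not None]
            norm = torch.linalg.vector_norm(
                torch.stack([torch.linalg.vector_norm(p.grad) for p in params]))
            self.norms.append(norm.item())
            hist = sorted(self.norms)
            # the five numbers logged: min / 25% / median / 75% / max
            q = [hist[int(f * (len(hist) - 1))] for f in (0.0, 0.25, 0.5, 0.75, 1.0)]
            threshold = self.clipping_scale * q[2]   # e.g. 2.0 * 1.952e+02 = 3.904e+02
            if norm > threshold:                     # would count toward percent-clipped
                for p in params:
                    p.grad.mul_(threshold / norm)
            return q, threshold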
2023-10-12 16:13:34,880 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1111861.3333333333, ans=0.125 2023-10-12 16:13:36,837 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1111908.0, ans=0.025 2023-10-12 16:13:39,781 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 16:13:45,073 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1111908.0, ans=0.0 2023-10-12 16:13:55,180 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.432e+02 1.749e+02 1.858e+02 2.080e+02 3.062e+02, threshold=3.715e+02, percent-clipped=0.0 2023-10-12 16:14:19,084 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1112048.0, ans=0.125 2023-10-12 16:14:23,647 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1112094.6666666667, ans=0.05 2023-10-12 16:14:23,682 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1112094.6666666667, ans=0.07 2023-10-12 16:14:30,442 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1112094.6666666667, ans=0.125 2023-10-12 16:14:48,419 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1112188.0, ans=0.0 2023-10-12 16:14:52,189 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.44 vs. 
limit=15.0 2023-10-12 16:15:02,366 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1112234.6666666667, ans=0.125 2023-10-12 16:15:09,114 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1112281.3333333333, ans=0.0 2023-10-12 16:15:31,850 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1112374.6666666667, ans=0.125 2023-10-12 16:15:37,303 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1112374.6666666667, ans=0.125 2023-10-12 16:15:46,194 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1112421.3333333333, ans=0.125 2023-10-12 16:15:52,458 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.508e+02 1.761e+02 1.901e+02 2.096e+02 3.156e+02, threshold=3.802e+02, percent-clipped=0.0 2023-10-12 16:16:08,878 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1112514.6666666667, ans=0.125 2023-10-12 16:16:23,958 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1112561.3333333333, ans=0.125 2023-10-12 16:16:27,021 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1112561.3333333333, ans=0.125 2023-10-12 16:16:32,032 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1112608.0, ans=0.015 2023-10-12 16:16:44,637 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1112654.6666666667, ans=0.125 2023-10-12 16:17:06,201 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.41 vs. limit=22.5 2023-10-12 16:17:46,841 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.540e+02 1.818e+02 2.008e+02 2.184e+02 2.882e+02, threshold=4.015e+02, percent-clipped=0.0 2023-10-12 16:17:53,200 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1112934.6666666667, ans=0.0 2023-10-12 16:17:57,223 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.14 vs. limit=15.0 2023-10-12 16:18:03,196 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1112981.3333333333, ans=0.125 2023-10-12 16:18:10,655 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1112981.3333333333, ans=0.125 2023-10-12 16:18:19,934 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1113028.0, ans=0.125 2023-10-12 16:18:23,271 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1113028.0, ans=0.125 2023-10-12 16:18:26,279 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.08 vs. 
limit=22.5 2023-10-12 16:18:55,541 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1113168.0, ans=0.2 2023-10-12 16:18:57,554 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1113168.0, ans=0.0 2023-10-12 16:18:57,600 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1113168.0, ans=0.0 2023-10-12 16:19:27,334 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1113308.0, ans=0.0 2023-10-12 16:19:38,876 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1113354.6666666667, ans=0.1 2023-10-12 16:19:44,804 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.749e+02 1.947e+02 2.162e+02 3.496e+02, threshold=3.894e+02, percent-clipped=0.0 2023-10-12 16:20:05,000 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1113448.0, ans=0.125 2023-10-12 16:20:05,040 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1113448.0, ans=0.125 2023-10-12 16:20:21,867 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1113494.6666666667, ans=0.0 2023-10-12 16:20:30,073 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1113541.3333333333, ans=0.04949747468305833 2023-10-12 16:20:46,818 INFO [train.py:1031] (0/4) Epoch 18, batch 6500, loss[loss=0.1855, simple_loss=0.2821, pruned_loss=0.04442, over 16907.00 frames. ], tot_loss[loss=0.1932, simple_loss=0.2838, pruned_loss=0.0513, over 31551910.86 frames. ], batch size: 165, lr: 1.93e-03, grad_scale: 32.0 2023-10-12 16:20:51,294 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1113634.6666666667, ans=0.125 2023-10-12 16:21:02,381 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.41 vs. limit=12.0 2023-10-12 16:21:03,469 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1113681.3333333333, ans=0.2 2023-10-12 16:21:03,805 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.62 vs. 
limit=10.0 2023-10-12 16:21:10,747 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1113681.3333333333, ans=0.2 2023-10-12 16:21:14,078 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1113728.0, ans=0.05 2023-10-12 16:21:32,172 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1113774.6666666667, ans=0.0 2023-10-12 16:21:49,645 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.778e+02 1.940e+02 2.143e+02 3.146e+02, threshold=3.881e+02, percent-clipped=0.0 2023-10-12 16:21:57,362 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1113868.0, ans=0.0 2023-10-12 16:22:09,164 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1113914.6666666667, ans=0.125 2023-10-12 16:22:38,984 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.30 vs. limit=15.0 2023-10-12 16:22:49,025 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1114101.3333333333, ans=0.125 2023-10-12 16:22:53,661 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1114101.3333333333, ans=0.125 2023-10-12 16:23:10,752 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1114194.6666666667, ans=0.125 2023-10-12 16:23:19,979 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.25 vs. limit=15.0 2023-10-12 16:23:38,938 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.454e+02 1.793e+02 1.979e+02 2.241e+02 3.054e+02, threshold=3.958e+02, percent-clipped=0.0 2023-10-12 16:23:41,122 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1114288.0, ans=0.125 2023-10-12 16:23:51,875 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1114334.6666666667, ans=0.125 2023-10-12 16:23:54,217 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1114381.3333333333, ans=0.1 2023-10-12 16:23:55,937 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1114381.3333333333, ans=0.05 2023-10-12 16:24:21,167 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1114474.6666666667, ans=0.0 2023-10-12 16:24:26,359 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.31 vs. 
limit=15.0 2023-10-12 16:24:27,118 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1114521.3333333333, ans=0.2 2023-10-12 16:24:28,944 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1114521.3333333333, ans=0.0 2023-10-12 16:24:31,504 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1114521.3333333333, ans=15.0 2023-10-12 16:24:35,643 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1114568.0, ans=0.035 2023-10-12 16:24:38,468 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1114568.0, ans=0.125 2023-10-12 16:24:39,970 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1114568.0, ans=0.0 2023-10-12 16:25:12,789 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.62 vs. limit=15.0 2023-10-12 16:25:20,411 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1114754.6666666667, ans=0.1 2023-10-12 16:25:29,162 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.396e+02 1.676e+02 1.862e+02 2.166e+02 3.279e+02, threshold=3.723e+02, percent-clipped=0.0 2023-10-12 16:25:40,997 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1114801.3333333333, ans=0.2 2023-10-12 16:26:13,056 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1114941.3333333333, ans=0.125 2023-10-12 16:26:14,343 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1114941.3333333333, ans=0.125 2023-10-12 16:26:44,737 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1115034.6666666667, ans=0.0 2023-10-12 16:26:56,878 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1115081.3333333333, ans=10.0 2023-10-12 16:26:58,274 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1115081.3333333333, ans=0.0 2023-10-12 16:27:04,693 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1115081.3333333333, ans=0.125 2023-10-12 16:27:14,917 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1115128.0, ans=0.2 2023-10-12 16:27:16,194 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.34 vs. 
limit=15.0 2023-10-12 16:27:16,817 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1115174.6666666667, ans=0.125 2023-10-12 16:27:16,835 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1115174.6666666667, ans=0.125 2023-10-12 16:27:39,130 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.692e+02 1.885e+02 2.165e+02 2.971e+02, threshold=3.770e+02, percent-clipped=0.0 2023-10-12 16:28:17,016 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1115408.0, ans=0.125 2023-10-12 16:28:25,873 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.42 vs. limit=12.0 2023-10-12 16:28:31,007 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1115454.6666666667, ans=0.125 2023-10-12 16:28:50,077 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.39 vs. limit=12.0 2023-10-12 16:28:55,607 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.94 vs. limit=15.0 2023-10-12 16:29:02,850 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.whiten.whitening_limit, batch_count=1115594.6666666667, ans=12.0 2023-10-12 16:29:07,148 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1115594.6666666667, ans=0.125 2023-10-12 16:29:11,302 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1115641.3333333333, ans=0.125 2023-10-12 16:29:14,506 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1115641.3333333333, ans=0.0 2023-10-12 16:29:18,084 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=1115641.3333333333, ans=10.0 2023-10-12 16:29:21,517 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1115688.0, ans=0.0 2023-10-12 16:29:24,800 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1115688.0, ans=0.125 2023-10-12 16:29:27,572 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1115688.0, ans=0.0 2023-10-12 16:29:28,722 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.12 vs. 
limit=15.0 2023-10-12 16:29:29,019 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.456e+02 1.746e+02 1.923e+02 2.163e+02 2.844e+02, threshold=3.846e+02, percent-clipped=0.0 2023-10-12 16:29:41,113 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1115781.3333333333, ans=0.09899494936611666 2023-10-12 16:30:04,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1115874.6666666667, ans=0.0 2023-10-12 16:30:16,322 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1115921.3333333333, ans=0.0 2023-10-12 16:30:23,601 INFO [train.py:1031] (0/4) Epoch 18, batch 7000, loss[loss=0.1835, simple_loss=0.2746, pruned_loss=0.04614, over 16417.00 frames. ], tot_loss[loss=0.1932, simple_loss=0.284, pruned_loss=0.05119, over 31813783.70 frames. ], batch size: 50, lr: 1.93e-03, grad_scale: 16.0 2023-10-12 16:30:42,173 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1115968.0, ans=0.2 2023-10-12 16:30:46,629 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1115968.0, ans=0.125 2023-10-12 16:30:56,674 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1116014.6666666667, ans=0.125 2023-10-12 16:31:12,250 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1116061.3333333333, ans=0.1 2023-10-12 16:31:12,550 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=5.15 vs. 
limit=15.0 2023-10-12 16:31:17,303 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1116061.3333333333, ans=0.0 2023-10-12 16:31:21,293 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1116061.3333333333, ans=0.0 2023-10-12 16:31:29,603 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1116108.0, ans=0.125 2023-10-12 16:31:32,438 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1116108.0, ans=0.125 2023-10-12 16:31:35,874 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1116154.6666666667, ans=0.0 2023-10-12 16:31:44,485 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.501e+02 1.759e+02 1.878e+02 2.079e+02 2.701e+02, threshold=3.756e+02, percent-clipped=0.0 2023-10-12 16:31:47,059 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1116201.3333333333, ans=0.125 2023-10-12 16:31:52,026 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1116201.3333333333, ans=0.0 2023-10-12 16:31:55,318 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1116201.3333333333, ans=0.0 2023-10-12 16:32:07,928 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=9.83 vs. limit=22.5 2023-10-12 16:32:08,032 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.23 vs. limit=22.5 2023-10-12 16:32:26,150 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=22.5 2023-10-12 16:32:31,436 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1116341.3333333333, ans=0.0 2023-10-12 16:32:42,028 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1116388.0, ans=0.0 2023-10-12 16:32:59,599 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.39 vs. limit=22.5 2023-10-12 16:33:04,942 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1116481.3333333333, ans=0.0 2023-10-12 16:33:14,627 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1116528.0, ans=0.125 2023-10-12 16:33:16,673 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1116528.0, ans=0.125 2023-10-12 16:33:44,646 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.42 vs. 
limit=22.5 2023-10-12 16:33:51,993 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.512e+02 1.828e+02 1.944e+02 2.126e+02 2.908e+02, threshold=3.889e+02, percent-clipped=0.0 2023-10-12 16:33:53,410 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1116668.0, ans=0.05 2023-10-12 16:34:04,709 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 16:34:24,248 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.97 vs. limit=22.5 2023-10-12 16:34:42,387 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 16:34:57,556 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1116901.3333333333, ans=0.125 2023-10-12 16:34:57,703 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1116901.3333333333, ans=0.125 2023-10-12 16:35:08,480 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1116948.0, ans=0.125 2023-10-12 16:35:30,086 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1116994.6666666667, ans=0.125 2023-10-12 16:35:31,264 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=5.49 vs. limit=15.0 2023-10-12 16:35:34,168 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1116994.6666666667, ans=0.1 2023-10-12 16:35:56,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1117088.0, ans=0.1 2023-10-12 16:35:59,835 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 1.714e+02 1.917e+02 2.065e+02 2.614e+02, threshold=3.834e+02, percent-clipped=0.0 2023-10-12 16:36:04,746 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1117134.6666666667, ans=0.125 2023-10-12 16:36:07,836 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.25 vs. limit=15.0 2023-10-12 16:36:27,727 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten.whitening_limit, batch_count=1117228.0, ans=15.0 2023-10-12 16:36:42,996 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1117274.6666666667, ans=0.2 2023-10-12 16:36:57,287 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.73 vs. 
limit=15.0 2023-10-12 16:36:58,000 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=1117321.3333333333, ans=0.025 2023-10-12 16:37:12,437 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1117368.0, ans=0.1 2023-10-12 16:37:12,581 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1117368.0, ans=0.1 2023-10-12 16:37:14,891 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 16:37:21,782 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1117414.6666666667, ans=0.2 2023-10-12 16:37:27,089 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1117414.6666666667, ans=0.0 2023-10-12 16:37:29,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1117414.6666666667, ans=0.125 2023-10-12 16:37:31,866 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1117414.6666666667, ans=0.125 2023-10-12 16:37:48,368 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1117508.0, ans=0.0 2023-10-12 16:37:50,110 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.64 vs. limit=22.5 2023-10-12 16:38:08,402 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.64 vs. 
limit=15.0 2023-10-12 16:38:10,897 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1117554.6666666667, ans=0.125 2023-10-12 16:38:11,503 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.406e+02 1.732e+02 1.887e+02 2.175e+02 5.635e+02, threshold=3.774e+02, percent-clipped=1.0 2023-10-12 16:38:18,031 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=1117601.3333333333, ans=0.05 2023-10-12 16:38:46,423 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1117741.3333333333, ans=0.0 2023-10-12 16:38:53,956 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1117741.3333333333, ans=0.0 2023-10-12 16:38:57,280 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1117788.0, ans=0.125 2023-10-12 16:39:04,197 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1117788.0, ans=0.125 2023-10-12 16:39:12,213 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1117834.6666666667, ans=0.125 2023-10-12 16:39:19,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1117881.3333333333, ans=0.125 2023-10-12 16:39:27,342 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=1117928.0, ans=0.2 2023-10-12 16:39:57,914 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1118068.0, ans=0.2 2023-10-12 16:39:58,549 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.518e+02 1.794e+02 1.973e+02 2.304e+02 3.131e+02, threshold=3.945e+02, percent-clipped=0.0 2023-10-12 16:40:29,605 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.31 vs. limit=22.5 2023-10-12 16:40:32,566 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1118208.0, ans=0.1 2023-10-12 16:40:33,500 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 16:40:48,891 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1118254.6666666667, ans=0.125 2023-10-12 16:40:52,998 INFO [train.py:1031] (0/4) Epoch 18, batch 7500, loss[loss=0.2012, simple_loss=0.2916, pruned_loss=0.05542, over 16956.00 frames. ], tot_loss[loss=0.1931, simple_loss=0.2838, pruned_loss=0.05121, over 32035892.07 frames. ], batch size: 123, lr: 1.93e-03, grad_scale: 8.0 2023-10-12 16:41:06,046 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1118348.0, ans=0.125 2023-10-12 16:41:30,893 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.94 vs. 
limit=15.0 2023-10-12 16:41:48,208 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.476e+02 1.851e+02 2.037e+02 2.314e+02 2.738e+02, threshold=4.075e+02, percent-clipped=0.0 2023-10-12 16:41:48,835 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1118534.6666666667, ans=0.0 2023-10-12 16:41:51,920 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1118534.6666666667, ans=0.125 2023-10-12 16:42:02,911 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1118581.3333333333, ans=0.0 2023-10-12 16:42:08,610 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1118581.3333333333, ans=0.125 2023-10-12 16:42:12,647 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1118628.0, ans=0.125 2023-10-12 16:42:19,895 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1118674.6666666667, ans=0.0 2023-10-12 16:42:22,678 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1118674.6666666667, ans=0.125 2023-10-12 16:42:32,539 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.37 vs. limit=10.0 2023-10-12 16:42:36,468 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1118721.3333333333, ans=0.125 2023-10-12 16:42:37,238 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1118721.3333333333, ans=0.0 2023-10-12 16:42:48,509 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1118768.0, ans=0.125 2023-10-12 16:43:09,923 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.26 vs. 
limit=15.0 2023-10-12 16:43:13,817 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1118861.3333333333, ans=0.0 2023-10-12 16:43:55,181 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.472e+02 1.691e+02 1.888e+02 2.199e+02 3.249e+02, threshold=3.777e+02, percent-clipped=0.0 2023-10-12 16:43:58,015 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1119001.3333333333, ans=0.1 2023-10-12 16:44:00,931 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1119001.3333333333, ans=0.125 2023-10-12 16:44:03,681 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1119001.3333333333, ans=0.0 2023-10-12 16:44:06,379 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1119048.0, ans=0.0 2023-10-12 16:44:15,523 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1119048.0, ans=0.125 2023-10-12 16:44:52,683 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.27 vs. limit=15.0 2023-10-12 16:44:56,905 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1119234.6666666667, ans=0.2 2023-10-12 16:45:18,758 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.71 vs. limit=6.0 2023-10-12 16:45:31,543 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1119374.6666666667, ans=0.2 2023-10-12 16:45:36,182 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.47 vs. limit=22.5 2023-10-12 16:45:50,792 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 1.704e+02 1.865e+02 2.081e+02 2.789e+02, threshold=3.729e+02, percent-clipped=0.0 2023-10-12 16:46:02,245 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1119514.6666666667, ans=0.125 2023-10-12 16:46:11,550 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1119514.6666666667, ans=0.09899494936611666 2023-10-12 16:46:45,026 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.44 vs. limit=15.0 2023-10-12 16:46:48,333 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1119654.6666666667, ans=0.1 2023-10-12 16:47:27,741 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1119748.0, ans=0.2 2023-10-12 16:47:54,840 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.67 vs. 
limit=15.0 2023-10-12 16:48:04,317 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1119841.3333333333, ans=0.2 2023-10-12 16:48:16,971 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1119888.0, ans=0.125 2023-10-12 16:48:27,246 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1119934.6666666667, ans=0.125 2023-10-12 16:48:27,918 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.512e+02 1.774e+02 1.967e+02 2.159e+02 2.955e+02, threshold=3.933e+02, percent-clipped=0.0 2023-10-12 16:48:43,220 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-240000.pt 2023-10-12 16:48:56,767 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=10.01 vs. limit=10.0 2023-10-12 16:49:04,325 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1120028.0, ans=0.1 2023-10-12 16:49:06,659 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.17 vs. limit=15.0 2023-10-12 16:49:23,936 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1120121.3333333333, ans=0.95 2023-10-12 16:49:29,813 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1120168.0, ans=0.125 2023-10-12 16:49:34,003 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.05 vs. limit=10.0 2023-10-12 16:50:02,153 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1120261.3333333333, ans=0.0 2023-10-12 16:50:11,307 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1120308.0, ans=0.0 2023-10-12 16:50:17,166 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1120354.6666666667, ans=0.2 2023-10-12 16:50:27,921 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1120354.6666666667, ans=0.125 2023-10-12 16:50:29,540 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.398e+02 1.702e+02 1.864e+02 2.163e+02 3.610e+02, threshold=3.728e+02, percent-clipped=0.0 2023-10-12 16:50:44,508 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1120448.0, ans=0.125 2023-10-12 16:50:49,764 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.78 vs. limit=15.0 2023-10-12 16:51:02,630 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.10 vs. 
limit=22.5 2023-10-12 16:51:06,507 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1120541.3333333333, ans=0.125 2023-10-12 16:51:11,229 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1120541.3333333333, ans=0.125 2023-10-12 16:51:11,270 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.163e-02 2023-10-12 16:51:11,458 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.96 vs. limit=15.0 2023-10-12 16:51:28,607 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.41 vs. limit=15.0 2023-10-12 16:51:31,094 INFO [train.py:1031] (0/4) Epoch 18, batch 8000, loss[loss=0.1933, simple_loss=0.2837, pruned_loss=0.05149, over 16856.00 frames. ], tot_loss[loss=0.192, simple_loss=0.2831, pruned_loss=0.05049, over 32213063.20 frames. ], batch size: 165, lr: 1.93e-03, grad_scale: 32.0 2023-10-12 16:51:33,347 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1120634.6666666667, ans=0.0 2023-10-12 16:51:52,044 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1120728.0, ans=0.2 2023-10-12 16:51:53,119 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1120728.0, ans=10.0 2023-10-12 16:52:01,658 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1120728.0, ans=0.125 2023-10-12 16:52:02,692 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1120774.6666666667, ans=0.0 2023-10-12 16:52:18,040 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1120821.3333333333, ans=0.0 2023-10-12 16:52:22,448 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 16:52:27,354 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.401e+02 1.683e+02 1.859e+02 2.109e+02 3.287e+02, threshold=3.719e+02, percent-clipped=0.0 2023-10-12 16:52:39,219 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1120914.6666666667, ans=0.0 2023-10-12 16:52:44,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1120914.6666666667, ans=0.125 2023-10-12 16:52:57,172 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1120961.3333333333, ans=0.0 2023-10-12 16:53:05,007 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1121008.0, ans=0.125 2023-10-12 16:53:06,701 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1121008.0, ans=0.125 2023-10-12 16:53:14,721 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1121054.6666666667, ans=0.125 2023-10-12 16:53:20,481 INFO [scaling.py:199] (0/4) 
ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1121054.6666666667, ans=0.125 2023-10-12 16:53:28,893 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1121101.3333333333, ans=0.0 2023-10-12 16:54:08,804 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1121288.0, ans=0.125 2023-10-12 16:54:20,899 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.63 vs. limit=6.0 2023-10-12 16:54:26,336 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.458e+02 1.729e+02 1.866e+02 2.032e+02 2.964e+02, threshold=3.732e+02, percent-clipped=0.0 2023-10-12 16:54:35,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1121334.6666666667, ans=0.125 2023-10-12 16:54:43,062 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.48 vs. limit=10.0 2023-10-12 16:54:57,085 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1121381.3333333333, ans=0.125 2023-10-12 16:55:07,347 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.19 vs. limit=10.0 2023-10-12 16:55:11,302 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1121474.6666666667, ans=0.125 2023-10-12 16:55:11,462 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.69 vs. limit=10.0 2023-10-12 16:55:13,646 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.48 vs. limit=15.0 2023-10-12 16:55:30,509 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1121521.3333333333, ans=0.0 2023-10-12 16:55:34,422 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1121568.0, ans=0.0 2023-10-12 16:55:51,444 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1121614.6666666667, ans=0.125 2023-10-12 16:56:21,862 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.02 vs. 
limit=10.0 2023-10-12 16:56:25,253 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1121708.0, ans=0.125 2023-10-12 16:56:42,839 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1121801.3333333333, ans=0.125 2023-10-12 16:56:43,476 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.478e+02 1.676e+02 1.814e+02 2.076e+02 2.729e+02, threshold=3.627e+02, percent-clipped=0.0 2023-10-12 16:56:47,186 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1121801.3333333333, ans=0.125 2023-10-12 16:57:06,263 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1121848.0, ans=0.125 2023-10-12 16:57:27,742 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.35 vs. limit=15.0 2023-10-12 16:57:42,473 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=13.47 vs. limit=15.0 2023-10-12 16:57:44,461 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1121988.0, ans=0.125 2023-10-12 16:57:51,251 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.24 vs. limit=22.5 2023-10-12 16:57:52,723 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1122034.6666666667, ans=0.2 2023-10-12 16:58:01,007 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=13.91 vs. limit=15.0 2023-10-12 16:58:03,362 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1122081.3333333333, ans=0.125 2023-10-12 16:58:20,064 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1122128.0, ans=0.125 2023-10-12 16:58:30,210 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.01 vs. limit=22.5 2023-10-12 16:58:39,616 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.68 vs. limit=15.0 2023-10-12 16:58:43,354 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1122221.3333333333, ans=0.125 2023-10-12 16:58:51,633 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.06 vs. 
limit=15.0 2023-10-12 16:58:52,805 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.736e+02 1.909e+02 2.151e+02 3.635e+02, threshold=3.819e+02, percent-clipped=1.0 2023-10-12 16:58:58,078 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1122268.0, ans=0.2 2023-10-12 16:59:26,188 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1122408.0, ans=0.125 2023-10-12 16:59:33,642 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1122408.0, ans=0.125 2023-10-12 17:00:03,307 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1122548.0, ans=0.2 2023-10-12 17:00:13,282 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1122548.0, ans=10.0 2023-10-12 17:00:28,230 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 17:00:58,854 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.438e+02 1.693e+02 1.898e+02 2.104e+02 2.605e+02, threshold=3.795e+02, percent-clipped=0.0 2023-10-12 17:01:00,831 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.63 vs. limit=6.0 2023-10-12 17:01:23,804 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.91 vs. limit=22.5 2023-10-12 17:01:25,289 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1122828.0, ans=0.125 2023-10-12 17:01:27,212 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1122828.0, ans=0.125 2023-10-12 17:01:30,324 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1122828.0, ans=0.1 2023-10-12 17:01:35,636 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1122828.0, ans=0.0 2023-10-12 17:01:58,651 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1122921.3333333333, ans=0.0 2023-10-12 17:02:05,083 INFO [train.py:1031] (0/4) Epoch 18, batch 8500, loss[loss=0.2542, simple_loss=0.32, pruned_loss=0.09425, over 15648.00 frames. ], tot_loss[loss=0.1924, simple_loss=0.2836, pruned_loss=0.05055, over 32369537.40 frames. ], batch size: 350, lr: 1.92e-03, grad_scale: 32.0 2023-10-12 17:02:11,876 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1122968.0, ans=0.125 2023-10-12 17:02:26,399 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1123014.6666666667, ans=0.0 2023-10-12 17:02:33,726 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1123061.3333333333, ans=0.1 2023-10-12 17:02:42,768 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.81 vs. 
limit=15.0 2023-10-12 17:02:45,426 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1123108.0, ans=0.0 2023-10-12 17:02:47,746 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1123108.0, ans=0.125 2023-10-12 17:02:53,111 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1123154.6666666667, ans=0.125 2023-10-12 17:03:03,480 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.506e+02 1.722e+02 1.927e+02 2.226e+02 2.929e+02, threshold=3.854e+02, percent-clipped=0.0 2023-10-12 17:03:05,161 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1123201.3333333333, ans=0.0 2023-10-12 17:03:05,201 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1123201.3333333333, ans=10.0 2023-10-12 17:03:17,379 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.16 vs. limit=15.0 2023-10-12 17:03:18,216 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1123248.0, ans=0.125 2023-10-12 17:04:03,726 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.33 vs. limit=15.0 2023-10-12 17:04:06,148 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1123434.6666666667, ans=0.2 2023-10-12 17:04:30,147 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1123528.0, ans=0.1 2023-10-12 17:04:42,921 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1123574.6666666667, ans=0.125 2023-10-12 17:04:58,753 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1123621.3333333333, ans=0.0 2023-10-12 17:04:58,762 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1123621.3333333333, ans=0.1 2023-10-12 17:05:08,116 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.371e+02 1.740e+02 1.907e+02 2.187e+02 3.066e+02, threshold=3.814e+02, percent-clipped=0.0 2023-10-12 17:05:28,807 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1123714.6666666667, ans=0.125 2023-10-12 17:05:34,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1123761.3333333333, ans=0.0 2023-10-12 17:05:49,687 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1123808.0, ans=0.0 2023-10-12 17:05:58,307 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.85 vs. 
limit=10.0 2023-10-12 17:06:22,419 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1123901.3333333333, ans=0.2 2023-10-12 17:06:23,504 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1123901.3333333333, ans=0.1 2023-10-12 17:06:40,071 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1123994.6666666667, ans=0.0 2023-10-12 17:06:56,007 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1124041.3333333333, ans=0.125 2023-10-12 17:06:57,324 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.97 vs. limit=22.5 2023-10-12 17:07:04,887 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.24 vs. limit=15.0 2023-10-12 17:07:15,606 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.372e+02 1.712e+02 1.942e+02 2.231e+02 3.364e+02, threshold=3.885e+02, percent-clipped=0.0 2023-10-12 17:07:17,920 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1124134.6666666667, ans=0.1 2023-10-12 17:07:22,137 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1124134.6666666667, ans=0.2 2023-10-12 17:07:30,003 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1124181.3333333333, ans=0.1 2023-10-12 17:07:53,610 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1124274.6666666667, ans=0.0 2023-10-12 17:08:00,475 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1124274.6666666667, ans=0.0 2023-10-12 17:08:10,047 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1124321.3333333333, ans=0.125 2023-10-12 17:08:26,029 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1124414.6666666667, ans=0.125 2023-10-12 17:08:30,499 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 17:09:06,779 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1124554.6666666667, ans=0.1 2023-10-12 17:09:13,133 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.424e+02 1.690e+02 1.881e+02 2.115e+02 2.777e+02, threshold=3.762e+02, percent-clipped=0.0 2023-10-12 17:10:12,247 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1124834.6666666667, ans=0.1 2023-10-12 17:10:52,931 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.11 vs. 
limit=15.0 2023-10-12 17:11:03,121 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.454e+02 1.792e+02 1.962e+02 2.297e+02 3.365e+02, threshold=3.923e+02, percent-clipped=0.0 2023-10-12 17:11:24,653 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1125161.3333333333, ans=0.125 2023-10-12 17:11:25,017 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.42 vs. limit=15.0 2023-10-12 17:11:35,400 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.31 vs. limit=12.0 2023-10-12 17:11:43,889 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=12.0 2023-10-12 17:11:45,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1125254.6666666667, ans=0.125 2023-10-12 17:11:45,557 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1125254.6666666667, ans=0.0 2023-10-12 17:11:54,230 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1125254.6666666667, ans=0.1 2023-10-12 17:11:56,914 INFO [train.py:1031] (0/4) Epoch 18, batch 9000, loss[loss=0.2048, simple_loss=0.2908, pruned_loss=0.05938, over 15502.00 frames. ], tot_loss[loss=0.1918, simple_loss=0.2829, pruned_loss=0.05029, over 32478264.50 frames. ], batch size: 35, lr: 1.92e-03, grad_scale: 32.0 2023-10-12 17:11:58,318 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1125301.3333333333, ans=0.95 2023-10-12 17:12:11,647 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.31 vs. limit=15.0 2023-10-12 17:12:15,034 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1125348.0, ans=0.2 2023-10-12 17:12:24,836 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1125394.6666666667, ans=0.0 2023-10-12 17:12:26,183 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.03 vs. limit=15.0 2023-10-12 17:12:44,069 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.50 vs. limit=22.5 2023-10-12 17:12:52,036 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1125488.0, ans=0.125 2023-10-12 17:12:55,707 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.843e+02 1.996e+02 2.183e+02 3.098e+02, threshold=3.993e+02, percent-clipped=0.0 2023-10-12 17:13:11,783 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1125581.3333333333, ans=0.125 2023-10-12 17:13:11,887 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.40 vs. 
limit=22.5 2023-10-12 17:13:16,726 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1125628.0, ans=0.07 2023-10-12 17:14:00,268 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1125814.6666666667, ans=0.05 2023-10-12 17:14:07,826 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.26 vs. limit=15.0 2023-10-12 17:14:08,721 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1125861.3333333333, ans=0.1 2023-10-12 17:14:11,225 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1125861.3333333333, ans=0.0 2023-10-12 17:14:23,533 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.77 vs. limit=15.0 2023-10-12 17:14:24,151 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1125908.0, ans=0.125 2023-10-12 17:14:44,583 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 1.749e+02 1.890e+02 2.092e+02 2.837e+02, threshold=3.781e+02, percent-clipped=0.0 2023-10-12 17:14:50,465 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1126001.3333333333, ans=0.125 2023-10-12 17:15:00,882 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.13 vs. limit=15.0 2023-10-12 17:15:09,011 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1126094.6666666667, ans=0.1 2023-10-12 17:15:20,409 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.25 vs. limit=22.5 2023-10-12 17:15:37,793 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1126234.6666666667, ans=0.05 2023-10-12 17:15:46,385 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=5.78 vs. 
limit=12.0 2023-10-12 17:16:00,718 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1126328.0, ans=0.125 2023-10-12 17:16:18,481 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1126374.6666666667, ans=0.125 2023-10-12 17:16:27,299 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1126421.3333333333, ans=0.125 2023-10-12 17:16:31,415 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.534e+02 1.814e+02 2.013e+02 2.219e+02 2.979e+02, threshold=4.027e+02, percent-clipped=0.0 2023-10-12 17:16:41,846 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1126514.6666666667, ans=0.125 2023-10-12 17:16:45,371 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.23 vs. limit=15.0 2023-10-12 17:16:55,329 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1126561.3333333333, ans=0.125 2023-10-12 17:16:56,135 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1126561.3333333333, ans=0.125 2023-10-12 17:17:05,900 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1126608.0, ans=0.125 2023-10-12 17:17:16,234 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1126654.6666666667, ans=0.0 2023-10-12 17:17:24,993 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1126701.3333333333, ans=0.2 2023-10-12 17:17:25,034 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1126701.3333333333, ans=0.125 2023-10-12 17:17:27,722 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 17:17:32,337 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.66 vs. 
limit=22.5 2023-10-12 17:17:33,007 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 17:17:49,781 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1126794.6666666667, ans=0.125 2023-10-12 17:17:55,017 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1126794.6666666667, ans=0.125 2023-10-12 17:18:02,741 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1126841.3333333333, ans=0.125 2023-10-12 17:18:11,654 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff2.min_abs, batch_count=1126888.0, ans=0.1 2023-10-12 17:18:15,578 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1126888.0, ans=0.125 2023-10-12 17:18:24,299 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1126934.6666666667, ans=0.2 2023-10-12 17:18:24,928 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.542e+02 1.755e+02 1.927e+02 2.145e+02 3.447e+02, threshold=3.854e+02, percent-clipped=0.0 2023-10-12 17:18:53,962 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.32 vs. limit=22.5 2023-10-12 17:18:58,182 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=1127028.0, ans=0.05 2023-10-12 17:19:12,189 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.45 vs. limit=15.0 2023-10-12 17:19:33,466 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1127168.0, ans=0.1 2023-10-12 17:19:37,099 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1127214.6666666667, ans=0.0 2023-10-12 17:19:58,962 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1127261.3333333333, ans=0.0 2023-10-12 17:20:03,119 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-12 17:20:19,648 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1127354.6666666667, ans=0.125 2023-10-12 17:20:26,433 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.61 vs. 
limit=15.0 2023-10-12 17:20:28,959 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.367e+02 1.814e+02 2.044e+02 2.273e+02 3.060e+02, threshold=4.088e+02, percent-clipped=0.0 2023-10-12 17:20:31,266 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1127401.3333333333, ans=0.125 2023-10-12 17:20:49,333 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1127494.6666666667, ans=0.125 2023-10-12 17:21:13,707 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1127588.0, ans=0.04949747468305833 2023-10-12 17:21:16,166 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1127588.0, ans=0.0 2023-10-12 17:21:22,150 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1127588.0, ans=0.125 2023-10-12 17:21:25,788 INFO [train.py:1031] (0/4) Epoch 18, batch 9500, loss[loss=0.207, simple_loss=0.3008, pruned_loss=0.0566, over 16662.00 frames. ], tot_loss[loss=0.1925, simple_loss=0.2838, pruned_loss=0.05061, over 32580116.62 frames. ], batch size: 241, lr: 1.92e-03, grad_scale: 16.0 2023-10-12 17:21:37,422 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1127681.3333333333, ans=0.125 2023-10-12 17:21:43,779 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.41 vs. limit=15.0 2023-10-12 17:21:47,498 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.58 vs. limit=15.0 2023-10-12 17:21:56,255 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1127728.0, ans=0.2 2023-10-12 17:22:09,037 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.54 vs. limit=15.0 2023-10-12 17:22:10,481 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.60 vs. limit=6.0 2023-10-12 17:22:16,149 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1127821.3333333333, ans=0.125 2023-10-12 17:22:19,180 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1127821.3333333333, ans=0.2 2023-10-12 17:22:27,257 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.92 vs. 
limit=15.0 2023-10-12 17:22:28,640 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.570e+02 1.784e+02 1.917e+02 2.234e+02 3.245e+02, threshold=3.833e+02, percent-clipped=0.0 2023-10-12 17:22:31,471 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1127868.0, ans=0.0 2023-10-12 17:22:34,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1127868.0, ans=0.125 2023-10-12 17:22:54,618 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.11 vs. limit=15.0 2023-10-12 17:22:55,154 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1127961.3333333333, ans=0.0 2023-10-12 17:22:57,524 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.60 vs. limit=15.0 2023-10-12 17:23:01,769 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.21 vs. limit=15.0 2023-10-12 17:23:03,330 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1128008.0, ans=0.0 2023-10-12 17:23:17,256 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1128054.6666666667, ans=0.125 2023-10-12 17:23:18,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1128054.6666666667, ans=10.0 2023-10-12 17:23:19,273 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.40 vs. limit=15.0 2023-10-12 17:23:52,513 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1128194.6666666667, ans=0.125 2023-10-12 17:24:09,753 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.43 vs. limit=15.0 2023-10-12 17:24:11,949 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.90 vs. 
limit=12.0 2023-10-12 17:24:14,469 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1128288.0, ans=0.1 2023-10-12 17:24:22,831 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1128334.6666666667, ans=0.125 2023-10-12 17:24:27,608 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.708e+02 1.880e+02 2.139e+02 2.628e+02, threshold=3.760e+02, percent-clipped=0.0 2023-10-12 17:24:37,174 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1128381.3333333333, ans=0.125 2023-10-12 17:24:46,335 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1128381.3333333333, ans=0.1 2023-10-12 17:24:50,554 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.00 vs. limit=15.0 2023-10-12 17:24:52,439 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1128428.0, ans=0.125 2023-10-12 17:25:15,916 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1128521.3333333333, ans=0.0 2023-10-12 17:25:26,078 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1128568.0, ans=0.125 2023-10-12 17:25:29,268 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1128568.0, ans=0.2 2023-10-12 17:25:34,561 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1128614.6666666667, ans=0.1 2023-10-12 17:25:42,646 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1128661.3333333333, ans=0.125 2023-10-12 17:25:47,433 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.87 vs. limit=12.0 2023-10-12 17:25:56,398 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1128708.0, ans=0.125 2023-10-12 17:25:58,530 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1128708.0, ans=0.07 2023-10-12 17:26:00,649 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.52 vs. limit=22.5 2023-10-12 17:26:10,907 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1128754.6666666667, ans=0.125 2023-10-12 17:26:12,241 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.18 vs. 
limit=15.0 2023-10-12 17:26:20,712 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.297e+02 1.749e+02 1.924e+02 2.107e+02 2.814e+02, threshold=3.848e+02, percent-clipped=0.0 2023-10-12 17:26:25,847 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1128848.0, ans=0.125 2023-10-12 17:26:33,771 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1128848.0, ans=0.125 2023-10-12 17:26:36,386 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1128848.0, ans=0.125 2023-10-12 17:26:41,340 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1128894.6666666667, ans=0.07 2023-10-12 17:26:45,336 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1128894.6666666667, ans=0.125 2023-10-12 17:27:07,897 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1128988.0, ans=0.125 2023-10-12 17:27:32,286 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1129081.3333333333, ans=0.125 2023-10-12 17:27:39,767 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=1129128.0, ans=0.1 2023-10-12 17:27:54,135 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1129174.6666666667, ans=0.125 2023-10-12 17:27:56,903 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.32 vs. limit=15.0 2023-10-12 17:28:09,332 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1129268.0, ans=0.125 2023-10-12 17:28:10,105 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1129268.0, ans=0.0 2023-10-12 17:28:14,287 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.736e+02 1.885e+02 2.105e+02 2.832e+02, threshold=3.770e+02, percent-clipped=0.0 2023-10-12 17:28:21,084 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.89 vs. limit=15.0 2023-10-12 17:28:24,000 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1129314.6666666667, ans=0.0 2023-10-12 17:28:32,816 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.52 vs. 
limit=6.0 2023-10-12 17:28:44,604 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1129408.0, ans=0.125 2023-10-12 17:29:04,851 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1129454.6666666667, ans=0.125 2023-10-12 17:29:32,382 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1129594.6666666667, ans=0.125 2023-10-12 17:29:39,977 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1129641.3333333333, ans=0.0 2023-10-12 17:30:01,876 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1129734.6666666667, ans=0.1 2023-10-12 17:30:03,505 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.724e+02 1.936e+02 2.207e+02 2.932e+02, threshold=3.872e+02, percent-clipped=0.0 2023-10-12 17:30:20,964 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1129781.3333333333, ans=0.125 2023-10-12 17:30:30,129 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1129828.0, ans=0.125 2023-10-12 17:30:31,104 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1129828.0, ans=0.1 2023-10-12 17:30:35,470 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.40 vs. limit=15.0 2023-10-12 17:30:50,954 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1129921.3333333333, ans=0.2 2023-10-12 17:30:53,599 INFO [train.py:1031] (0/4) Epoch 18, batch 10000, loss[loss=0.1832, simple_loss=0.2748, pruned_loss=0.04579, over 16985.00 frames. ], tot_loss[loss=0.1916, simple_loss=0.2828, pruned_loss=0.05024, over 32617404.55 frames. ], batch size: 77, lr: 1.92e-03, grad_scale: 32.0 2023-10-12 17:31:12,508 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1130014.6666666667, ans=0.0 2023-10-12 17:31:25,014 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1130061.3333333333, ans=0.1 2023-10-12 17:31:33,113 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1130108.0, ans=0.0 2023-10-12 17:31:34,330 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1130108.0, ans=0.125 2023-10-12 17:31:50,166 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.29 vs. 
limit=12.0 2023-10-12 17:31:55,067 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.346e+02 1.762e+02 1.909e+02 2.073e+02 3.129e+02, threshold=3.818e+02, percent-clipped=0.0 2023-10-12 17:32:05,229 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1130248.0, ans=0.125 2023-10-12 17:32:48,832 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.32 vs. limit=12.0 2023-10-12 17:32:50,664 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1130388.0, ans=0.125 2023-10-12 17:32:53,251 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1130388.0, ans=0.125 2023-10-12 17:33:19,970 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1130481.3333333333, ans=0.0 2023-10-12 17:33:25,450 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1130481.3333333333, ans=0.5 2023-10-12 17:34:01,436 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.73 vs. limit=22.5 2023-10-12 17:34:06,286 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1130668.0, ans=0.2 2023-10-12 17:34:07,912 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.836e+02 2.011e+02 2.219e+02 2.829e+02, threshold=4.023e+02, percent-clipped=0.0 2023-10-12 17:34:22,993 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1130761.3333333333, ans=0.1 2023-10-12 17:34:37,375 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.00 vs. limit=22.5 2023-10-12 17:34:48,099 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1130854.6666666667, ans=0.0 2023-10-12 17:34:48,915 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1130854.6666666667, ans=0.125 2023-10-12 17:34:54,923 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.46 vs. limit=15.0 2023-10-12 17:35:20,008 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1130948.0, ans=0.0 2023-10-12 17:35:25,902 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1130994.6666666667, ans=0.125 2023-10-12 17:35:35,805 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=6.03 vs. 
limit=15.0 2023-10-12 17:36:01,089 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1131088.0, ans=0.0 2023-10-12 17:36:10,657 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1131088.0, ans=0.125 2023-10-12 17:36:32,998 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.447e+02 1.725e+02 1.871e+02 2.081e+02 3.009e+02, threshold=3.741e+02, percent-clipped=0.0 2023-10-12 17:36:46,903 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1131181.3333333333, ans=0.2 2023-10-12 17:37:07,616 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1131274.6666666667, ans=0.0 2023-10-12 17:37:21,761 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1131321.3333333333, ans=0.125 2023-10-12 17:37:29,527 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.94 vs. limit=15.0 2023-10-12 17:37:50,845 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 17:38:05,033 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1131508.0, ans=0.125 2023-10-12 17:38:11,930 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=5.12 vs. limit=12.0 2023-10-12 17:38:13,583 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1131508.0, ans=0.125 2023-10-12 17:38:17,331 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1131554.6666666667, ans=0.1 2023-10-12 17:38:17,686 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.48 vs. limit=6.0 2023-10-12 17:38:24,123 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1131554.6666666667, ans=0.2 2023-10-12 17:38:31,660 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 17:38:34,788 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.330e+02 1.695e+02 1.908e+02 2.104e+02 3.368e+02, threshold=3.817e+02, percent-clipped=0.0 2023-10-12 17:38:38,617 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1131601.3333333333, ans=0.0 2023-10-12 17:38:46,489 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1131648.0, ans=0.125 2023-10-12 17:39:15,372 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1131788.0, ans=0.125 2023-10-12 17:39:39,928 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.51 vs. 
limit=15.0 2023-10-12 17:39:57,766 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1131928.0, ans=0.0 2023-10-12 17:39:59,624 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.87 vs. limit=15.0 2023-10-12 17:40:09,493 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.65 vs. limit=22.5 2023-10-12 17:40:20,510 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.08 vs. limit=22.5 2023-10-12 17:40:24,713 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1132021.3333333333, ans=0.125 2023-10-12 17:40:26,872 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.40 vs. limit=15.0 2023-10-12 17:40:30,038 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1132021.3333333333, ans=0.125 2023-10-12 17:40:37,074 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.335e+02 1.727e+02 1.836e+02 2.090e+02 2.772e+02, threshold=3.671e+02, percent-clipped=0.0 2023-10-12 17:40:41,961 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.35 vs. limit=15.0 2023-10-12 17:41:26,773 INFO [train.py:1031] (0/4) Epoch 18, batch 10500, loss[loss=0.189, simple_loss=0.2823, pruned_loss=0.04791, over 16923.00 frames. ], tot_loss[loss=0.1917, simple_loss=0.283, pruned_loss=0.05018, over 32667536.33 frames. ], batch size: 72, lr: 1.92e-03, grad_scale: 32.0 2023-10-12 17:41:30,485 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1132301.3333333333, ans=0.0 2023-10-12 17:41:36,843 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1132348.0, ans=0.125 2023-10-12 17:41:45,709 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1132348.0, ans=0.125 2023-10-12 17:42:38,278 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1132488.0, ans=0.0 2023-10-12 17:42:44,676 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.86 vs. limit=15.0 2023-10-12 17:42:48,411 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.733e+02 1.895e+02 2.115e+02 3.612e+02, threshold=3.790e+02, percent-clipped=0.0 2023-10-12 17:42:54,910 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.99 vs. 
limit=12.0 2023-10-12 17:43:31,942 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1132674.6666666667, ans=0.125 2023-10-12 17:43:52,526 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1132768.0, ans=0.05 2023-10-12 17:43:59,138 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1132814.6666666667, ans=0.125 2023-10-12 17:44:03,292 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-10-12 17:44:04,088 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1132814.6666666667, ans=0.0 2023-10-12 17:44:05,858 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1132814.6666666667, ans=0.125 2023-10-12 17:44:06,761 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1132814.6666666667, ans=0.0 2023-10-12 17:44:16,817 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=1132861.3333333333, ans=0.95 2023-10-12 17:44:18,056 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.88 vs. limit=10.0 2023-10-12 17:44:29,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1132908.0, ans=0.125 2023-10-12 17:44:38,979 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1132954.6666666667, ans=0.125 2023-10-12 17:44:51,121 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.688e+02 1.798e+02 1.960e+02 2.590e+02, threshold=3.597e+02, percent-clipped=0.0 2023-10-12 17:45:04,019 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1133048.0, ans=0.1 2023-10-12 17:45:05,296 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.57 vs. limit=15.0 2023-10-12 17:45:07,608 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1133094.6666666667, ans=0.125 2023-10-12 17:45:35,332 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=7.654e-03 2023-10-12 17:45:43,373 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.57 vs. limit=6.0 2023-10-12 17:45:53,152 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1133234.6666666667, ans=0.1 2023-10-12 17:45:57,971 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.18 vs. 
limit=15.0 2023-10-12 17:46:30,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1133374.6666666667, ans=0.125 2023-10-12 17:46:50,386 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1133468.0, ans=0.125 2023-10-12 17:46:57,932 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.502e+02 1.811e+02 2.005e+02 2.308e+02 3.295e+02, threshold=4.010e+02, percent-clipped=0.0 2023-10-12 17:47:02,289 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.13 vs. limit=6.0 2023-10-12 17:47:03,484 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1133514.6666666667, ans=0.125 2023-10-12 17:47:04,629 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1133514.6666666667, ans=0.125 2023-10-12 17:47:15,207 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1133561.3333333333, ans=0.0 2023-10-12 17:47:17,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1133561.3333333333, ans=0.125 2023-10-12 17:47:22,819 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1133561.3333333333, ans=0.125 2023-10-12 17:47:35,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1133654.6666666667, ans=0.0 2023-10-12 17:47:43,814 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1133654.6666666667, ans=0.0 2023-10-12 17:47:46,392 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1133654.6666666667, ans=0.0 2023-10-12 17:47:59,736 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1133748.0, ans=0.0 2023-10-12 17:48:04,916 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1133748.0, ans=0.125 2023-10-12 17:48:13,953 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1133794.6666666667, ans=0.0 2023-10-12 17:48:18,739 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.73 vs. 
limit=15.0 2023-10-12 17:48:42,088 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1133934.6666666667, ans=0.0 2023-10-12 17:48:47,188 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.541e+02 1.770e+02 2.007e+02 2.260e+02 3.002e+02, threshold=4.014e+02, percent-clipped=0.0 2023-10-12 17:48:56,319 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1133981.3333333333, ans=0.0 2023-10-12 17:49:09,826 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1133981.3333333333, ans=0.125 2023-10-12 17:49:22,573 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1134028.0, ans=0.1 2023-10-12 17:49:27,444 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1134074.6666666667, ans=0.125 2023-10-12 17:49:48,340 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1134168.0, ans=0.09899494936611666 2023-10-12 17:50:00,823 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1134214.6666666667, ans=0.125 2023-10-12 17:50:25,956 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1134308.0, ans=0.0 2023-10-12 17:50:38,357 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 17:50:46,142 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1134401.3333333333, ans=0.1 2023-10-12 17:50:47,005 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1134401.3333333333, ans=0.125 2023-10-12 17:50:49,436 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.326e+02 1.657e+02 1.876e+02 2.024e+02 3.344e+02, threshold=3.753e+02, percent-clipped=0.0 2023-10-12 17:50:54,512 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1134448.0, ans=0.07 2023-10-12 17:51:08,303 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1134494.6666666667, ans=0.125 2023-10-12 17:51:21,710 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1134541.3333333333, ans=0.125 2023-10-12 17:51:22,846 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 17:51:24,272 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.70 vs. limit=12.0 2023-10-12 17:51:37,454 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.21 vs. limit=15.0 2023-10-12 17:51:39,255 INFO [train.py:1031] (0/4) Epoch 18, batch 11000, loss[loss=0.1915, simple_loss=0.2837, pruned_loss=0.0497, over 16537.00 frames. ], tot_loss[loss=0.1918, simple_loss=0.2831, pruned_loss=0.05021, over 32729871.12 frames. 
], batch size: 50, lr: 1.91e-03, grad_scale: 32.0 2023-10-12 17:51:56,647 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1134681.3333333333, ans=0.125 2023-10-12 17:52:10,760 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1134774.6666666667, ans=0.0 2023-10-12 17:52:21,072 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1134774.6666666667, ans=0.0 2023-10-12 17:52:22,919 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1134821.3333333333, ans=0.125 2023-10-12 17:52:35,399 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1134868.0, ans=0.2 2023-10-12 17:52:42,306 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.527e+02 1.782e+02 1.975e+02 2.259e+02 3.388e+02, threshold=3.951e+02, percent-clipped=0.0 2023-10-12 17:52:42,777 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1134868.0, ans=0.125 2023-10-12 17:53:30,841 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.49 vs. limit=15.0 2023-10-12 17:53:48,650 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1135101.3333333333, ans=0.0 2023-10-12 17:54:04,892 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1135194.6666666667, ans=0.0 2023-10-12 17:54:10,827 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1135194.6666666667, ans=0.0 2023-10-12 17:54:25,704 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1135241.3333333333, ans=0.0 2023-10-12 17:54:50,660 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.639e+02 1.818e+02 1.970e+02 3.123e+02, threshold=3.636e+02, percent-clipped=0.0 2023-10-12 17:54:55,663 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.85 vs. limit=15.0 2023-10-12 17:55:17,018 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1135474.6666666667, ans=0.1 2023-10-12 17:55:24,208 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.79 vs. 
limit=15.0 2023-10-12 17:55:43,549 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1135568.0, ans=0.1 2023-10-12 17:55:45,638 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1135568.0, ans=0.04949747468305833 2023-10-12 17:55:49,641 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1135568.0, ans=0.125 2023-10-12 17:55:58,762 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1135614.6666666667, ans=0.0 2023-10-12 17:56:27,584 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.22 vs. limit=15.0 2023-10-12 17:56:29,842 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1135754.6666666667, ans=0.125 2023-10-12 17:56:37,833 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1135801.3333333333, ans=0.1 2023-10-12 17:56:41,689 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.690e+02 1.832e+02 2.010e+02 2.819e+02, threshold=3.664e+02, percent-clipped=0.0 2023-10-12 17:56:42,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1135801.3333333333, ans=0.1 2023-10-12 17:56:42,939 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1135801.3333333333, ans=0.0 2023-10-12 17:56:55,604 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1135848.0, ans=0.125 2023-10-12 17:56:58,607 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1135894.6666666667, ans=0.125 2023-10-12 17:57:06,894 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1135894.6666666667, ans=0.125 2023-10-12 17:57:16,705 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1135941.3333333333, ans=0.125 2023-10-12 17:57:28,152 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1135988.0, ans=0.0 2023-10-12 17:57:42,515 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1136034.6666666667, ans=0.125 2023-10-12 17:57:42,797 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.06 vs. 
limit=15.0 2023-10-12 17:57:52,749 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 17:58:00,001 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1136081.3333333333, ans=0.1 2023-10-12 17:58:16,750 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1136174.6666666667, ans=0.1 2023-10-12 17:58:32,354 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1136221.3333333333, ans=0.0 2023-10-12 17:58:35,975 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1136268.0, ans=0.125 2023-10-12 17:58:42,521 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.446e+02 1.738e+02 1.857e+02 2.075e+02 3.104e+02, threshold=3.714e+02, percent-clipped=0.0 2023-10-12 17:58:47,991 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1136314.6666666667, ans=0.2 2023-10-12 17:59:15,007 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1136408.0, ans=0.0 2023-10-12 17:59:32,204 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.97 vs. limit=22.5 2023-10-12 17:59:36,742 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.51 vs. limit=22.5 2023-10-12 17:59:38,710 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.48 vs. limit=15.0 2023-10-12 17:59:47,645 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1136548.0, ans=0.125 2023-10-12 17:59:52,634 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1136594.6666666667, ans=0.125 2023-10-12 17:59:55,728 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.54 vs. limit=15.0 2023-10-12 17:59:56,234 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1136594.6666666667, ans=0.125 2023-10-12 17:59:57,218 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1136594.6666666667, ans=0.0 2023-10-12 18:00:04,031 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1136641.3333333333, ans=0.2 2023-10-12 18:00:08,577 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1136641.3333333333, ans=0.1 2023-10-12 18:00:11,483 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1136641.3333333333, ans=0.05 2023-10-12 18:00:13,729 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.55 vs. 
limit=15.0 2023-10-12 18:00:26,639 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.34 vs. limit=15.0 2023-10-12 18:00:28,439 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1136734.6666666667, ans=0.125 2023-10-12 18:00:30,399 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1136734.6666666667, ans=0.2 2023-10-12 18:00:35,827 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.438e+02 1.809e+02 2.024e+02 2.403e+02 3.192e+02, threshold=4.048e+02, percent-clipped=0.0 2023-10-12 18:00:40,101 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1136781.3333333333, ans=0.125 2023-10-12 18:00:58,915 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1136828.0, ans=0.015 2023-10-12 18:01:27,028 INFO [train.py:1031] (0/4) Epoch 18, batch 11500, loss[loss=0.1868, simple_loss=0.2842, pruned_loss=0.04475, over 16490.00 frames. ], tot_loss[loss=0.1918, simple_loss=0.2831, pruned_loss=0.05025, over 32766828.00 frames. ], batch size: 266, lr: 1.91e-03, grad_scale: 32.0 2023-10-12 18:01:32,638 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1136968.0, ans=0.125 2023-10-12 18:01:33,620 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1136968.0, ans=0.1 2023-10-12 18:01:37,266 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1136968.0, ans=0.125 2023-10-12 18:01:40,964 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1137014.6666666667, ans=0.1 2023-10-12 18:01:55,831 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1137061.3333333333, ans=0.0 2023-10-12 18:02:00,171 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1137108.0, ans=0.1 2023-10-12 18:02:07,537 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1137108.0, ans=0.025 2023-10-12 18:02:09,142 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 18:02:33,183 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.735e+02 1.907e+02 2.151e+02 2.779e+02, threshold=3.814e+02, percent-clipped=0.0 2023-10-12 18:02:35,194 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1137201.3333333333, ans=0.125 2023-10-12 18:02:37,394 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1137248.0, ans=0.0 2023-10-12 18:02:40,558 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.21 vs. 
limit=22.5 2023-10-12 18:02:49,738 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1137294.6666666667, ans=0.125 2023-10-12 18:02:50,932 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1137294.6666666667, ans=0.125 2023-10-12 18:03:11,670 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1137341.3333333333, ans=0.125 2023-10-12 18:03:13,423 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1137388.0, ans=0.125 2023-10-12 18:03:58,812 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.18 vs. limit=22.5 2023-10-12 18:04:29,494 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 1.675e+02 1.787e+02 1.966e+02 2.902e+02, threshold=3.574e+02, percent-clipped=0.0 2023-10-12 18:04:32,607 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1137714.6666666667, ans=0.0 2023-10-12 18:05:06,229 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1137854.6666666667, ans=0.125 2023-10-12 18:05:07,629 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1137854.6666666667, ans=0.07 2023-10-12 18:05:13,767 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1137854.6666666667, ans=0.125 2023-10-12 18:06:01,605 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1138088.0, ans=0.125 2023-10-12 18:06:08,905 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.05 vs. 
limit=22.5 2023-10-12 18:06:22,445 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.476e+02 1.714e+02 1.887e+02 2.320e+02 2.923e+02, threshold=3.773e+02, percent-clipped=0.0 2023-10-12 18:06:42,250 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1138181.3333333333, ans=0.0 2023-10-12 18:06:48,970 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1138228.0, ans=0.025 2023-10-12 18:07:01,882 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1138274.6666666667, ans=0.95 2023-10-12 18:07:04,689 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1138274.6666666667, ans=0.125 2023-10-12 18:07:04,858 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1138274.6666666667, ans=0.125 2023-10-12 18:07:07,472 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1138274.6666666667, ans=0.125 2023-10-12 18:08:06,907 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1138554.6666666667, ans=0.0 2023-10-12 18:08:17,010 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1138554.6666666667, ans=0.1 2023-10-12 18:08:21,537 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1138601.3333333333, ans=0.125 2023-10-12 18:08:25,987 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1138601.3333333333, ans=0.1 2023-10-12 18:08:30,299 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.333e+02 1.715e+02 1.855e+02 2.111e+02 2.794e+02, threshold=3.710e+02, percent-clipped=0.0 2023-10-12 18:08:37,590 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.82 vs. limit=10.0 2023-10-12 18:08:43,608 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1138648.0, ans=0.0 2023-10-12 18:08:53,133 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.71 vs. limit=15.0 2023-10-12 18:08:54,940 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1138694.6666666667, ans=0.125 2023-10-12 18:09:03,406 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1138741.3333333333, ans=0.0 2023-10-12 18:09:24,610 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.23 vs. limit=15.0 2023-10-12 18:09:25,782 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.53 vs. 
limit=10.0 2023-10-12 18:09:28,743 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1138834.6666666667, ans=0.125 2023-10-12 18:09:31,805 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1138834.6666666667, ans=0.025 2023-10-12 18:09:47,550 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 18:09:52,183 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1138928.0, ans=0.125 2023-10-12 18:09:54,924 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1138928.0, ans=0.125 2023-10-12 18:09:55,254 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.07 vs. limit=10.0 2023-10-12 18:10:28,437 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.405e+02 1.760e+02 1.935e+02 2.106e+02 3.226e+02, threshold=3.870e+02, percent-clipped=0.0 2023-10-12 18:10:38,992 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1139114.6666666667, ans=0.0 2023-10-12 18:10:51,598 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1139161.3333333333, ans=0.125 2023-10-12 18:10:55,189 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1139208.0, ans=0.125 2023-10-12 18:11:11,363 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1139254.6666666667, ans=0.0 2023-10-12 18:11:19,516 INFO [train.py:1031] (0/4) Epoch 18, batch 12000, loss[loss=0.1852, simple_loss=0.2796, pruned_loss=0.04542, over 16726.00 frames. ], tot_loss[loss=0.1917, simple_loss=0.2832, pruned_loss=0.05004, over 32792439.78 frames. 
], batch size: 202, lr: 1.91e-03, grad_scale: 32.0 2023-10-12 18:11:22,326 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1139301.3333333333, ans=0.2 2023-10-12 18:11:30,878 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1139301.3333333333, ans=0.125 2023-10-12 18:11:32,819 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1139348.0, ans=0.1 2023-10-12 18:11:37,921 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1139348.0, ans=0.125 2023-10-12 18:11:50,926 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1139394.6666666667, ans=0.125 2023-10-12 18:12:24,091 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.414e+02 1.684e+02 1.843e+02 2.039e+02 3.682e+02, threshold=3.687e+02, percent-clipped=0.0 2023-10-12 18:12:29,605 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1139581.3333333333, ans=0.125 2023-10-12 18:12:41,116 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-12 18:13:29,898 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1139814.6666666667, ans=0.09899494936611666 2023-10-12 18:13:33,435 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=1139814.6666666667, ans=0.02 2023-10-12 18:13:36,579 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.41 vs. limit=22.5 2023-10-12 18:13:49,381 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.03 vs. limit=15.0 2023-10-12 18:13:52,000 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1139908.0, ans=0.5 2023-10-12 18:13:55,133 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.68 vs. limit=22.5 2023-10-12 18:13:55,853 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1139908.0, ans=0.025 2023-10-12 18:13:59,627 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.32 vs. 
limit=15.0 2023-10-12 18:14:13,445 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1140001.3333333333, ans=0.0 2023-10-12 18:14:29,437 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.301e+02 1.712e+02 1.882e+02 2.059e+02 3.013e+02, threshold=3.765e+02, percent-clipped=0.0 2023-10-12 18:14:44,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1140048.0, ans=0.125 2023-10-12 18:15:02,362 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1140094.6666666667, ans=0.04949747468305833 2023-10-12 18:15:25,899 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1140188.0, ans=0.125 2023-10-12 18:15:31,154 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1140188.0, ans=0.0 2023-10-12 18:15:33,932 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1140234.6666666667, ans=0.2 2023-10-12 18:15:59,686 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1140328.0, ans=0.125 2023-10-12 18:16:08,479 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 18:16:13,043 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.20 vs. limit=22.5 2023-10-12 18:16:14,571 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1140374.6666666667, ans=0.125 2023-10-12 18:16:25,557 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.88 vs. limit=10.0 2023-10-12 18:16:34,366 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 18:16:38,385 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.471e+02 1.739e+02 1.922e+02 2.081e+02 2.744e+02, threshold=3.844e+02, percent-clipped=0.0 2023-10-12 18:16:53,486 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1140561.3333333333, ans=0.125 2023-10-12 18:16:54,787 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1140561.3333333333, ans=0.0 2023-10-12 18:16:57,669 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.55 vs. limit=15.0 2023-10-12 18:17:07,878 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.96 vs. limit=15.0 2023-10-12 18:17:11,004 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.64 vs. 
limit=15.0 2023-10-12 18:17:36,908 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1140701.3333333333, ans=0.125 2023-10-12 18:17:50,689 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1140794.6666666667, ans=0.0 2023-10-12 18:17:53,946 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1140794.6666666667, ans=0.0 2023-10-12 18:17:55,162 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.76 vs. limit=15.0 2023-10-12 18:18:14,295 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1140888.0, ans=0.2 2023-10-12 18:18:17,769 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1140888.0, ans=0.125 2023-10-12 18:18:21,708 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1140888.0, ans=0.0 2023-10-12 18:18:33,046 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 1.787e+02 1.953e+02 2.241e+02 3.560e+02, threshold=3.905e+02, percent-clipped=0.0 2023-10-12 18:18:50,443 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1140981.3333333333, ans=0.1 2023-10-12 18:19:13,706 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1141121.3333333333, ans=0.2 2023-10-12 18:20:04,592 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1141308.0, ans=0.2 2023-10-12 18:20:17,796 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1141354.6666666667, ans=0.0 2023-10-12 18:20:18,277 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=8.23 vs. 
limit=15.0 2023-10-12 18:20:22,104 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1141401.3333333333, ans=0.125 2023-10-12 18:20:27,246 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.550e+02 1.714e+02 1.904e+02 2.053e+02 3.023e+02, threshold=3.809e+02, percent-clipped=0.0 2023-10-12 18:20:33,255 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1141448.0, ans=0.0 2023-10-12 18:20:38,768 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1141448.0, ans=0.0 2023-10-12 18:20:40,967 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1141448.0, ans=0.1 2023-10-12 18:20:59,133 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1141541.3333333333, ans=0.0 2023-10-12 18:21:09,177 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1141588.0, ans=0.125 2023-10-12 18:21:10,477 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1141588.0, ans=0.2 2023-10-12 18:21:17,788 INFO [train.py:1031] (0/4) Epoch 18, batch 12500, loss[loss=0.2247, simple_loss=0.3125, pruned_loss=0.06844, over 16630.00 frames. ], tot_loss[loss=0.1914, simple_loss=0.2828, pruned_loss=0.05005, over 32772687.88 frames. ], batch size: 56, lr: 1.91e-03, grad_scale: 32.0 2023-10-12 18:21:40,299 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.55 vs. limit=12.0 2023-10-12 18:21:42,058 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1141728.0, ans=0.0 2023-10-12 18:21:44,583 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1141728.0, ans=0.0 2023-10-12 18:21:46,296 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1141728.0, ans=0.125 2023-10-12 18:21:46,322 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1141728.0, ans=0.0 2023-10-12 18:21:53,733 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1141774.6666666667, ans=0.125 2023-10-12 18:22:19,697 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.838e+02 2.086e+02 2.392e+02 3.402e+02, threshold=4.172e+02, percent-clipped=0.0 2023-10-12 18:22:20,058 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1141868.0, ans=0.125 2023-10-12 18:22:26,281 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1141914.6666666667, ans=0.0 2023-10-12 18:22:34,043 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.93 vs. limit=22.5
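[Editor's note] In the optim.py:471 entries, the five numbers after "grad-norm quartiles" read as the min, 25th, 50th, 75th percentile and max of recently observed gradient norms, and the reported threshold consistently equals Clipping_scale (2.0) times the median; the entry just above has 2.0 x 2.086e+02 = 4.172e+02 exactly. percent-clipped is then the share of batches whose norm exceeded the threshold. A minimal sketch of that statistic, assuming exactly this median-based rule:

    import torch

    def grad_norm_report(recent_norms: torch.Tensor, clipping_scale: float = 2.0):
        # Five-point summary of recent per-batch gradient norms.
        q = torch.quantile(recent_norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = clipping_scale * q[2]  # 2.0 x median, matching the log
        pct = 100.0 * (recent_norms > threshold).float().mean()
        return q, threshold, pct

    norms = torch.tensor([140.9, 183.8, 208.6, 239.2, 340.2])
    q, thr, pct = grad_norm_report(norms)
    print(thr.item(), pct.item())  # ~417.2 and 0.0, as in the entry above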
2023-10-12 18:22:50,421 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1142008.0, ans=0.125 2023-10-12 18:23:17,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1142148.0, ans=0.125 2023-10-12 18:23:27,501 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=1142194.6666666667, ans=0.025 2023-10-12 18:23:51,378 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1142288.0, ans=0.2 2023-10-12 18:23:53,193 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1142288.0, ans=0.125 2023-10-12 18:24:01,971 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1142334.6666666667, ans=0.125 2023-10-12 18:24:12,500 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.382e+02 1.712e+02 1.913e+02 2.162e+02 2.701e+02, threshold=3.825e+02, percent-clipped=0.0 2023-10-12 18:24:16,188 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.04 vs. limit=15.0 2023-10-12 18:24:20,771 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1142381.3333333333, ans=0.2 2023-10-12 18:24:21,725 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1142381.3333333333, ans=0.0 2023-10-12 18:24:31,623 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1142428.0, ans=0.0 2023-10-12 18:24:35,958 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.13 vs. limit=15.0 2023-10-12 18:24:52,023 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1142474.6666666667, ans=0.125 2023-10-12 18:25:09,883 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1142568.0, ans=0.0 2023-10-12 18:25:13,780 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1142568.0, ans=0.0 2023-10-12 18:25:15,902 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.88 vs. limit=6.0 2023-10-12 18:25:16,952 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.87 vs. limit=15.0 2023-10-12 18:25:39,015 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.70 vs.
limit=6.0 2023-10-12 18:25:42,177 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 18:25:58,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1142754.6666666667, ans=0.125 2023-10-12 18:26:09,877 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.348e+02 1.728e+02 1.889e+02 2.100e+02 2.775e+02, threshold=3.778e+02, percent-clipped=0.0 2023-10-12 18:26:19,367 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=5.49 vs. limit=15.0 2023-10-12 18:26:21,951 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.84 vs. limit=15.0 2023-10-12 18:26:34,765 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.99 vs. limit=12.0 2023-10-12 18:26:38,082 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=10.78 vs. limit=12.0 2023-10-12 18:26:42,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1142941.3333333333, ans=0.1 2023-10-12 18:26:48,986 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1142988.0, ans=0.125 2023-10-12 18:26:49,898 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1142988.0, ans=0.035 2023-10-12 18:26:57,980 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 18:26:59,006 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1143034.6666666667, ans=0.0 2023-10-12 18:27:17,312 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.50 vs. limit=22.5 2023-10-12 18:27:33,616 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1143174.6666666667, ans=0.1 2023-10-12 18:27:51,140 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.57 vs. limit=12.0 2023-10-12 18:28:02,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1143268.0, ans=0.125 2023-10-12 18:28:03,180 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.26 vs. 
limit=15.0 2023-10-12 18:28:05,386 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.496e+02 1.725e+02 1.888e+02 2.146e+02 3.261e+02, threshold=3.777e+02, percent-clipped=0.0 2023-10-12 18:28:33,934 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1143408.0, ans=0.125 2023-10-12 18:28:37,855 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1143408.0, ans=0.125 2023-10-12 18:28:46,676 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.78 vs. limit=15.0 2023-10-12 18:28:59,220 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1143501.3333333333, ans=22.5 2023-10-12 18:29:01,783 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.37 vs. limit=12.0 2023-10-12 18:29:10,691 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1143548.0, ans=0.0 2023-10-12 18:29:34,189 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1143641.3333333333, ans=0.0 2023-10-12 18:29:40,972 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1143641.3333333333, ans=0.1 2023-10-12 18:29:55,487 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1143734.6666666667, ans=0.125 2023-10-12 18:30:03,754 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.739e+02 1.923e+02 2.098e+02 3.034e+02, threshold=3.847e+02, percent-clipped=0.0 2023-10-12 18:30:25,865 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1143828.0, ans=0.125 2023-10-12 18:30:28,302 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1143874.6666666667, ans=0.09899494936611666 2023-10-12 18:30:40,707 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1143921.3333333333, ans=15.0 2023-10-12 18:30:41,334 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1143921.3333333333, ans=0.1 2023-10-12 18:30:48,220 INFO [train.py:1031] (0/4) Epoch 18, batch 13000, loss[loss=0.2005, simple_loss=0.2923, pruned_loss=0.05437, over 16828.00 frames. ], tot_loss[loss=0.1917, simple_loss=0.2833, pruned_loss=0.05002, over 32793248.19 frames. ], batch size: 146, lr: 1.91e-03, grad_scale: 32.0
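[Editor's note] Each train.py:1031 entry reports the current batch (loss[...]) and a frames-weighted running average (tot_loss[...]). Throughout this stretch the three components satisfy loss = 0.5 * simple_loss + pruned_loss, e.g. 0.5 * 0.2923 + 0.05437 = 0.2005 for the batch just logged, which is the usual shape of a pruned-transducer objective once past warm-up (the 0.5 weight is inferred from the logged numbers, not read from the code). A hypothetical re-check:

    def combined_loss(simple_loss: float, pruned_loss: float,
                      simple_scale: float = 0.5) -> float:
        # Weighted sum reproducing the totals in the train.py:1031 entries.
        return simple_scale * simple_loss + pruned_loss

    # Batch and running averages from the "Epoch 18, batch 13000" entry above:
    assert abs(combined_loss(0.2923, 0.05437) - 0.2005) < 5e-4
    assert abs(combined_loss(0.2833, 0.05002) - 0.1917) < 5e-4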
2023-10-12 18:32:05,889 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.467e+02 1.742e+02 1.891e+02 2.120e+02 3.097e+02, threshold=3.781e+02, percent-clipped=0.0 2023-10-12 18:32:10,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1144248.0, ans=0.0 2023-10-12 18:32:19,975 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1144294.6666666667, ans=0.125 2023-10-12 18:32:22,426 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1144294.6666666667, ans=0.1 2023-10-12 18:32:38,647 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.67 vs. limit=15.0 2023-10-12 18:32:41,679 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1144341.3333333333, ans=0.1 2023-10-12 18:33:02,585 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1144434.6666666667, ans=0.125 2023-10-12 18:33:03,593 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1144481.3333333333, ans=0.125 2023-10-12 18:33:39,174 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1144621.3333333333, ans=0.0 2023-10-12 18:33:42,752 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1144621.3333333333, ans=0.0 2023-10-12 18:34:00,543 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.402e+02 1.658e+02 1.854e+02 2.045e+02 3.100e+02, threshold=3.708e+02, percent-clipped=0.0 2023-10-12 18:34:04,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1144714.6666666667, ans=0.125 2023-10-12 18:34:11,720 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1144714.6666666667, ans=0.0 2023-10-12 18:34:17,253 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1144761.3333333333, ans=0.1 2023-10-12 18:34:25,460 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1144761.3333333333, ans=0.125 2023-10-12 18:34:30,499 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1144808.0, ans=0.125 2023-10-12 18:34:34,445 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.98 vs.
limit=15.0 2023-10-12 18:34:35,667 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1144808.0, ans=0.125 2023-10-12 18:34:48,413 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1144854.6666666667, ans=0.2 2023-10-12 18:34:48,437 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1144854.6666666667, ans=0.125 2023-10-12 18:34:52,814 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1144854.6666666667, ans=0.125 2023-10-12 18:35:10,178 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1144948.0, ans=0.125 2023-10-12 18:35:12,947 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1144948.0, ans=0.025 2023-10-12 18:35:27,987 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1144994.6666666667, ans=0.125 2023-10-12 18:35:30,688 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.17 vs. limit=15.0 2023-10-12 18:35:42,278 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1145041.3333333333, ans=0.125 2023-10-12 18:35:49,761 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.91 vs. limit=15.0 2023-10-12 18:36:00,037 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.57 vs. 
limit=15.0 2023-10-12 18:36:01,375 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1145134.6666666667, ans=0.125 2023-10-12 18:36:04,270 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.388e+02 1.699e+02 1.892e+02 2.071e+02 2.784e+02, threshold=3.785e+02, percent-clipped=0.0 2023-10-12 18:36:06,520 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1145181.3333333333, ans=0.125 2023-10-12 18:36:11,084 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 18:36:22,909 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1145228.0, ans=0.1 2023-10-12 18:36:29,820 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1145274.6666666667, ans=0.125 2023-10-12 18:36:34,108 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1145274.6666666667, ans=0.1 2023-10-12 18:36:38,433 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1145274.6666666667, ans=0.0 2023-10-12 18:36:53,196 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1145368.0, ans=0.125 2023-10-12 18:37:00,909 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1145368.0, ans=0.0 2023-10-12 18:37:11,025 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1145414.6666666667, ans=0.125 2023-10-12 18:37:34,936 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.42 vs. limit=22.5 2023-10-12 18:37:40,351 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1145554.6666666667, ans=0.0 2023-10-12 18:37:58,220 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.739e+02 1.888e+02 2.031e+02 3.036e+02, threshold=3.777e+02, percent-clipped=0.0 2023-10-12 18:38:11,365 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.11 vs. limit=22.5 2023-10-12 18:38:24,212 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.98 vs. limit=12.0 2023-10-12 18:38:27,349 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1145741.3333333333, ans=0.2 2023-10-12 18:38:28,111 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1145741.3333333333, ans=0.2 2023-10-12 18:38:34,502 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1145788.0, ans=0.125 2023-10-12 18:38:51,995 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.29 vs. 
limit=22.5 2023-10-12 18:38:52,664 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1145834.6666666667, ans=0.1 2023-10-12 18:39:03,094 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.25 vs. limit=12.0 2023-10-12 18:39:20,841 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1145974.6666666667, ans=0.07 2023-10-12 18:39:41,963 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.33 vs. limit=15.0 2023-10-12 18:39:44,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1146068.0, ans=0.0 2023-10-12 18:39:50,327 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 1.770e+02 1.936e+02 2.172e+02 3.199e+02, threshold=3.872e+02, percent-clipped=0.0 2023-10-12 18:39:57,893 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1146114.6666666667, ans=0.0 2023-10-12 18:40:04,910 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.89 vs. limit=10.0 2023-10-12 18:40:16,552 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.22 vs. limit=12.0 2023-10-12 18:40:23,859 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1146254.6666666667, ans=0.125 2023-10-12 18:40:33,685 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.03 vs. limit=15.0 2023-10-12 18:40:35,988 INFO [train.py:1031] (0/4) Epoch 18, batch 13500, loss[loss=0.1749, simple_loss=0.275, pruned_loss=0.03739, over 16957.00 frames. ], tot_loss[loss=0.1912, simple_loss=0.2826, pruned_loss=0.04985, over 32789178.19 frames. 
], batch size: 93, lr: 1.91e-03, grad_scale: 32.0 2023-10-12 18:40:42,217 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1146301.3333333333, ans=0.125 2023-10-12 18:40:54,076 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 18:41:03,178 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1146394.6666666667, ans=0.125 2023-10-12 18:41:21,504 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1146488.0, ans=0.0 2023-10-12 18:41:27,099 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1146488.0, ans=0.125 2023-10-12 18:41:41,595 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.398e+02 1.736e+02 1.972e+02 2.196e+02 3.311e+02, threshold=3.944e+02, percent-clipped=0.0 2023-10-12 18:41:47,628 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1146581.3333333333, ans=0.125 2023-10-12 18:41:48,505 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1146581.3333333333, ans=0.125 2023-10-12 18:41:55,742 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.45 vs. limit=10.0 2023-10-12 18:41:59,881 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1146628.0, ans=0.0 2023-10-12 18:42:52,574 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1146861.3333333333, ans=0.125 2023-10-12 18:42:54,760 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1146861.3333333333, ans=0.0 2023-10-12 18:43:01,438 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.43 vs. limit=6.0 2023-10-12 18:43:21,246 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/epoch-18.pt 2023-10-12 18:43:55,200 INFO [train.py:1031] (0/4) Epoch 19, batch 0, loss[loss=0.179, simple_loss=0.2733, pruned_loss=0.0424, over 16622.00 frames. ], tot_loss[loss=0.179, simple_loss=0.2733, pruned_loss=0.0424, over 16622.00 frames. ], batch size: 219, lr: 1.85e-03, grad_scale: 32.0 2023-10-12 18:43:55,201 INFO [train.py:1054] (0/4) Computing validation loss 2023-10-12 18:44:03,017 INFO [train.py:1063] (0/4) Epoch 19, validation: loss=0.2139, simple_loss=0.301, pruned_loss=0.06343, over 1020973.00 frames. 
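[Editor's note] The block above closes epoch 18 (checkpoint written to zipformer/exp_XL_bpe/epoch-18.pt) and opens epoch 19, where the learning rate steps down from 1.91e-03 to 1.85e-03 and a validation pass runs first. Every validation in this log covers the same 1020973.00 dev frames, so the reported losses are directly comparable across epochs. A sketch of the frame-weighted aggregation presumably behind that single figure; model, dev_loader and compute_loss are placeholders, not icefall names:

    import torch

    def validate(model, dev_loader, compute_loss):
        model.eval()
        tot_loss, tot_frames = 0.0, 0.0
        with torch.no_grad():
            for batch in dev_loader:
                # compute_loss returns the loss summed over the batch and the
                # number of frames that batch covers.
                loss_sum, num_frames = compute_loss(model, batch)
                tot_loss += float(loss_sum)
                tot_frames += float(num_frames)
        # One frame-normalized number, e.g. "validation: loss=0.2139 ...
        # over 1020973.00 frames."
        return tot_loss / tot_frames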
2023-10-12 18:44:03,018 INFO [train.py:1064] (0/4) Maximum memory allocated so far is 17165MB 2023-10-12 18:44:06,609 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1147024.6666666667, ans=0.125 2023-10-12 18:44:07,086 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.462e+02 1.748e+02 1.918e+02 2.200e+02 3.068e+02, threshold=3.835e+02, percent-clipped=0.0 2023-10-12 18:44:45,459 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1147164.6666666667, ans=0.125 2023-10-12 18:44:54,214 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.94 vs. limit=12.0 2023-10-12 18:45:02,401 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1147258.0, ans=0.1 2023-10-12 18:45:12,209 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1147304.6666666667, ans=0.04949747468305833 2023-10-12 18:45:12,217 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1147304.6666666667, ans=0.0 2023-10-12 18:45:24,148 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1147304.6666666667, ans=0.125 2023-10-12 18:45:24,981 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1147351.3333333333, ans=0.125 2023-10-12 18:45:59,229 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.41 vs. limit=12.0 2023-10-12 18:45:59,809 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1147491.3333333333, ans=0.035 2023-10-12 18:46:03,165 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.654e+02 1.823e+02 1.995e+02 2.699e+02, threshold=3.645e+02, percent-clipped=0.0 2023-10-12 18:46:05,646 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.43 vs. limit=15.0 2023-10-12 18:46:17,478 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1147538.0, ans=0.0 2023-10-12 18:46:18,793 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.73 vs. limit=15.0 2023-10-12 18:46:23,040 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1147584.6666666667, ans=6.0 2023-10-12 18:46:24,274 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1147584.6666666667, ans=0.2 2023-10-12 18:46:36,259 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.89 vs. 
limit=15.0 2023-10-12 18:46:39,817 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1147631.3333333333, ans=0.1 2023-10-12 18:47:00,512 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1147724.6666666667, ans=0.1 2023-10-12 18:47:09,285 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1147771.3333333333, ans=0.2 2023-10-12 18:47:10,316 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1147771.3333333333, ans=0.125 2023-10-12 18:47:46,081 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1147911.3333333333, ans=0.125 2023-10-12 18:47:46,897 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1147958.0, ans=0.125 2023-10-12 18:47:52,785 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.53 vs. limit=15.0 2023-10-12 18:47:54,010 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.388e+02 1.775e+02 1.952e+02 2.207e+02 2.966e+02, threshold=3.904e+02, percent-clipped=0.0 2023-10-12 18:47:54,499 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1147958.0, ans=0.125 2023-10-12 18:47:55,804 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1147958.0, ans=0.1 2023-10-12 18:48:03,277 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1148004.6666666667, ans=0.125 2023-10-12 18:48:03,400 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1148004.6666666667, ans=0.04949747468305833 2023-10-12 18:48:32,184 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1148098.0, ans=0.2 2023-10-12 18:48:59,431 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1148238.0, ans=0.125 2023-10-12 18:49:01,360 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1148238.0, ans=0.125 2023-10-12 18:49:03,682 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.18 vs. 
limit=15.0 2023-10-12 18:49:06,359 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1148238.0, ans=0.0 2023-10-12 18:49:36,197 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1148378.0, ans=0.2 2023-10-12 18:49:39,636 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1148378.0, ans=0.1 2023-10-12 18:49:41,719 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1148378.0, ans=0.125 2023-10-12 18:49:44,439 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1148424.6666666667, ans=0.0 2023-10-12 18:49:46,955 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1148424.6666666667, ans=0.125 2023-10-12 18:49:48,364 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.669e+02 1.883e+02 2.102e+02 2.824e+02, threshold=3.767e+02, percent-clipped=0.0 2023-10-12 18:49:54,631 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1148471.3333333333, ans=0.125 2023-10-12 18:50:14,727 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1148518.0, ans=0.125 2023-10-12 18:50:23,247 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1148564.6666666667, ans=0.2 2023-10-12 18:51:24,990 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.64 vs. limit=6.0 2023-10-12 18:51:26,162 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.91 vs. limit=15.0 2023-10-12 18:51:27,901 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.40 vs. limit=15.0 2023-10-12 18:51:29,634 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1148891.3333333333, ans=0.1 2023-10-12 18:51:29,921 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.77 vs. limit=15.0 2023-10-12 18:51:35,488 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.473e+02 1.774e+02 1.930e+02 2.181e+02 3.236e+02, threshold=3.860e+02, percent-clipped=0.0 2023-10-12 18:52:23,351 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.95 vs. limit=6.0 2023-10-12 18:52:34,163 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.05 vs. 
limit=22.5 2023-10-12 18:52:40,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1149171.3333333333, ans=0.0 2023-10-12 18:52:57,070 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1149218.0, ans=0.1 2023-10-12 18:52:59,197 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1149218.0, ans=0.125 2023-10-12 18:53:01,793 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1149264.6666666667, ans=0.0 2023-10-12 18:53:11,138 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.26 vs. limit=22.5 2023-10-12 18:53:12,081 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1149264.6666666667, ans=0.125 2023-10-12 18:53:15,208 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1149311.3333333333, ans=0.0 2023-10-12 18:53:25,932 INFO [train.py:1031] (0/4) Epoch 19, batch 500, loss[loss=0.2001, simple_loss=0.2571, pruned_loss=0.07156, over 12451.00 frames. ], tot_loss[loss=0.1915, simple_loss=0.2828, pruned_loss=0.05005, over 7274207.98 frames. ], batch size: 440, lr: 1.85e-03, grad_scale: 16.0 2023-10-12 18:53:31,540 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1149358.0, ans=0.1 2023-10-12 18:53:32,025 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 1.904e+02 2.168e+02 2.566e+02 3.752e+02, threshold=4.337e+02, percent-clipped=0.0 2023-10-12 18:53:39,239 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1149404.6666666667, ans=0.035 2023-10-12 18:53:46,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1149404.6666666667, ans=0.2 2023-10-12 18:53:57,712 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.66 vs. limit=15.0
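[Editor's note] The scaling.py:979 Whitening entries compare a measured anisotropy of a module's activations ("metric") against an allowed ceiling ("limit"); metric=11.66 vs. limit=15.0 just above means the feed_forward3 output covariance is still within bounds. One way to define such a metric, and this is an assumed formulation rather than the one in scaling.py, is the mean squared eigenvalue of the covariance divided by the squared mean eigenvalue: exactly 1.0 for perfectly white features, growing as channels become correlated or unevenly scaled.

    import torch

    def whitening_metric(x: torch.Tensor) -> float:
        # x: (num_frames, num_channels) activations from one module.
        x = x - x.mean(dim=0)
        cov = (x.T @ x) / x.shape[0]
        eig = torch.linalg.eigvalsh(cov)
        # Mean squared eigenvalue over squared mean eigenvalue; always >= 1.0.
        return float((eig * eig).mean() / eig.mean().clamp(min=1e-20) ** 2)

    white = torch.randn(8000, 192)
    skewed = white.clone()
    skewed[:, 0] *= 8.0  # one dominant channel breaks whiteness
    print(whitening_metric(white))   # close to 1.0
    print(whitening_metric(skewed))  # order of 10, the regime logged above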
2023-10-12 18:54:00,153 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1149498.0, ans=0.0 2023-10-12 18:54:10,330 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1149544.6666666667, ans=0.07 2023-10-12 18:54:23,808 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1149591.3333333333, ans=0.125 2023-10-12 18:54:39,712 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1149638.0, ans=0.125 2023-10-12 18:54:39,784 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1149638.0, ans=0.2 2023-10-12 18:54:48,991 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1149684.6666666667, ans=0.125 2023-10-12 18:55:10,574 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 18:55:10,971 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.88 vs. limit=15.0 2023-10-12 18:55:12,918 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1149778.0, ans=0.2 2023-10-12 18:55:19,138 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1149778.0, ans=0.05 2023-10-12 18:55:19,163 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1149778.0, ans=0.125 2023-10-12 18:55:20,652 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.52 vs. limit=10.0 2023-10-12 18:55:28,238 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.484e+02 1.762e+02 1.890e+02 2.119e+02 2.858e+02, threshold=3.781e+02, percent-clipped=0.0 2023-10-12 18:55:47,392 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.53 vs. limit=22.5 2023-10-12 18:56:25,106 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.27 vs. limit=10.0 2023-10-12 18:56:44,215 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.75 vs. limit=22.5 2023-10-12 18:57:01,770 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.28 vs. limit=15.0 2023-10-12 18:57:02,222 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1150198.0, ans=0.0 2023-10-12 18:57:07,272 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.86 vs.
limit=22.5 2023-10-12 18:57:13,681 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=1150244.6666666667, ans=0.05 2023-10-12 18:57:21,358 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.829e+02 2.035e+02 2.382e+02 3.400e+02, threshold=4.070e+02, percent-clipped=0.0 2023-10-12 18:57:33,051 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1150338.0, ans=0.0 2023-10-12 18:57:33,070 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1150338.0, ans=0.125 2023-10-12 18:57:38,726 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1150384.6666666667, ans=0.04949747468305833 2023-10-12 18:57:41,012 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1150384.6666666667, ans=0.2 2023-10-12 18:58:14,123 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.63 vs. limit=10.0 2023-10-12 18:58:21,504 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1150524.6666666667, ans=0.125 2023-10-12 18:58:32,409 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1150571.3333333333, ans=0.125 2023-10-12 18:58:43,100 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1150618.0, ans=0.125 2023-10-12 18:58:46,861 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1150664.6666666667, ans=0.125 2023-10-12 18:58:48,894 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.23 vs. limit=15.0 2023-10-12 18:59:00,358 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1150711.3333333333, ans=0.2 2023-10-12 18:59:15,654 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.496e+02 1.768e+02 1.950e+02 2.164e+02 2.971e+02, threshold=3.901e+02, percent-clipped=0.0 2023-10-12 18:59:19,273 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.29 vs. 
limit=15.0 2023-10-12 18:59:21,890 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1150804.6666666667, ans=0.125 2023-10-12 18:59:55,701 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 18:59:59,819 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1150898.0, ans=0.0 2023-10-12 19:00:01,794 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1150944.6666666667, ans=0.0 2023-10-12 19:00:06,237 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1150944.6666666667, ans=0.125 2023-10-12 19:00:08,117 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1150944.6666666667, ans=0.125 2023-10-12 19:00:25,662 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1151038.0, ans=0.1 2023-10-12 19:00:48,849 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1151131.3333333333, ans=0.2 2023-10-12 19:00:57,128 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1151178.0, ans=0.125 2023-10-12 19:01:02,796 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1151178.0, ans=0.0 2023-10-12 19:01:18,215 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.744e+02 1.908e+02 2.110e+02 3.156e+02, threshold=3.815e+02, percent-clipped=0.0 2023-10-12 19:01:19,518 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1151271.3333333333, ans=0.0 2023-10-12 19:01:22,481 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1151271.3333333333, ans=0.125 2023-10-12 19:01:22,838 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1151271.3333333333, ans=15.0 2023-10-12 19:01:24,048 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.53 vs. limit=15.0 2023-10-12 19:01:26,070 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1151271.3333333333, ans=0.125 2023-10-12 19:01:28,670 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1151271.3333333333, ans=0.125 2023-10-12 19:01:37,194 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.56 vs. 
limit=12.0 2023-10-12 19:01:39,132 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1151318.0, ans=0.1 2023-10-12 19:01:39,151 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1151318.0, ans=0.125 2023-10-12 19:02:08,589 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.93 vs. limit=6.0 2023-10-12 19:02:14,883 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1151458.0, ans=0.0 2023-10-12 19:02:25,725 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1151504.6666666667, ans=0.0 2023-10-12 19:02:26,535 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1151551.3333333333, ans=0.0 2023-10-12 19:02:33,068 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1151551.3333333333, ans=0.125 2023-10-12 19:02:45,616 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1151598.0, ans=0.125 2023-10-12 19:02:56,682 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1151644.6666666667, ans=0.125 2023-10-12 19:03:00,227 INFO [train.py:1031] (0/4) Epoch 19, batch 1000, loss[loss=0.1963, simple_loss=0.2823, pruned_loss=0.05518, over 16871.00 frames. ], tot_loss[loss=0.1928, simple_loss=0.284, pruned_loss=0.05082, over 12915129.16 frames. ], batch size: 77, lr: 1.85e-03, grad_scale: 8.0
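[Editor's note] grad_scale in the train.py:1031 entries is the fp16 loss-scaling factor, and it has stepped 32.0 -> 16.0 -> 8.0 over this stretch: the signature of a dynamic scaler halving after batches whose gradients overflow in half precision. Assuming a standard torch AMP loop (the icefall trainer may manage the scale itself), the logged value comes from code shaped like:

    import torch

    scaler = torch.cuda.amp.GradScaler(init_scale=32.0)

    def fp16_step(model, optimizer, batch, loss_fn):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = loss_fn(model, batch)
        scaler.scale(loss).backward()  # backward through the scaled loss
        scaler.step(optimizer)         # unscales grads; skips step on inf/nan
        scaler.update()                # halves the scale after an overflow
        return scaler.get_scale()      # the number logged as grad_scale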
2023-10-12 19:03:05,811 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1151691.3333333333, ans=0.2 2023-10-12 19:03:07,502 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.417e+02 1.693e+02 1.886e+02 2.082e+02 2.751e+02, threshold=3.772e+02, percent-clipped=0.0 2023-10-12 19:03:15,459 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1151738.0, ans=0.0 2023-10-12 19:03:51,814 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1151924.6666666667, ans=0.1 2023-10-12 19:03:53,455 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1151924.6666666667, ans=0.2 2023-10-12 19:03:58,001 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1151924.6666666667, ans=0.125 2023-10-12 19:03:58,853 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1151924.6666666667, ans=0.0 2023-10-12 19:04:09,827 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 19:04:22,804 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1152064.6666666667, ans=0.1 2023-10-12 19:04:35,859 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.81 vs. limit=8.0 2023-10-12 19:04:55,894 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.845e+02 2.010e+02 2.263e+02 2.871e+02, threshold=4.019e+02, percent-clipped=0.0 2023-10-12 19:05:11,911 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-12 19:05:21,828 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1152251.3333333333, ans=0.125 2023-10-12 19:06:27,300 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1152484.6666666667, ans=0.1 2023-10-12 19:06:35,806 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1152531.3333333333, ans=0.0 2023-10-12 19:06:42,097 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1152578.0, ans=0.2 2023-10-12 19:06:54,716 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1152624.6666666667, ans=0.2 2023-10-12 19:06:59,457 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1152624.6666666667, ans=0.125 2023-10-12 19:07:01,945 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1152624.6666666667, ans=0.1 2023-10-12 19:07:02,439 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 1.710e+02 1.940e+02 2.240e+02 3.269e+02, threshold=3.881e+02, percent-clipped=0.0 2023-10-12 19:07:33,527 INFO [scaling.py:979] (0/4) Whitening:
name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.36 vs. limit=15.0 2023-10-12 19:07:39,273 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1152811.3333333333, ans=0.125 2023-10-12 19:07:40,177 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1152811.3333333333, ans=0.0 2023-10-12 19:07:42,742 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1152811.3333333333, ans=0.125 2023-10-12 19:08:19,181 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1152951.3333333333, ans=0.0 2023-10-12 19:08:29,828 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1152998.0, ans=0.125 2023-10-12 19:08:52,020 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.21 vs. limit=15.0 2023-10-12 19:08:52,360 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.766e+02 1.857e+02 2.030e+02 2.896e+02, threshold=3.714e+02, percent-clipped=0.0 2023-10-12 19:08:53,711 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1153138.0, ans=0.0 2023-10-12 19:09:14,993 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1153184.6666666667, ans=0.1 2023-10-12 19:09:18,165 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1153231.3333333333, ans=0.125 2023-10-12 19:09:44,617 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1153324.6666666667, ans=0.0 2023-10-12 19:09:50,776 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1153371.3333333333, ans=0.1 2023-10-12 19:10:05,687 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1153418.0, ans=0.2 2023-10-12 19:10:27,909 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1153511.3333333333, ans=0.1 2023-10-12 19:10:41,645 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1153558.0, ans=0.0 2023-10-12 19:10:48,408 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.388e+02 1.712e+02 1.902e+02 2.147e+02 3.147e+02, threshold=3.804e+02, percent-clipped=0.0 2023-10-12 19:11:41,270 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1153791.3333333333, ans=0.1 2023-10-12 19:11:51,321 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1153838.0, ans=0.0 2023-10-12 19:11:57,494 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.36 vs. 
limit=22.5 2023-10-12 19:12:14,062 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1153884.6666666667, ans=0.0 2023-10-12 19:12:16,527 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1153931.3333333333, ans=0.125 2023-10-12 19:12:41,052 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1153978.0, ans=0.125 2023-10-12 19:12:43,412 INFO [train.py:1031] (0/4) Epoch 19, batch 1500, loss[loss=0.1782, simple_loss=0.2731, pruned_loss=0.04168, over 16647.00 frames. ], tot_loss[loss=0.1909, simple_loss=0.2822, pruned_loss=0.04987, over 17318470.44 frames. ], batch size: 202, lr: 1.85e-03, grad_scale: 8.0 2023-10-12 19:12:43,698 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1154024.6666666667, ans=0.0 2023-10-12 19:12:45,165 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1154024.6666666667, ans=0.125 2023-10-12 19:12:50,061 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1154024.6666666667, ans=0.125 2023-10-12 19:12:52,696 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.384e+02 1.756e+02 1.923e+02 2.156e+02 3.062e+02, threshold=3.846e+02, percent-clipped=0.0 2023-10-12 19:13:12,299 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1154118.0, ans=0.125 2023-10-12 19:13:16,137 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.59 vs. limit=15.0 2023-10-12 19:13:26,387 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1154164.6666666667, ans=0.2 2023-10-12 19:13:28,406 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1154164.6666666667, ans=0.125 2023-10-12 19:13:39,835 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1154211.3333333333, ans=0.2 2023-10-12 19:13:43,867 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=1154258.0, ans=10.0 2023-10-12 19:13:59,624 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1154304.6666666667, ans=0.125 2023-10-12 19:14:06,089 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 19:14:08,774 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1154351.3333333333, ans=0.1 2023-10-12 19:14:19,057 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.54 vs. 
limit=15.0 2023-10-12 19:14:36,183 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1154444.6666666667, ans=0.0 2023-10-12 19:14:49,453 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.414e+02 1.702e+02 1.870e+02 2.140e+02 3.340e+02, threshold=3.739e+02, percent-clipped=0.0 2023-10-12 19:14:50,677 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=1154538.0, ans=0.5 2023-10-12 19:15:02,481 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1154584.6666666667, ans=0.1 2023-10-12 19:15:05,324 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1154584.6666666667, ans=0.2 2023-10-12 19:15:18,129 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1154631.3333333333, ans=0.0 2023-10-12 19:15:42,593 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1154724.6666666667, ans=0.125 2023-10-12 19:16:20,968 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1154864.6666666667, ans=0.1 2023-10-12 19:16:21,124 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1154864.6666666667, ans=0.125 2023-10-12 19:16:21,933 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1154864.6666666667, ans=0.125 2023-10-12 19:16:44,706 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1154958.0, ans=0.125 2023-10-12 19:16:47,361 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.446e+02 1.767e+02 1.957e+02 2.260e+02 3.832e+02, threshold=3.914e+02, percent-clipped=1.0 2023-10-12 19:16:55,011 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1155004.6666666667, ans=0.2 2023-10-12 19:17:04,690 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1155051.3333333333, ans=0.125 2023-10-12 19:17:43,887 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.10 vs. limit=6.0 2023-10-12 19:17:48,585 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1155238.0, ans=0.125 2023-10-12 19:17:51,452 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1155238.0, ans=0.0 2023-10-12 19:18:25,590 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1155378.0, ans=0.035 2023-10-12 19:18:25,658 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1155378.0, ans=0.125 2023-10-12 19:18:26,070 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=5.48 vs. 
limit=15.0 2023-10-12 19:18:45,884 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.486e+02 1.728e+02 1.856e+02 2.093e+02 2.791e+02, threshold=3.711e+02, percent-clipped=0.0 2023-10-12 19:18:47,268 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1155471.3333333333, ans=0.0 2023-10-12 19:19:06,939 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1155518.0, ans=0.125 2023-10-12 19:19:07,813 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1155564.6666666667, ans=0.125 2023-10-12 19:19:24,577 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1155611.3333333333, ans=0.0 2023-10-12 19:19:42,774 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1155704.6666666667, ans=0.125 2023-10-12 19:19:51,531 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.50 vs. limit=15.0 2023-10-12 19:20:01,531 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1155751.3333333333, ans=0.2 2023-10-12 19:20:02,491 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1155751.3333333333, ans=0.0 2023-10-12 19:20:39,352 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.733e+02 1.939e+02 2.115e+02 2.572e+02, threshold=3.879e+02, percent-clipped=0.0 2023-10-12 19:20:45,025 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.51 vs. limit=15.0 2023-10-12 19:20:46,821 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1155938.0, ans=0.125 2023-10-12 19:21:01,172 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.75 vs. limit=12.0 2023-10-12 19:21:19,620 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1156078.0, ans=0.125 2023-10-12 19:21:42,396 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.84 vs. limit=22.5 2023-10-12 19:22:13,562 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.23 vs. limit=15.0 2023-10-12 19:22:34,720 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1156311.3333333333, ans=0.125 2023-10-12 19:22:39,457 INFO [train.py:1031] (0/4) Epoch 19, batch 2000, loss[loss=0.1712, simple_loss=0.2686, pruned_loss=0.03694, over 15911.00 frames. ], tot_loss[loss=0.1915, simple_loss=0.2829, pruned_loss=0.05009, over 20756011.56 frames. 
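], batch size: 43, lr: 1.84e-03, grad_scale: 32.0

The train.py:1031 summary just above reports two figures: a per-batch loss over that batch's own frames, and a running tot_loss over every frame seen so far (20,756,011 here). Below is a minimal sketch of frame-weighted loss averaging with that shape; the class name is invented, and the plain all-time weighted mean is an assumption for illustration (the real train.py uses a MetricsTracker, which I believe also applies periodic resets/decay).

```python
# Hypothetical sketch of frame-weighted loss tracking (not icefall's
# MetricsTracker): each batch contributes in proportion to its frames.
class FrameWeightedLoss:
    def __init__(self) -> None:
        self.weighted_sum = 0.0   # sum over batches of loss * frames
        self.num_frames = 0.0     # total frames accumulated

    def update(self, loss: float, frames: float) -> float:
        self.weighted_sum += loss * frames
        self.num_frames += frames
        return self.weighted_sum / self.num_frames  # the running tot_loss

tracker = FrameWeightedLoss()
# the batch-2000 record above: per-batch loss 0.1712 over 15911 frames
print(tracker.update(0.1712, 15911.0))  # -> 0.1712 (first batch seen)
```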
2023-10-12 19:22:44,577 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1156358.0, ans=0.1 2023-10-12 19:22:50,563 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.438e+02 1.774e+02 1.935e+02 2.138e+02 2.712e+02, threshold=3.869e+02, percent-clipped=0.0 2023-10-12 19:22:59,169 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1156404.6666666667, ans=0.05 2023-10-12 19:23:22,004 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1156498.0, ans=0.0 2023-10-12 19:23:36,779 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1156544.6666666667, ans=0.125 2023-10-12 19:24:04,954 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.32 vs. limit=12.0 2023-10-12 19:24:22,871 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 19:24:26,965 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1156731.3333333333, ans=0.2 2023-10-12 19:24:48,591 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.73 vs. limit=22.5 2023-10-12 19:24:55,408 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.50 vs. limit=15.0 2023-10-12 19:25:09,963 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1156824.6666666667, ans=0.09899494936611666 2023-10-12 19:25:19,098 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.723e+02 1.888e+02 2.197e+02 3.185e+02, threshold=3.777e+02, percent-clipped=0.0 2023-10-12 19:25:24,668 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1156871.3333333333, ans=0.125
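Nearly every scaling.py:199 record in this log is a ScheduledFloat: a scalar hyperparameter (a dropout p, a skip rate, a balancer bound) whose current value ans is looked up from batch_count. A minimal sketch of a piecewise-linear schedule of that shape follows; the class name and the breakpoints are invented for illustration, and icefall's real ScheduledFloat carries extra machinery such as defaults and schedule arithmetic.

```python
# Minimal sketch of a batch_count -> value schedule like the ones logged
# above (hypothetical reimplementation, not icefall's scaling.py code).
from bisect import bisect_right

class ScheduledFloatSketch:
    def __init__(self, *points):
        # points: (batch_count, value) pairs in increasing batch_count order
        self.xs = [x for x, _ in points]
        self.ys = [y for _, y in points]

    def value(self, batch_count: float) -> float:
        if batch_count <= self.xs[0]:
            return self.ys[0]
        if batch_count >= self.xs[-1]:
            return self.ys[-1]
        i = bisect_right(self.xs, batch_count)
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)

# A dropout p annealed from 0.3 to 0.1 over the first 20k batches
# (breakpoints invented): past the last breakpoint the value is pinned,
# which is why records at batch_count ~1.15M keep printing ans=0.1.
p = ScheduledFloatSketch((0.0, 0.3), (20000.0, 0.1))
print(p.value(1156871.3333333333))  # -> 0.1
```

2023-10-12 19:25:24,809 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.46 vs.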
limit=10.0 2023-10-12 19:25:33,639 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1156871.3333333333, ans=0.125 2023-10-12 19:25:46,993 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1156918.0, ans=0.1 2023-10-12 19:25:58,871 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1156964.6666666667, ans=0.1 2023-10-12 19:26:19,347 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1157011.3333333333, ans=0.2 2023-10-12 19:26:27,225 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1157058.0, ans=0.2 2023-10-12 19:27:07,111 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1157198.0, ans=0.125 2023-10-12 19:27:11,557 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1157244.6666666667, ans=0.125 2023-10-12 19:27:33,806 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-248000.pt 2023-10-12 19:27:37,755 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1157291.3333333333, ans=0.2 2023-10-12 19:27:39,129 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.810e+02 2.043e+02 2.281e+02 3.025e+02, threshold=4.086e+02, percent-clipped=0.0 2023-10-12 19:27:48,429 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1157384.6666666667, ans=0.125 2023-10-12 19:27:59,182 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1157384.6666666667, ans=0.0 2023-10-12 19:28:02,069 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1157431.3333333333, ans=0.1 2023-10-12 19:28:02,989 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1157431.3333333333, ans=0.125 2023-10-12 19:28:06,229 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1157431.3333333333, ans=0.125 2023-10-12 19:28:14,182 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1157478.0, ans=0.125 2023-10-12 19:28:33,047 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1157524.6666666667, ans=0.0 2023-10-12 19:28:40,390 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1157571.3333333333, ans=0.125 2023-10-12 19:28:48,959 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1157571.3333333333, ans=0.5 2023-10-12 19:28:54,196 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.83 vs. 
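limit=12.0

The checkpoint.py:75 record above saves zipformer/exp_XL_bpe/checkpoint-248000.pt, a checkpoint named after the global batch index. A small sketch of that cadence check follows; the helper name and the 8000-batch interval are assumptions (8000 is merely consistent with 248000 in the filename), not a quote of icefall's code.

```python
from pathlib import Path
from typing import Optional

def checkpoint_path(exp_dir: Path, batch_idx_train: int,
                    save_every_n: int = 8000) -> Optional[Path]:
    # Hypothetical cadence check: emit a path like checkpoint-248000.pt
    # when the global batch index hits a multiple of save_every_n.
    if batch_idx_train > 0 and batch_idx_train % save_every_n == 0:
        return exp_dir / f"checkpoint-{batch_idx_train}.pt"
    return None

print(checkpoint_path(Path("zipformer/exp_XL_bpe"), 248000))
# -> zipformer/exp_XL_bpe/checkpoint-248000.pt
```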
2023-10-12 19:29:22,151 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1157758.0, ans=0.125 2023-10-12 19:29:33,354 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.861e+02 2.037e+02 2.495e+02 4.157e+02, threshold=4.073e+02, percent-clipped=1.0 2023-10-12 19:29:33,950 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.71 vs. limit=15.0 2023-10-12 19:29:47,309 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1157851.3333333333, ans=0.2 2023-10-12 19:29:49,264 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.33 vs. limit=15.0 2023-10-12 19:29:59,850 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1157898.0, ans=0.2 2023-10-12 19:30:20,078 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1157991.3333333333, ans=0.125 2023-10-12 19:30:21,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1157991.3333333333, ans=0.2 2023-10-12 19:30:27,408 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.09 vs. limit=15.0 2023-10-12 19:30:45,227 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1158084.6666666667, ans=0.125 2023-10-12 19:30:59,952 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1158131.3333333333, ans=0.125 2023-10-12 19:31:29,155 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.462e+02 1.810e+02 1.946e+02 2.126e+02 2.830e+02, threshold=3.891e+02, percent-clipped=0.0 2023-10-12 19:31:35,771 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1158271.3333333333, ans=0.125 2023-10-12 19:31:53,746 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1158364.6666666667, ans=0.125 2023-10-12 19:31:55,038 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.97 vs.
limit=15.0 2023-10-12 19:32:14,421 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1158458.0, ans=0.0 2023-10-12 19:32:21,534 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1158458.0, ans=0.0 2023-10-12 19:32:22,670 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1158458.0, ans=0.0 2023-10-12 19:32:27,663 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1158504.6666666667, ans=0.125 2023-10-12 19:32:40,705 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1158551.3333333333, ans=0.125 2023-10-12 19:33:00,333 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1158644.6666666667, ans=0.025 2023-10-12 19:33:09,031 INFO [train.py:1031] (0/4) Epoch 19, batch 2500, loss[loss=0.1904, simple_loss=0.2554, pruned_loss=0.0627, over 12511.00 frames. ], tot_loss[loss=0.1918, simple_loss=0.2832, pruned_loss=0.05025, over 23443536.20 frames. ], batch size: 440, lr: 1.84e-03, grad_scale: 16.0 2023-10-12 19:33:11,301 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1158691.3333333333, ans=0.125 2023-10-12 19:33:11,358 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1158691.3333333333, ans=0.5 2023-10-12 19:33:14,353 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.22 vs. limit=10.0 2023-10-12 19:33:19,975 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1158738.0, ans=0.125 2023-10-12 19:33:22,197 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.561e+02 1.734e+02 1.883e+02 2.082e+02 3.087e+02, threshold=3.765e+02, percent-clipped=0.0 2023-10-12 19:33:27,447 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.76 vs. 
limit=15.0 2023-10-12 19:33:55,213 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1158878.0, ans=0.0 2023-10-12 19:34:02,028 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1158924.6666666667, ans=0.125 2023-10-12 19:34:21,905 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1158971.3333333333, ans=0.0 2023-10-12 19:34:22,875 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 19:34:37,783 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1159064.6666666667, ans=0.1 2023-10-12 19:34:50,297 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1159111.3333333333, ans=0.2 2023-10-12 19:34:58,041 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1159158.0, ans=0.0 2023-10-12 19:35:04,908 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1159158.0, ans=0.125 2023-10-12 19:35:08,896 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.415e+02 1.816e+02 1.986e+02 2.145e+02 3.249e+02, threshold=3.972e+02, percent-clipped=0.0 2023-10-12 19:35:25,749 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.30 vs. limit=15.0 2023-10-12 19:35:36,039 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1159298.0, ans=0.125 2023-10-12 19:35:41,624 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.30 vs. 
limit=15.0 2023-10-12 19:35:48,342 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1159344.6666666667, ans=0.125 2023-10-12 19:35:52,496 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 19:35:58,010 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1159391.3333333333, ans=0.125 2023-10-12 19:36:01,421 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1159391.3333333333, ans=0.0 2023-10-12 19:36:32,102 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1159531.3333333333, ans=0.0 2023-10-12 19:36:39,190 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1159531.3333333333, ans=0.125 2023-10-12 19:36:43,027 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1159578.0, ans=0.0 2023-10-12 19:36:50,255 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1159578.0, ans=0.125 2023-10-12 19:36:55,742 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1159624.6666666667, ans=0.1 2023-10-12 19:37:00,019 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1159624.6666666667, ans=0.125 2023-10-12 19:37:09,811 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 1.741e+02 1.953e+02 2.159e+02 3.231e+02, threshold=3.907e+02, percent-clipped=0.0 2023-10-12 19:37:11,063 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1159671.3333333333, ans=0.1 2023-10-12 19:37:51,067 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=1159811.3333333333, ans=0.95 2023-10-12 19:38:02,860 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1159858.0, ans=0.125 2023-10-12 19:38:07,251 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1159858.0, ans=0.125 2023-10-12 19:38:15,827 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1159904.6666666667, ans=0.0 2023-10-12 19:38:25,845 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1159951.3333333333, ans=0.125 2023-10-12 19:38:38,927 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1159998.0, ans=0.0 2023-10-12 19:38:57,975 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1160044.6666666667, ans=0.04949747468305833 2023-10-12 19:39:12,506 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.325e+02 1.712e+02 1.959e+02 2.214e+02 3.268e+02, threshold=3.919e+02, percent-clipped=0.0 2023-10-12 19:39:12,860 INFO [scaling.py:199] (0/4) 
ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1160138.0, ans=0.0 2023-10-12 19:40:06,302 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1160278.0, ans=10.0 2023-10-12 19:40:40,659 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=1160418.0, ans=0.025 2023-10-12 19:40:53,756 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1160464.6666666667, ans=0.125 2023-10-12 19:41:00,791 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1160464.6666666667, ans=0.0 2023-10-12 19:41:17,526 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 19:41:23,442 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1160558.0, ans=0.1 2023-10-12 19:41:29,142 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1160604.6666666667, ans=0.125 2023-10-12 19:41:29,859 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.422e+02 1.745e+02 1.895e+02 2.031e+02 2.974e+02, threshold=3.791e+02, percent-clipped=0.0 2023-10-12 19:41:35,952 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1160604.6666666667, ans=0.2 2023-10-12 19:42:09,392 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1160744.6666666667, ans=0.0 2023-10-12 19:42:30,018 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.53 vs. limit=22.5 2023-10-12 19:42:30,802 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 19:42:34,628 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1160884.6666666667, ans=0.0 2023-10-12 19:42:38,749 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1160884.6666666667, ans=0.125 2023-10-12 19:42:55,379 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1160978.0, ans=0.125 2023-10-12 19:42:59,112 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.75 vs. limit=15.0 2023-10-12 19:43:05,383 INFO [train.py:1031] (0/4) Epoch 19, batch 3000, loss[loss=0.181, simple_loss=0.2771, pruned_loss=0.04245, over 16814.00 frames. ], tot_loss[loss=0.1917, simple_loss=0.2826, pruned_loss=0.05039, over 25525553.75 frames. 
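], batch size: 87, lr: 1.84e-03, grad_scale: 16.0

The epoch summaries in this log show the learning rate creeping down slowly (1.85e-03 earlier, 1.84e-03 here) even inside epoch 19. Below is a sketch of an Eden-style schedule that decays polynomially in both the batch index and the epoch; the function name, exponents, and constants are assumptions chosen to land near the logged values, not a verbatim copy of icefall's optim.py.

```python
def eden_lr(base_lr: float, batch: int, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 1.0) -> float:
    # Hypothetical Eden-style schedule: smooth power-law decay in both
    # the global batch index and the (possibly fractional) epoch.
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor

# Around global batch ~248000 in epoch 19 this gives ~1.8e-03, in the
# neighbourhood of the lr: 1.84e-03 printed by the records above.
print(eden_lr(0.045, 248000, 19.0))
```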
2023-10-12 19:43:13,582 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1161024.6666666667, ans=0.125 2023-10-12 19:43:15,030 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1161024.6666666667, ans=0.125 2023-10-12 19:43:18,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1161071.3333333333, ans=0.1 2023-10-12 19:43:18,986 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.393e+02 1.729e+02 1.936e+02 2.188e+02 3.279e+02, threshold=3.873e+02, percent-clipped=0.0 2023-10-12 19:43:22,116 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1161071.3333333333, ans=0.2 2023-10-12 19:43:28,559 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1161118.0, ans=0.0 2023-10-12 19:43:39,209 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1161164.6666666667, ans=0.125 2023-10-12 19:43:43,961 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.20 vs. limit=15.0 2023-10-12 19:43:58,849 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1161211.3333333333, ans=0.0 2023-10-12 19:43:59,011 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.80 vs. limit=15.0 2023-10-12 19:44:29,358 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1161351.3333333333, ans=0.125 2023-10-12 19:44:32,008 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1161351.3333333333, ans=0.0 2023-10-12 19:44:37,784 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1161398.0, ans=0.125 2023-10-12 19:44:44,494 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 19:44:45,400 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1161444.6666666667, ans=0.125 2023-10-12 19:44:55,242 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1161444.6666666667, ans=0.0 2023-10-12 19:45:11,472 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1161491.3333333333, ans=0.125 2023-10-12 19:45:18,390 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.581e+02 1.825e+02 2.031e+02 2.305e+02 3.403e+02, threshold=4.062e+02, percent-clipped=0.0 2023-10-12 19:45:25,148 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1161538.0, ans=0.0 2023-10-12 19:45:29,121 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.57 vs.
limit=22.5 2023-10-12 19:45:38,497 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1161631.3333333333, ans=0.0 2023-10-12 19:45:44,868 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1161631.3333333333, ans=0.0 2023-10-12 19:45:52,610 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1161678.0, ans=0.2 2023-10-12 19:45:59,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1161724.6666666667, ans=0.125 2023-10-12 19:46:00,076 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1161724.6666666667, ans=0.2 2023-10-12 19:46:04,609 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1161724.6666666667, ans=0.2 2023-10-12 19:46:32,971 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1161864.6666666667, ans=0.125 2023-10-12 19:46:41,347 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1161864.6666666667, ans=0.07 2023-10-12 19:46:47,650 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1161911.3333333333, ans=0.125 2023-10-12 19:47:09,699 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.421e+02 1.696e+02 1.877e+02 2.136e+02 3.018e+02, threshold=3.753e+02, percent-clipped=0.0 2023-10-12 19:47:36,172 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1162098.0, ans=0.0 2023-10-12 19:48:28,820 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.05 vs. 
limit=22.5 2023-10-12 19:48:58,335 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1162378.0, ans=0.2 2023-10-12 19:49:00,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1162378.0, ans=0.125 2023-10-12 19:49:04,307 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1162424.6666666667, ans=0.2 2023-10-12 19:49:08,463 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1162424.6666666667, ans=0.0 2023-10-12 19:49:15,220 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 19:49:15,775 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.389e+02 1.713e+02 1.887e+02 2.016e+02 3.327e+02, threshold=3.773e+02, percent-clipped=0.0 2023-10-12 19:49:17,948 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1162471.3333333333, ans=0.125 2023-10-12 19:49:18,121 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1162471.3333333333, ans=0.1 2023-10-12 19:49:49,712 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1162611.3333333333, ans=0.125 2023-10-12 19:49:52,754 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1162611.3333333333, ans=0.2 2023-10-12 19:50:06,707 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1162658.0, ans=0.0 2023-10-12 19:50:22,589 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=1162704.6666666667, ans=0.02 2023-10-12 19:50:27,137 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=1162751.3333333333, ans=0.5 2023-10-12 19:50:30,109 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1162751.3333333333, ans=0.0 2023-10-12 19:50:49,775 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1162798.0, ans=0.1 2023-10-12 19:50:51,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1162844.6666666667, ans=0.125 2023-10-12 19:50:55,091 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1162844.6666666667, ans=0.2 2023-10-12 19:51:03,368 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1162891.3333333333, ans=0.07 2023-10-12 19:51:13,499 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.19 vs. 
limit=15.0 2023-10-12 19:51:18,427 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.516e+02 1.748e+02 1.905e+02 2.115e+02 2.972e+02, threshold=3.811e+02, percent-clipped=0.0 2023-10-12 19:51:24,437 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.14 vs. limit=22.5 2023-10-12 19:51:29,060 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.65 vs. limit=15.0 2023-10-12 19:51:34,740 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1162984.6666666667, ans=0.2 2023-10-12 19:51:40,608 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 19:52:16,806 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1163171.3333333333, ans=0.2 2023-10-12 19:52:34,824 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1163264.6666666667, ans=0.1 2023-10-12 19:52:51,920 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.42 vs. limit=12.0 2023-10-12 19:52:59,189 INFO [train.py:1031] (0/4) Epoch 19, batch 3500, loss[loss=0.1931, simple_loss=0.2798, pruned_loss=0.05323, over 16904.00 frames. ], tot_loss[loss=0.1918, simple_loss=0.2826, pruned_loss=0.05045, over 27170712.91 frames. ], batch size: 110, lr: 1.84e-03, grad_scale: 16.0 2023-10-12 19:53:04,880 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1163358.0, ans=0.5 2023-10-12 19:53:11,169 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1163404.6666666667, ans=0.125 2023-10-12 19:53:14,124 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.467e+02 1.696e+02 1.857e+02 2.050e+02 4.486e+02, threshold=3.715e+02, percent-clipped=1.0 2023-10-12 19:53:18,847 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1163404.6666666667, ans=0.2 2023-10-12 19:53:27,066 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.49 vs. limit=10.0 2023-10-12 19:53:41,980 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1163498.0, ans=0.1 2023-10-12 19:53:41,984 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1163498.0, ans=0.125 2023-10-12 19:54:01,217 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1163591.3333333333, ans=0.125 2023-10-12 19:54:07,487 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1163591.3333333333, ans=0.0 2023-10-12 19:54:51,209 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.08 vs. 
limit=22.5 2023-10-12 19:55:22,876 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.499e+02 1.780e+02 1.941e+02 2.125e+02 2.998e+02, threshold=3.882e+02, percent-clipped=0.0 2023-10-12 19:55:23,205 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1163871.3333333333, ans=0.035 2023-10-12 19:55:30,326 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1163918.0, ans=0.2 2023-10-12 19:55:51,592 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_ff2.min_abs, batch_count=1163964.6666666667, ans=0.1 2023-10-12 19:55:57,809 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1164011.3333333333, ans=0.0 2023-10-12 19:55:59,672 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1164011.3333333333, ans=0.05 2023-10-12 19:56:06,709 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1164058.0, ans=0.1 2023-10-12 19:56:21,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1164104.6666666667, ans=15.0 2023-10-12 19:56:26,022 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1164104.6666666667, ans=0.0 2023-10-12 19:56:42,638 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.96 vs. limit=6.0 2023-10-12 19:57:08,740 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1164291.3333333333, ans=0.125 2023-10-12 19:57:15,725 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.71 vs. limit=10.0 2023-10-12 19:57:19,587 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.459e+02 1.735e+02 1.911e+02 2.167e+02 2.897e+02, threshold=3.823e+02, percent-clipped=0.0 2023-10-12 19:57:36,646 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1164384.6666666667, ans=0.1 2023-10-12 19:58:11,263 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1164478.0, ans=0.125 2023-10-12 19:58:19,327 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.76 vs. limit=15.0 2023-10-12 19:58:28,627 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1164571.3333333333, ans=0.1 2023-10-12 19:58:45,185 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1164618.0, ans=0.2 2023-10-12 19:58:51,406 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.96 vs. 
limit=15.0 2023-10-12 19:58:52,687 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1164664.6666666667, ans=0.0 2023-10-12 19:59:08,890 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1164711.3333333333, ans=0.125 2023-10-12 19:59:16,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1164758.0, ans=0.125 2023-10-12 19:59:28,489 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.432e+02 1.769e+02 1.898e+02 2.114e+02 2.786e+02, threshold=3.796e+02, percent-clipped=0.0 2023-10-12 19:59:32,870 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1164804.6666666667, ans=0.125 2023-10-12 19:59:50,329 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1164898.0, ans=0.125 2023-10-12 19:59:58,151 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1164898.0, ans=0.0 2023-10-12 19:59:58,544 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.18 vs. limit=15.0 2023-10-12 20:00:09,444 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1164944.6666666667, ans=0.125 2023-10-12 20:00:11,012 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.93 vs. limit=15.0 2023-10-12 20:00:35,213 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1165038.0, ans=0.0 2023-10-12 20:00:45,576 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1165084.6666666667, ans=0.05 2023-10-12 20:00:53,097 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.69 vs. limit=15.0 2023-10-12 20:01:00,736 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1165178.0, ans=0.95 2023-10-12 20:01:00,948 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.52 vs. 
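limit=22.5

The optim.py:471 records report the quartiles (min, 25%, median, 75%, max) of recent gradient norms together with a clipping threshold and the fraction of recently clipped batches. Up to rounding, the logged threshold is twice the logged median (for example 3.796e+02 = 2.0 * 1.898e+02 just above), matching Clipping_scale=2.0. A sketch of one way such a statistic-driven threshold can be derived follows; the function name and window size are assumptions, and this is not icefall's actual ScaledAdam implementation.

```python
import torch

def clip_by_recent_median(grads, norm_history, clipping_scale=2.0,
                          window=128):
    # Hypothetical sketch: keep a window of recent gradient norms, log
    # their quartiles, and clip against clipping_scale * median, which
    # reproduces the threshold/median relation seen in the records above.
    norm = torch.sqrt(sum((g.detach() ** 2).sum() for g in grads))
    norm_history.append(norm.item())
    recent = torch.tensor(norm_history[-window:])
    quartiles = torch.quantile(recent,
                               torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * quartiles[2].item()
    clipped = norm.item() > threshold
    if clipped:
        for g in grads:
            g.mul_(threshold / norm)  # rescale the global norm to threshold
    return quartiles, threshold, clipped
```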
2023-10-12 20:01:22,234 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1165271.3333333333, ans=0.0 2023-10-12 20:01:26,340 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.310e+02 1.729e+02 1.960e+02 2.149e+02 3.461e+02, threshold=3.920e+02, percent-clipped=0.0 2023-10-12 20:02:01,216 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1165411.3333333333, ans=0.0 2023-10-12 20:02:14,115 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1165504.6666666667, ans=0.125 2023-10-12 20:02:28,016 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1165551.3333333333, ans=0.0 2023-10-12 20:02:29,956 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1165551.3333333333, ans=0.0 2023-10-12 20:02:37,263 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1165598.0, ans=0.2 2023-10-12 20:02:39,188 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=1165598.0, ans=0.5 2023-10-12 20:02:41,788 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1165598.0, ans=0.0 2023-10-12 20:02:48,166 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.96 vs. limit=8.0 2023-10-12 20:02:51,688 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1165644.6666666667, ans=0.2 2023-10-12 20:02:51,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1165644.6666666667, ans=0.1 2023-10-12 20:02:57,507 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.83 vs. limit=15.0 2023-10-12 20:02:58,817 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1165644.6666666667, ans=10.0 2023-10-12 20:03:01,867 INFO [train.py:1031] (0/4) Epoch 19, batch 4000, loss[loss=0.1868, simple_loss=0.2752, pruned_loss=0.04915, over 16471.00 frames. ], tot_loss[loss=0.1916, simple_loss=0.2822, pruned_loss=0.05054, over 28410587.08 frames. ], batch size: 50, lr: 1.84e-03, grad_scale: 32.0 2023-10-12 20:03:05,492 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1165691.3333333333, ans=0.2 2023-10-12 20:03:13,182 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1165691.3333333333, ans=10.0 2023-10-12 20:03:21,697 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.376e+02 1.753e+02 1.923e+02 2.126e+02 2.945e+02, threshold=3.846e+02, percent-clipped=0.0
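The scaling.py:979 Whitening records compare a per-module statistic against a limit (8.0, 10.0, 12.0, 15.0, 22.5 in this log); as I understand it, modules whose metric drifts above the limit are penalized back toward a white (identity-proportional) activation covariance. Below is an illustrative covariance-whiteness proxy with the right qualitative behaviour; the exact formula used by scaling.py may differ, so treat this as an assumption.

```python
import torch

def whitening_metric_sketch(x: torch.Tensor, num_groups: int = 1) -> float:
    # Illustrative whiteness proxy (an assumption, not icefall's exact
    # formula): 1.0 when each group's covariance is proportional to the
    # identity, larger when the eigenvalue spectrum is uneven.
    n, c = x.shape
    g = c // num_groups
    metric = 0.0
    for k in range(num_groups):
        xg = x[:, k * g:(k + 1) * g]
        cov = (xg.T @ xg) / n                 # (g, g) feature covariance
        eig = torch.linalg.eigvalsh(cov)      # eigenvalues, ascending
        mean = eig.mean().clamp(min=1e-20)
        metric += ((eig ** 2).mean() / mean ** 2).item()
    return metric / num_groups

x = torch.randn(1000, 256)         # near-white features
print(whitening_metric_sketch(x))  # close to 1, well below a limit like 12.0
```

2023-10-12 20:03:30,153 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.11 vs.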
limit=12.0 2023-10-12 20:03:35,950 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 20:03:46,428 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1165831.3333333333, ans=0.1 2023-10-12 20:03:58,787 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1165878.0, ans=0.125 2023-10-12 20:03:59,940 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1165878.0, ans=0.0 2023-10-12 20:04:01,578 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1165924.6666666667, ans=0.125 2023-10-12 20:04:02,598 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1165924.6666666667, ans=0.05 2023-10-12 20:04:03,490 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1165924.6666666667, ans=0.1 2023-10-12 20:04:26,403 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1166018.0, ans=0.09899494936611666 2023-10-12 20:04:40,575 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1166064.6666666667, ans=0.0 2023-10-12 20:04:48,708 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1166111.3333333333, ans=0.125 2023-10-12 20:04:56,243 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1166111.3333333333, ans=0.125 2023-10-12 20:05:14,605 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.523e+02 1.859e+02 1.970e+02 2.288e+02 3.382e+02, threshold=3.940e+02, percent-clipped=0.0 2023-10-12 20:05:41,300 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1166298.0, ans=0.125 2023-10-12 20:06:02,398 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1166391.3333333333, ans=0.125 2023-10-12 20:06:16,730 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1166438.0, ans=0.1 2023-10-12 20:06:19,941 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1166438.0, ans=0.2 2023-10-12 20:06:26,887 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.76 vs. 
limit=10.0 2023-10-12 20:06:49,577 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1166531.3333333333, ans=0.125 2023-10-12 20:07:02,644 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1166578.0, ans=0.125 2023-10-12 20:07:22,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1166671.3333333333, ans=0.1 2023-10-12 20:07:25,027 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.789e+02 1.937e+02 2.231e+02 2.968e+02, threshold=3.875e+02, percent-clipped=0.0 2023-10-12 20:07:58,112 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1166764.6666666667, ans=0.125 2023-10-12 20:08:03,288 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1166764.6666666667, ans=0.125 2023-10-12 20:08:21,108 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.33 vs. limit=15.0 2023-10-12 20:08:28,229 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1166904.6666666667, ans=0.5 2023-10-12 20:08:40,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1166951.3333333333, ans=0.1 2023-10-12 20:08:44,240 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1166951.3333333333, ans=0.1 2023-10-12 20:08:44,575 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.78 vs. limit=15.0 2023-10-12 20:08:46,182 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1166951.3333333333, ans=0.5 2023-10-12 20:08:47,958 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1166951.3333333333, ans=0.0 2023-10-12 20:09:14,989 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.33 vs. limit=15.0 2023-10-12 20:09:26,111 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1167138.0, ans=0.125 2023-10-12 20:09:32,005 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.783e+02 1.908e+02 2.066e+02 2.889e+02, threshold=3.816e+02, percent-clipped=0.0 2023-10-12 20:09:36,766 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.40 vs. 
limit=15.0 2023-10-12 20:09:42,324 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1167184.6666666667, ans=0.125 2023-10-12 20:10:08,604 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1167324.6666666667, ans=0.0 2023-10-12 20:10:15,997 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1167324.6666666667, ans=0.0 2023-10-12 20:10:44,483 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1167464.6666666667, ans=0.2 2023-10-12 20:10:56,704 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.87 vs. limit=15.0 2023-10-12 20:11:27,409 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.820e+02 1.974e+02 2.184e+02 2.854e+02, threshold=3.947e+02, percent-clipped=0.0 2023-10-12 20:11:34,872 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1167651.3333333333, ans=0.2 2023-10-12 20:11:35,852 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1167651.3333333333, ans=0.1 2023-10-12 20:11:40,101 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1167651.3333333333, ans=0.125 2023-10-12 20:11:42,783 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1167651.3333333333, ans=0.0 2023-10-12 20:12:03,033 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1167744.6666666667, ans=0.125 2023-10-12 20:12:24,442 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.92 vs. limit=12.0 2023-10-12 20:12:31,008 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1167838.0, ans=0.1 2023-10-12 20:12:37,895 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1167884.6666666667, ans=0.0 2023-10-12 20:12:38,069 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1167884.6666666667, ans=0.1 2023-10-12 20:12:43,800 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1167884.6666666667, ans=0.125 2023-10-12 20:12:48,089 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1167884.6666666667, ans=0.125 2023-10-12 20:12:54,212 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1167931.3333333333, ans=0.0 2023-10-12 20:13:04,095 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1167978.0, ans=0.2 2023-10-12 20:13:12,732 INFO [train.py:1031] (0/4) Epoch 19, batch 4500, loss[loss=0.1954, simple_loss=0.291, pruned_loss=0.04985, over 16808.00 frames. 
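
Each optim.py line summarises the recent distribution of gradient norms as five quantiles (min/25%/50%/75%/max) plus the active clipping threshold, and in this log the threshold tracks 2.0 × the median, matching Clipping_scale=2.0 (e.g. just above, threshold=3.947e+02 ≈ 2.0 × the 1.974e+02 median). A sketch of that scheme; the sliding-window bookkeeping is an assumption:

    from collections import deque
    import statistics

    class GradNormClipper:
        def __init__(self, clipping_scale: float = 2.0, window: int = 1000):
            self.scale = clipping_scale
            self.norms = deque(maxlen=window)   # recent gradient norms
            self.num_seen = 0
            self.num_clipped = 0

        def step(self, grad_norm: float) -> float:
            """Return the factor (<= 1.0) to multiply this step's gradients by."""
            self.norms.append(grad_norm)
            self.num_seen += 1
            threshold = self.scale * statistics.median(self.norms)
            if grad_norm > threshold:
                self.num_clipped += 1
                return threshold / grad_norm
            return 1.0

        def percent_clipped(self) -> float:
            # corresponds to the logged "percent-clipped=..." figure
            return 100.0 * self.num_clipped / max(1, self.num_seen)
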
], tot_loss[loss=0.1915, simple_loss=0.2824, pruned_loss=0.0503, over 29391161.46 frames. ], batch size: 188, lr: 1.83e-03, grad_scale: 16.0 2023-10-12 20:13:28,259 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.77 vs. limit=15.0 2023-10-12 20:13:30,241 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.459e+02 1.720e+02 1.897e+02 2.084e+02 2.769e+02, threshold=3.793e+02, percent-clipped=0.0 2023-10-12 20:13:56,613 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1168211.3333333333, ans=0.0 2023-10-12 20:14:14,402 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=3.74 vs. limit=15.0 2023-10-12 20:14:25,585 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.07 vs. limit=15.0 2023-10-12 20:14:56,459 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1168444.6666666667, ans=0.0 2023-10-12 20:14:57,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1168444.6666666667, ans=0.1 2023-10-12 20:15:20,083 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.764e+02 1.964e+02 2.300e+02 3.011e+02, threshold=3.929e+02, percent-clipped=0.0 2023-10-12 20:15:29,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1168584.6666666667, ans=0.2 2023-10-12 20:15:48,531 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1168678.0, ans=0.2 2023-10-12 20:15:54,368 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.54 vs. 
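
The three components in each train.py loss line are internally consistent with a pruned-transducer objective of the form loss = 0.5 · simple_loss + pruned_loss; the 0.5 weight is inferred from the logged numbers themselves rather than taken from the code. Checking it against the figures above:

    # batch 4500: tot_loss[loss=0.1915, simple_loss=0.2824, pruned_loss=0.0503]
    assert abs(0.5 * 0.2824 + 0.0503 - 0.1915) < 5e-4
    # batch 5000 per-utterance: loss=0.2174, simple_loss=0.3026, pruned_loss=0.06611
    assert abs(0.5 * 0.3026 + 0.06611 - 0.2174) < 5e-4
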
limit=6.0 2023-10-12 20:15:54,899 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 20:15:57,957 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1168724.6666666667, ans=0.0 2023-10-12 20:16:01,766 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1168724.6666666667, ans=0.125 2023-10-12 20:16:07,643 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1168771.3333333333, ans=0.1 2023-10-12 20:16:08,817 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1168771.3333333333, ans=0.125 2023-10-12 20:16:26,583 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1168818.0, ans=0.125 2023-10-12 20:16:34,142 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1168864.6666666667, ans=0.0 2023-10-12 20:16:35,073 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1168864.6666666667, ans=0.1 2023-10-12 20:16:35,113 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1168864.6666666667, ans=0.125 2023-10-12 20:16:45,466 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1168911.3333333333, ans=0.125 2023-10-12 20:16:52,518 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1168958.0, ans=0.125 2023-10-12 20:17:11,498 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1168958.0, ans=0.95 2023-10-12 20:17:18,980 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1169004.6666666667, ans=0.1 2023-10-12 20:17:20,849 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.393e+02 1.738e+02 1.884e+02 2.098e+02 2.757e+02, threshold=3.768e+02, percent-clipped=0.0 2023-10-12 20:17:27,930 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.83 vs. limit=15.0 2023-10-12 20:17:48,235 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1169144.6666666667, ans=0.125 2023-10-12 20:18:31,589 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1169331.3333333333, ans=0.0 2023-10-12 20:18:43,241 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1169378.0, ans=0.0 2023-10-12 20:18:57,689 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.19 vs. 
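
The many balancer entries (…balancer.prob, …min_positive, …max_positive, …min_abs) are constraints on per-channel activation statistics: the fraction of positive values should stay inside [min_positive, max_positive] (0.05 and 0.95 are typical values in these lines) and the mean absolute value above min_abs (e.g. 0.5), with prob controlling how often the check fires. A simplified sketch of the constraint itself; the real module nudges gradients directly, precisely because a positive-count is not differentiable:

    import torch

    def balancer_violation(x, min_positive=0.05, max_positive=0.95, min_abs=0.5):
        # x: (num_frames, num_channels); statistics are per channel.
        pos_frac = (x > 0).float().mean(dim=0)
        mean_abs = x.abs().mean(dim=0)
        return (
            (min_positive - pos_frac).clamp(min=0.0)    # too few positives
            + (pos_frac - max_positive).clamp(min=0.0)  # too many positives
            + (min_abs - mean_abs).clamp(min=0.0)       # activations too small
        ).sum()

    print(float(balancer_violation(torch.randn(100, 256))))  # ~0.0 when healthy
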
limit=15.0 2023-10-12 20:18:58,385 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1169424.6666666667, ans=0.0 2023-10-12 20:18:58,551 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.34 vs. limit=15.0 2023-10-12 20:19:12,299 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.663e+02 1.834e+02 2.005e+02 2.813e+02, threshold=3.668e+02, percent-clipped=0.0 2023-10-12 20:19:26,676 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1169518.0, ans=0.035 2023-10-12 20:19:38,723 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1169564.6666666667, ans=0.95 2023-10-12 20:19:55,546 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1169658.0, ans=0.125 2023-10-12 20:19:58,040 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.25 vs. limit=15.0 2023-10-12 20:20:06,049 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1169704.6666666667, ans=0.0 2023-10-12 20:20:07,927 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1169704.6666666667, ans=0.1 2023-10-12 20:20:19,296 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.66 vs. limit=22.5 2023-10-12 20:20:52,050 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1169891.3333333333, ans=0.2 2023-10-12 20:21:09,929 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.421e+02 1.788e+02 1.990e+02 2.211e+02 3.307e+02, threshold=3.980e+02, percent-clipped=0.0 2023-10-12 20:21:10,416 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1169938.0, ans=0.125 2023-10-12 20:21:18,285 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1169984.6666666667, ans=0.0 2023-10-12 20:21:21,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1169984.6666666667, ans=0.125 2023-10-12 20:21:21,608 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.80 vs. limit=15.0 2023-10-12 20:21:23,611 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1170031.3333333333, ans=0.0 2023-10-12 20:21:25,267 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.32 vs. 
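
attention_skip_rate, conv_skip_rate, ff2_skip_rate and bypass.skip_rate above are stochastic-depth style regularisers: with the given probability the corresponding sub-module is skipped for a batch. Most have annealed to 0.0 by this point, while the bypass rates linger at values like 0.035 and 0.09899494936611666. A minimal sketch, assuming per-batch skipping of a residual branch:

    import random
    import torch

    def apply_branch(module: torch.nn.Module, x: torch.Tensor,
                     skip_rate: float, training: bool = True) -> torch.Tensor:
        # With probability skip_rate during training, drop the branch entirely.
        if training and random.random() < skip_rate:
            return x
        return x + module(x)   # assumed residual form of each encoder branch
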
limit=15.0 2023-10-12 20:22:04,974 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1170171.3333333333, ans=0.0 2023-10-12 20:22:22,081 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1170264.6666666667, ans=0.5 2023-10-12 20:22:22,524 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.88 vs. limit=15.0 2023-10-12 20:22:28,059 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.73 vs. limit=22.5 2023-10-12 20:22:40,202 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 20:22:45,259 INFO [train.py:1031] (0/4) Epoch 19, batch 5000, loss[loss=0.2174, simple_loss=0.3026, pruned_loss=0.06611, over 16565.00 frames. ], tot_loss[loss=0.1914, simple_loss=0.2821, pruned_loss=0.05032, over 30153334.61 frames. ], batch size: 219, lr: 1.83e-03, grad_scale: 32.0 2023-10-12 20:22:47,043 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1170358.0, ans=0.125 2023-10-12 20:23:04,685 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.782e+02 1.990e+02 2.188e+02 3.191e+02, threshold=3.981e+02, percent-clipped=0.0 2023-10-12 20:23:15,224 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1170451.3333333333, ans=0.125 2023-10-12 20:23:32,733 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1170544.6666666667, ans=0.0 2023-10-12 20:23:44,087 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1170591.3333333333, ans=0.125 2023-10-12 20:24:03,208 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1170684.6666666667, ans=0.0 2023-10-12 20:24:04,391 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1170684.6666666667, ans=0.125 2023-10-12 20:24:11,282 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1170684.6666666667, ans=0.1 2023-10-12 20:24:15,757 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1170731.3333333333, ans=0.2 2023-10-12 20:24:36,514 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 20:24:46,302 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1170824.6666666667, ans=0.1 2023-10-12 20:25:01,075 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.560e+02 1.757e+02 1.915e+02 2.080e+02 3.362e+02, threshold=3.830e+02, percent-clipped=0.0 2023-10-12 20:25:03,233 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1170918.0, ans=0.2 2023-10-12 20:25:15,992 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1170964.6666666667, 
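
grad_scale in the train.py lines bounces between 16.0 (batch 4500) and 32.0 (batch 5000), the signature of dynamic loss scaling for fp16 training: the scale doubles after a stretch of overflow-free steps and halves when an inf/nan gradient appears. The same mechanism exists off the shelf in PyTorch; that icefall uses this exact class is an assumption, the sketch only mirrors the logged behaviour:

    import torch

    scaler = torch.cuda.amp.GradScaler(
        init_scale=16.0,       # matches the grad_scale seen at batch 4500
        growth_factor=2.0,     # 16.0 -> 32.0 after growth_interval clean steps
        backoff_factor=0.5,    # 32.0 -> 16.0 again on overflow
        growth_interval=2000,
    )
    # per step: scaler.scale(loss).backward(); scaler.step(optimizer); scaler.update()
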
ans=0.0 2023-10-12 20:25:23,001 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=1170964.6666666667, ans=0.5 2023-10-12 20:25:33,615 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1171011.3333333333, ans=0.2 2023-10-12 20:25:51,098 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0 2023-10-12 20:25:51,706 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1171104.6666666667, ans=0.0 2023-10-12 20:26:02,745 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1171151.3333333333, ans=0.07 2023-10-12 20:26:15,799 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.05 vs. limit=12.0 2023-10-12 20:26:18,710 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.25 vs. limit=22.5 2023-10-12 20:26:46,484 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1171338.0, ans=0.125 2023-10-12 20:26:51,703 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1171338.0, ans=0.1 2023-10-12 20:26:53,352 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.794e+02 1.947e+02 2.243e+02 2.906e+02, threshold=3.894e+02, percent-clipped=0.0 2023-10-12 20:26:53,861 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.92 vs. limit=15.0 2023-10-12 20:27:20,062 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1171478.0, ans=0.0 2023-10-12 20:27:42,571 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.53 vs. limit=6.0 2023-10-12 20:27:47,889 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-12 20:28:00,854 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1171618.0, ans=0.0 2023-10-12 20:28:15,665 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.45 vs. limit=22.5 2023-10-12 20:28:19,486 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1171711.3333333333, ans=0.95 2023-10-12 20:28:33,935 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.15 vs. 
limit=22.5 2023-10-12 20:28:46,638 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1171804.6666666667, ans=0.0 2023-10-12 20:28:50,688 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.459e+02 1.677e+02 1.877e+02 2.102e+02 2.700e+02, threshold=3.755e+02, percent-clipped=0.0 2023-10-12 20:29:04,204 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1171851.3333333333, ans=0.0 2023-10-12 20:29:11,122 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1171898.0, ans=0.125 2023-10-12 20:29:29,218 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.94 vs. limit=15.0 2023-10-12 20:29:48,755 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.56 vs. limit=10.0 2023-10-12 20:30:17,953 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1172084.6666666667, ans=0.0 2023-10-12 20:30:26,084 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.97 vs. limit=15.0 2023-10-12 20:30:42,509 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1172178.0, ans=0.1 2023-10-12 20:30:42,595 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1172178.0, ans=0.1 2023-10-12 20:30:49,500 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1172224.6666666667, ans=0.125 2023-10-12 20:30:49,939 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.04 vs. limit=22.5 2023-10-12 20:30:50,363 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1172224.6666666667, ans=0.0 2023-10-12 20:31:05,548 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.393e+02 1.718e+02 1.906e+02 2.162e+02 3.312e+02, threshold=3.811e+02, percent-clipped=0.0 2023-10-12 20:31:06,278 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.22 vs. limit=22.5 2023-10-12 20:31:27,499 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.94 vs. limit=12.0 2023-10-12 20:32:05,648 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1172458.0, ans=0.1 2023-10-12 20:32:40,108 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1172551.3333333333, ans=0.125 2023-10-12 20:32:53,374 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1172644.6666666667, ans=0.2 2023-10-12 20:33:01,574 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.45 vs. 
limit=15.0 2023-10-12 20:33:01,887 INFO [train.py:1031] (0/4) Epoch 19, batch 5500, loss[loss=0.1745, simple_loss=0.2698, pruned_loss=0.03963, over 16700.00 frames. ], tot_loss[loss=0.1909, simple_loss=0.2817, pruned_loss=0.05002, over 30724924.95 frames. ], batch size: 81, lr: 1.83e-03, grad_scale: 16.0 2023-10-12 20:33:11,355 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=15.0 2023-10-12 20:33:13,286 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.91 vs. limit=10.0 2023-10-12 20:33:21,616 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1172738.0, ans=0.125 2023-10-12 20:33:21,708 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1172738.0, ans=0.125 2023-10-12 20:33:25,335 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.702e+02 1.922e+02 2.285e+02 3.676e+02, threshold=3.843e+02, percent-clipped=0.0 2023-10-12 20:33:34,920 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.58 vs. limit=12.0 2023-10-12 20:33:38,599 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=4.81 vs. limit=15.0 2023-10-12 20:33:50,429 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.12 vs. limit=15.0 2023-10-12 20:33:56,384 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1172878.0, ans=0.1 2023-10-12 20:34:06,766 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1172924.6666666667, ans=0.0 2023-10-12 20:34:08,437 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1172971.3333333333, ans=0.0 2023-10-12 20:34:19,305 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.75 vs. limit=22.5 2023-10-12 20:34:22,217 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1173018.0, ans=0.0 2023-10-12 20:34:26,140 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1173018.0, ans=0.125 2023-10-12 20:34:30,015 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1173064.6666666667, ans=0.125 2023-10-12 20:34:30,225 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.31 vs. 
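
tot_loss is a frame-weighted running average of the per-batch losses; the fractional frame totals ("over 30724924.95 frames") and their decelerating growth from batch 4500 onward suggest an exponentially decayed running sum rather than a plain cumulative mean. Sketch of that bookkeeping; the decay constant is an assumption:

    class RunningLoss:
        def __init__(self, decay: float = 0.9995):
            self.decay = decay
            self.weighted_loss = 0.0
            self.frames = 0.0

        def update(self, batch_loss: float, batch_frames: float) -> float:
            self.weighted_loss = self.decay * self.weighted_loss + batch_loss * batch_frames
            self.frames = self.decay * self.frames + batch_frames
            return self.weighted_loss / self.frames   # reported as tot_loss[loss=...]
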
limit=15.0 2023-10-12 20:34:33,703 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1173064.6666666667, ans=0.125 2023-10-12 20:34:59,669 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1173158.0, ans=0.0 2023-10-12 20:35:16,032 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.546e+02 1.820e+02 2.005e+02 2.215e+02 3.353e+02, threshold=4.010e+02, percent-clipped=0.0 2023-10-12 20:35:20,979 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1173251.3333333333, ans=0.2 2023-10-12 20:35:30,280 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1173298.0, ans=0.0 2023-10-12 20:36:01,894 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 20:36:05,532 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1173391.3333333333, ans=0.025 2023-10-12 20:36:20,600 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.24 vs. limit=15.0 2023-10-12 20:36:41,655 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1173578.0, ans=0.1 2023-10-12 20:36:52,131 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.00 vs. limit=10.0 2023-10-12 20:37:04,344 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.41 vs. limit=15.0 2023-10-12 20:37:11,085 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1173671.3333333333, ans=0.125 2023-10-12 20:37:15,912 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.714e+02 1.871e+02 2.033e+02 2.729e+02, threshold=3.743e+02, percent-clipped=0.0 2023-10-12 20:37:18,338 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-12 20:37:20,856 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1173718.0, ans=0.125 2023-10-12 20:37:22,901 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1173718.0, ans=0.1 2023-10-12 20:37:23,866 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1173718.0, ans=0.2 2023-10-12 20:37:46,113 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=5.16 vs. 
limit=15.0 2023-10-12 20:38:10,950 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1173904.6666666667, ans=0.125 2023-10-12 20:38:12,783 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1173904.6666666667, ans=0.125 2023-10-12 20:38:38,902 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.51 vs. limit=15.0 2023-10-12 20:38:39,575 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1174044.6666666667, ans=0.1 2023-10-12 20:38:53,543 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1174091.3333333333, ans=0.1 2023-10-12 20:39:10,868 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1174138.0, ans=0.125 2023-10-12 20:39:12,971 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 1.781e+02 2.002e+02 2.350e+02 3.222e+02, threshold=4.003e+02, percent-clipped=0.0 2023-10-12 20:39:19,090 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1174184.6666666667, ans=0.0 2023-10-12 20:39:25,040 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1174184.6666666667, ans=0.125 2023-10-12 20:39:31,975 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1174231.3333333333, ans=0.2 2023-10-12 20:39:32,147 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1174231.3333333333, ans=0.1 2023-10-12 20:39:36,986 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1174231.3333333333, ans=0.2 2023-10-12 20:39:42,258 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=5.40 vs. limit=15.0 2023-10-12 20:39:45,020 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.98 vs. limit=6.0 2023-10-12 20:39:49,037 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1174278.0, ans=0.125 2023-10-12 20:39:55,343 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.34 vs. limit=12.0 2023-10-12 20:40:18,486 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.42 vs. 
limit=15.0 2023-10-12 20:40:52,661 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1174558.0, ans=0.125 2023-10-12 20:41:02,571 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1174604.6666666667, ans=0.0 2023-10-12 20:41:04,112 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1174604.6666666667, ans=0.125 2023-10-12 20:41:18,319 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1174651.3333333333, ans=0.125 2023-10-12 20:41:19,001 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.414e+02 1.793e+02 1.964e+02 2.220e+02 2.804e+02, threshold=3.929e+02, percent-clipped=0.0 2023-10-12 20:41:40,019 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1174698.0, ans=0.0 2023-10-12 20:41:48,065 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1174744.6666666667, ans=0.0 2023-10-12 20:41:49,338 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.67 vs. limit=10.0 2023-10-12 20:42:13,858 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1174838.0, ans=0.07 2023-10-12 20:42:16,714 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1174838.0, ans=0.2 2023-10-12 20:42:18,392 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1174884.6666666667, ans=0.125 2023-10-12 20:42:25,513 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1174884.6666666667, ans=0.125 2023-10-12 20:42:50,434 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1174978.0, ans=0.2 2023-10-12 20:42:51,668 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.22 vs. limit=15.0 2023-10-12 20:42:55,483 INFO [train.py:1031] (0/4) Epoch 19, batch 6000, loss[loss=0.2293, simple_loss=0.3094, pruned_loss=0.07461, over 16027.00 frames. ], tot_loss[loss=0.1914, simple_loss=0.2822, pruned_loss=0.05032, over 31200137.86 frames. ], batch size: 296, lr: 1.83e-03, grad_scale: 16.0 2023-10-12 20:43:03,139 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.18 vs. limit=12.0 2023-10-12 20:43:08,059 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=13.25 vs. 
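
Logs this dense are easier to audit programmatically. The Whitening messages, for instance, have a fixed shape from which (module, metric, limit) can be pulled to find modules running closest to their whitening limit; the format below is taken verbatim from this log (note it assumes the whole message sits on one line, whereas some here are hard-wrapped mid-message):

    import re

    WHITEN_RE = re.compile(
        r"Whitening: name=(?P<name>\S+), num_groups=(?P<g>\d+), "
        r"num_channels=(?P<c>\d+), metric=(?P<metric>[\d.]+) vs\. limit=(?P<limit>[\d.]+)"
    )

    def parse_whitening(line: str):
        m = WHITEN_RE.search(line)
        if m is None:
            return None
        return m.group("name"), float(m.group("metric")), float(m.group("limit"))

    print(parse_whitening(
        "INFO [scaling.py:979] (0/4) Whitening: "
        "name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, "
        "num_groups=1, num_channels=512, metric=16.73 vs. limit=22.5"
    ))   # ('encoder.encoders.3.encoder.layers.1.self_attn1.whiten', 16.73, 22.5)
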
limit=15.0 2023-10-12 20:43:15,993 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1175071.3333333333, ans=0.125 2023-10-12 20:43:20,085 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.490e+02 1.830e+02 2.020e+02 2.266e+02 4.107e+02, threshold=4.040e+02, percent-clipped=2.0 2023-10-12 20:43:21,549 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1175118.0, ans=0.09899494936611666 2023-10-12 20:43:42,773 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1175211.3333333333, ans=0.125 2023-10-12 20:43:50,654 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1175211.3333333333, ans=0.125 2023-10-12 20:44:05,992 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1175258.0, ans=0.0 2023-10-12 20:44:14,952 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1175304.6666666667, ans=0.2 2023-10-12 20:44:27,265 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1175351.3333333333, ans=0.05 2023-10-12 20:44:41,285 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1175444.6666666667, ans=0.125 2023-10-12 20:44:48,174 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1175444.6666666667, ans=0.125 2023-10-12 20:45:08,508 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1175538.0, ans=0.125 2023-10-12 20:45:13,280 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.389e+02 1.719e+02 1.859e+02 2.078e+02 3.561e+02, threshold=3.718e+02, percent-clipped=0.0 2023-10-12 20:45:29,838 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1175631.3333333333, ans=0.1 2023-10-12 20:45:41,564 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1175678.0, ans=0.125 2023-10-12 20:45:43,451 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=1.274e-02 2023-10-12 20:45:46,715 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1175724.6666666667, ans=0.0 2023-10-12 20:45:52,778 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1175724.6666666667, ans=0.125 2023-10-12 20:45:54,547 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1175724.6666666667, ans=0.125 2023-10-12 20:46:29,450 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1175864.6666666667, ans=0.2 2023-10-12 20:46:36,702 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1175911.3333333333, ans=0.125 2023-10-12 20:46:40,894 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1175911.3333333333, ans=0.0 2023-10-12 20:46:53,506 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1175958.0, ans=0.1 2023-10-12 20:47:08,854 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.529e+02 1.842e+02 2.003e+02 2.220e+02 3.919e+02, threshold=4.007e+02, percent-clipped=1.0 2023-10-12 20:47:10,162 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.76 vs. limit=10.0 2023-10-12 20:47:22,992 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1176098.0, ans=0.125 2023-10-12 20:47:43,565 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1176144.6666666667, ans=0.0 2023-10-12 20:48:12,266 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.89 vs. limit=10.0 2023-10-12 20:48:19,463 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1176284.6666666667, ans=0.125 2023-10-12 20:48:29,703 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1176331.3333333333, ans=0.125 2023-10-12 20:48:51,771 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.12 vs. limit=22.5 2023-10-12 20:49:03,239 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1176471.3333333333, ans=0.125 2023-10-12 20:49:08,173 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1176518.0, ans=0.0 2023-10-12 20:49:08,877 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.867e+02 2.076e+02 2.307e+02 3.120e+02, threshold=4.153e+02, percent-clipped=0.0 2023-10-12 20:49:09,982 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1176518.0, ans=0.2 2023-10-12 20:49:58,158 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.93 vs. limit=15.0 2023-10-12 20:50:03,716 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1176704.6666666667, ans=0.125 2023-10-12 20:50:24,560 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.25 vs. 
limit=15.0 2023-10-12 20:50:28,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1176751.3333333333, ans=0.1 2023-10-12 20:50:28,226 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1176751.3333333333, ans=0.0 2023-10-12 20:50:45,436 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1176844.6666666667, ans=0.0 2023-10-12 20:50:56,474 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1176891.3333333333, ans=0.125 2023-10-12 20:51:14,814 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.425e+02 1.642e+02 1.821e+02 2.068e+02 2.710e+02, threshold=3.642e+02, percent-clipped=0.0 2023-10-12 20:51:16,422 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.21 vs. limit=15.0 2023-10-12 20:51:21,784 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.91 vs. limit=12.0 2023-10-12 20:51:38,376 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=13.21 vs. limit=15.0 2023-10-12 20:52:13,520 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.48 vs. limit=22.5 2023-10-12 20:52:38,006 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1177311.3333333333, ans=0.1 2023-10-12 20:52:44,443 INFO [train.py:1031] (0/4) Epoch 19, batch 6500, loss[loss=0.1794, simple_loss=0.2814, pruned_loss=0.03872, over 16896.00 frames. ], tot_loss[loss=0.1918, simple_loss=0.2827, pruned_loss=0.05045, over 31553276.87 frames. ], batch size: 87, lr: 1.83e-03, grad_scale: 32.0 2023-10-12 20:52:47,525 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1177358.0, ans=0.125 2023-10-12 20:53:10,628 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.490e+02 1.818e+02 1.976e+02 2.310e+02 3.595e+02, threshold=3.952e+02, percent-clipped=0.0 2023-10-12 20:53:24,505 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1177498.0, ans=0.125 2023-10-12 20:53:37,294 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1177498.0, ans=0.1 2023-10-12 20:53:43,353 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.76 vs. 
limit=15.0 2023-10-12 20:54:03,193 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1177638.0, ans=0.125 2023-10-12 20:54:12,414 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1177684.6666666667, ans=0.1 2023-10-12 20:54:22,427 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1177684.6666666667, ans=0.0 2023-10-12 20:54:26,253 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1177731.3333333333, ans=0.0 2023-10-12 20:54:31,324 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.32 vs. limit=10.0 2023-10-12 20:54:45,889 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1177824.6666666667, ans=0.125 2023-10-12 20:55:05,662 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1177871.3333333333, ans=0.0 2023-10-12 20:55:07,293 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1177918.0, ans=0.0 2023-10-12 20:55:08,833 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.788e+02 1.992e+02 2.194e+02 2.769e+02, threshold=3.984e+02, percent-clipped=0.0 2023-10-12 20:55:23,910 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.77 vs. limit=15.0 2023-10-12 20:55:33,295 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1178011.3333333333, ans=0.1 2023-10-12 20:55:47,832 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.66 vs. limit=12.0 2023-10-12 20:55:48,579 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1178058.0, ans=0.125 2023-10-12 20:56:02,856 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1178151.3333333333, ans=0.0 2023-10-12 20:56:16,087 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1178198.0, ans=0.0 2023-10-12 20:56:22,500 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1178244.6666666667, ans=0.125 2023-10-12 20:56:23,707 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.48 vs. 
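
Read together, the num_channels fields of the *.out_whiten messages trace out the widths of the six encoder stacks (encoder.encoders.0 through .5). Collected here for reference; the assumption is that each stack keeps a single width, which every per-stack message in this section is consistent with:

    # Per-stack channel widths as observed in the Whitening lines of this log:
    ENCODER_DIMS = {
        "encoder.encoders.0": 192,
        "encoder.encoders.1": 256,
        "encoder.encoders.2": 384,
        "encoder.encoders.3": 512,
        "encoder.encoders.4": 384,
        "encoder.encoders.5": 256,
    }
    # The nonlin_attention.whiten1 lines report a narrower internal width that
    # appears to be 3/4 of the stack width (144 for 192, 288 for 384, 384 for 512).
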
limit=15.0 2023-10-12 20:56:28,568 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1178244.6666666667, ans=0.125 2023-10-12 20:56:30,867 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1178244.6666666667, ans=0.2 2023-10-12 20:56:36,164 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1178291.3333333333, ans=0.125 2023-10-12 20:56:49,149 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-12 20:56:55,617 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.751e+02 1.958e+02 2.215e+02 3.116e+02, threshold=3.917e+02, percent-clipped=0.0 2023-10-12 20:57:07,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1178431.3333333333, ans=0.125 2023-10-12 20:57:11,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1178431.3333333333, ans=0.0 2023-10-12 20:57:22,961 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1178478.0, ans=0.0 2023-10-12 20:57:23,140 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1178478.0, ans=0.04949747468305833 2023-10-12 20:57:41,577 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1178524.6666666667, ans=0.09899494936611666 2023-10-12 20:58:12,320 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1178664.6666666667, ans=0.0 2023-10-12 20:58:15,471 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1178664.6666666667, ans=0.125 2023-10-12 20:58:18,634 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1178711.3333333333, ans=0.2 2023-10-12 20:58:29,577 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.79 vs. 
limit=22.5 2023-10-12 20:58:30,889 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1178758.0, ans=0.125 2023-10-12 20:58:37,872 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1178758.0, ans=0.0 2023-10-12 20:59:06,435 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.404e+02 1.698e+02 1.899e+02 2.103e+02 3.320e+02, threshold=3.798e+02, percent-clipped=0.0 2023-10-12 20:59:29,317 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1178944.6666666667, ans=0.0 2023-10-12 20:59:54,231 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1179038.0, ans=0.0 2023-10-12 21:00:07,304 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1179084.6666666667, ans=0.025 2023-10-12 21:00:08,107 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1179084.6666666667, ans=0.1 2023-10-12 21:00:08,256 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 21:00:12,354 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.67 vs. limit=15.0 2023-10-12 21:00:13,586 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1179084.6666666667, ans=0.0 2023-10-12 21:00:13,618 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1179084.6666666667, ans=0.0 2023-10-12 21:00:14,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1179084.6666666667, ans=0.0 2023-10-12 21:00:14,753 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1179084.6666666667, ans=0.125 2023-10-12 21:00:18,731 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1179131.3333333333, ans=0.2 2023-10-12 21:00:49,969 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1179224.6666666667, ans=0.125 2023-10-12 21:00:53,342 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1179271.3333333333, ans=0.5 2023-10-12 21:01:05,988 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.664e+02 1.844e+02 1.997e+02 2.539e+02, threshold=3.688e+02, percent-clipped=0.0 2023-10-12 21:01:20,518 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1179364.6666666667, ans=0.0 2023-10-12 21:01:21,538 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1179364.6666666667, ans=0.125 2023-10-12 21:01:22,408 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1179364.6666666667, ans=0.0 2023-10-12 21:01:26,112 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, 
num_groups=1, num_channels=384, metric=3.01 vs. limit=15.0 2023-10-12 21:01:26,920 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1179411.3333333333, ans=0.09899494936611666 2023-10-12 21:01:26,927 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1179411.3333333333, ans=0.0 2023-10-12 21:01:28,607 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_ff2.min_abs, batch_count=1179411.3333333333, ans=0.1 2023-10-12 21:02:11,325 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.49 vs. limit=15.0 2023-10-12 21:02:15,108 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1179598.0, ans=0.125 2023-10-12 21:02:18,251 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.87 vs. limit=12.0 2023-10-12 21:02:21,798 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1179644.6666666667, ans=0.125 2023-10-12 21:02:22,962 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.10 vs. limit=15.0 2023-10-12 21:02:33,817 INFO [train.py:1031] (0/4) Epoch 19, batch 7000, loss[loss=0.2113, simple_loss=0.2963, pruned_loss=0.06312, over 16816.00 frames. ], tot_loss[loss=0.1918, simple_loss=0.2831, pruned_loss=0.05025, over 31859408.17 frames. ], batch size: 175, lr: 1.83e-03, grad_scale: 16.0 2023-10-12 21:02:56,766 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1179738.0, ans=0.125 2023-10-12 21:03:04,208 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1179784.6666666667, ans=0.0 2023-10-12 21:03:04,833 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.445e+02 1.792e+02 1.888e+02 2.090e+02 4.225e+02, threshold=3.776e+02, percent-clipped=1.0 2023-10-12 21:03:08,857 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.96 vs. limit=6.0 2023-10-12 21:03:10,493 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1179784.6666666667, ans=0.09899494936611666 2023-10-12 21:03:16,069 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.55 vs. 
limit=15.0 2023-10-12 21:03:16,881 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=1179831.3333333333, ans=0.2 2023-10-12 21:03:19,830 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1179831.3333333333, ans=0.125 2023-10-12 21:03:31,930 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1179878.0, ans=0.2 2023-10-12 21:03:38,237 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1179924.6666666667, ans=0.125 2023-10-12 21:03:55,880 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1179971.3333333333, ans=0.1 2023-10-12 21:04:12,580 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1180064.6666666667, ans=0.2 2023-10-12 21:04:19,548 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1180064.6666666667, ans=0.125 2023-10-12 21:04:19,554 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1180064.6666666667, ans=0.0 2023-10-12 21:04:25,639 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1180111.3333333333, ans=0.125 2023-10-12 21:04:35,164 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1180158.0, ans=10.0 2023-10-12 21:04:38,393 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.max_positive, batch_count=1180158.0, ans=0.95 2023-10-12 21:04:41,289 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1180158.0, ans=0.125 2023-10-12 21:04:56,305 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.544e+02 1.799e+02 1.960e+02 2.183e+02 2.964e+02, threshold=3.920e+02, percent-clipped=0.0 2023-10-12 21:05:15,040 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1180298.0, ans=0.0 2023-10-12 21:05:28,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1180344.6666666667, ans=0.0 2023-10-12 21:05:34,092 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.44 vs. 
limit=15.0 2023-10-12 21:05:40,565 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1180438.0, ans=0.125 2023-10-12 21:05:55,473 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1180484.6666666667, ans=0.0 2023-10-12 21:05:57,316 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1180484.6666666667, ans=0.2 2023-10-12 21:06:01,212 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1180531.3333333333, ans=0.0 2023-10-12 21:06:02,173 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1180531.3333333333, ans=0.125 2023-10-12 21:06:11,533 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1180578.0, ans=0.1 2023-10-12 21:06:11,558 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1180578.0, ans=0.125 2023-10-12 21:06:13,990 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1180578.0, ans=0.0 2023-10-12 21:06:19,790 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1180578.0, ans=0.2 2023-10-12 21:06:22,121 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1180578.0, ans=0.1 2023-10-12 21:06:35,265 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.07 vs. limit=22.5 2023-10-12 21:06:38,309 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.55 vs. limit=15.0 2023-10-12 21:06:52,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1180671.3333333333, ans=0.0 2023-10-12 21:07:02,594 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.768e+02 1.907e+02 2.122e+02 2.783e+02, threshold=3.814e+02, percent-clipped=0.0 2023-10-12 21:07:13,614 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.23 vs. limit=15.0 2023-10-12 21:07:35,317 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=1180858.0, ans=0.05 2023-10-12 21:07:37,602 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1180858.0, ans=0.125 2023-10-12 21:07:48,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1180904.6666666667, ans=0.125 2023-10-12 21:07:49,126 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.04 vs. 
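
Each Whitening line reports a measured "whiteness" of a layer's activation covariance against the limit at which the module starts pushing back. One plausible metric consistent with these numbers, though not guaranteed to be the exact formula in scaling.py, is E[lambda^2] / E[lambda]^2 over the covariance eigenvalues: exactly 1.0 when the covariance is a multiple of the identity, and growing as the spectrum spreads:

    import torch

    def whitening_metric(x: torch.Tensor) -> float:
        """Whiteness of the feature covariance of x (frames x channels):
        mean squared covariance eigenvalue divided by the squared mean
        eigenvalue. Equals 1.0 for an identity-like covariance; larger
        values mean a less 'white' distribution. A plausible reading of
        the logged metric, not a verbatim copy of scaling.py."""
        x = x - x.mean(dim=0)
        cov = (x.T @ x) / x.shape[0]
        # E[lambda] = trace(C)/n and E[lambda^2] = trace(C^2)/n, so no
        # explicit eigendecomposition is needed.
        mean_eig = torch.diagonal(cov).mean()
        mean_eig_sq = torch.diagonal(cov @ cov).mean()
        return (mean_eig_sq / (mean_eig ** 2 + 1e-20)).item()

    white = torch.randn(4096, 384)
    skewed = white * torch.linspace(0.1, 3.0, 384)   # unequal variances
    print(whitening_metric(white))    # close to 1.0
    print(whitening_metric(skewed))   # noticeably larger
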
limit=15.0 2023-10-12 21:07:54,367 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1180904.6666666667, ans=0.125 2023-10-12 21:08:45,262 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1181091.3333333333, ans=0.2 2023-10-12 21:08:46,444 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.90 vs. limit=10.0 2023-10-12 21:08:58,165 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1181138.0, ans=0.0 2023-10-12 21:09:08,462 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.483e+02 1.740e+02 1.906e+02 2.170e+02 3.636e+02, threshold=3.811e+02, percent-clipped=0.0 2023-10-12 21:09:18,909 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1181231.3333333333, ans=0.035 2023-10-12 21:09:35,463 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1181278.0, ans=0.125 2023-10-12 21:09:36,340 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1181278.0, ans=10.0 2023-10-12 21:09:45,716 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1181324.6666666667, ans=0.0 2023-10-12 21:09:46,651 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1181324.6666666667, ans=0.1 2023-10-12 21:09:47,725 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.90 vs. limit=15.0 2023-10-12 21:09:55,140 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1181371.3333333333, ans=0.1 2023-10-12 21:10:17,450 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1181464.6666666667, ans=0.125 2023-10-12 21:10:17,496 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1181464.6666666667, ans=0.2 2023-10-12 21:10:46,338 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1181604.6666666667, ans=0.125 2023-10-12 21:10:58,818 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.488e+02 1.810e+02 2.042e+02 2.391e+02 4.638e+02, threshold=4.083e+02, percent-clipped=1.0 2023-10-12 21:11:15,109 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1181698.0, ans=0.2 2023-10-12 21:11:16,574 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.46 vs. limit=22.5 2023-10-12 21:11:19,654 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1181744.6666666667, ans=0.1 2023-10-12 21:11:36,804 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.03 vs. 
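
The optim.py Clipping_scale lines summarize the recent distribution of gradient norms as five quantiles (min, 25%, median, 75%, max). In every such entry the threshold equals Clipping_scale times the median, up to display rounding: just above, 2.0 x 1.906e+02 ≈ 3.811e+02, and further down 2.0 x 1.800e+02 = 3.600e+02. percent-clipped is the share of recent batches whose norm exceeded that threshold. A standalone sketch of this bookkeeping; in icefall the real logic lives inside the optimizer itself:

    from collections import deque
    import math

    class MedianGradClipper:
        """Clip gradients against clipping_scale * median of recent
        gradient norms, mirroring the 'grad-norm quartiles ... threshold
        ... percent-clipped' log lines (a standalone sketch, not the
        optimizer's actual code path)."""

        def __init__(self, clipping_scale: float = 2.0, history: int = 128):
            self.clipping_scale = clipping_scale
            self.norms = deque(maxlen=history)
            self.num_clipped = 0
            self.num_steps = 0

        def clip_(self, params) -> float:
            params = [p for p in params if p.grad is not None]
            norm = math.sqrt(sum(p.grad.pow(2).sum().item() for p in params))
            self.norms.append(norm)
            median = sorted(self.norms)[len(self.norms) // 2]
            threshold = self.clipping_scale * median
            self.num_steps += 1
            if norm > threshold:
                self.num_clipped += 1
                for p in params:
                    p.grad.mul_(threshold / norm)
            return threshold

    # percent-clipped as logged corresponds to
    # 100.0 * clipper.num_clipped / clipper.num_steps over the report window.
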
limit=22.5 2023-10-12 21:11:39,809 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1181838.0, ans=0.1 2023-10-12 21:11:51,698 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.38 vs. limit=10.0 2023-10-12 21:12:05,491 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1181931.3333333333, ans=0.125 2023-10-12 21:12:24,597 INFO [train.py:1031] (0/4) Epoch 19, batch 7500, loss[loss=0.1858, simple_loss=0.2712, pruned_loss=0.05021, over 16708.00 frames. ], tot_loss[loss=0.1918, simple_loss=0.2829, pruned_loss=0.05039, over 32034473.21 frames. ], batch size: 56, lr: 1.82e-03, grad_scale: 16.0 2023-10-12 21:12:29,242 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1182024.6666666667, ans=0.1 2023-10-12 21:12:37,302 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.70 vs. limit=15.0 2023-10-12 21:12:38,077 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1182071.3333333333, ans=0.1 2023-10-12 21:12:50,319 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.476e+02 1.780e+02 1.931e+02 2.112e+02 4.330e+02, threshold=3.863e+02, percent-clipped=1.0 2023-10-12 21:13:23,945 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.45 vs. limit=15.0 2023-10-12 21:13:27,287 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1182258.0, ans=0.125 2023-10-12 21:13:36,106 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1182304.6666666667, ans=0.1 2023-10-12 21:14:16,559 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1182444.6666666667, ans=0.0 2023-10-12 21:14:18,560 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1182491.3333333333, ans=0.0 2023-10-12 21:14:40,577 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1182584.6666666667, ans=0.0 2023-10-12 21:14:44,099 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.404e+02 1.649e+02 1.835e+02 2.025e+02 2.612e+02, threshold=3.669e+02, percent-clipped=0.0 2023-10-12 21:15:12,867 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1182678.0, ans=0.2 2023-10-12 21:15:18,736 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1182678.0, ans=0.0 2023-10-12 21:15:25,636 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.02 vs. limit=15.0 2023-10-12 21:15:38,341 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.98 vs. 
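
In the train.py progress lines, loss=... over N frames describes the current batch, while tot_loss=... over M frames is a frame-weighted aggregate. The frame total hovers near 32 million and creeps up by only ~1.7e5 per 500 batches even though each batch contributes ~16k frames, which is the signature of exponential forgetting: with decay 1 - 1/window per batch, the frame total saturates near window x (frames per batch). A sketch under that reading; window = 2000 is an assumption, not a value read from train.py:

    class DecayingFrameLoss:
        """Frame-weighted loss tracker with exponential forgetting, one
        way to read the tot_loss[...] fields: history decays by
        (1 - 1/window) per batch, so the frame total tends toward
        window * (avg frames per batch) -- here ~32M, as in the log.
        The window value is an assumption."""

        def __init__(self, window: float = 2000.0):
            self.decay = 1.0 - 1.0 / window
            self.loss_frames = 0.0   # decayed sum of loss * frames
            self.frames = 0.0        # decayed sum of frames

        def update(self, loss: float, num_frames: float):
            self.loss_frames = self.decay * self.loss_frames + loss * num_frames
            self.frames = self.decay * self.frames + num_frames

        @property
        def value(self) -> float:
            return self.loss_frames / max(self.frames, 1.0)

    tracker = DecayingFrameLoss()
    for _ in range(10000):
        tracker.update(0.19, 16000)
    # Frame total saturates near 2000 * 16000 = 32e6, matching the log:
    print(f"tot_loss={tracker.value:.4f} over {tracker.frames:.0f} frames")
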
limit=22.5 2023-10-12 21:15:41,629 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1182771.3333333333, ans=0.04949747468305833 2023-10-12 21:15:55,992 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.12 vs. limit=22.5 2023-10-12 21:16:15,629 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1182911.3333333333, ans=0.125 2023-10-12 21:16:31,134 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1182958.0, ans=0.0 2023-10-12 21:16:38,342 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1183004.6666666667, ans=0.125 2023-10-12 21:16:38,413 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1183004.6666666667, ans=0.0 2023-10-12 21:16:54,163 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.399e+02 1.730e+02 1.909e+02 2.159e+02 3.204e+02, threshold=3.817e+02, percent-clipped=0.0 2023-10-12 21:16:55,947 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1183051.3333333333, ans=0.125 2023-10-12 21:17:24,324 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.80 vs. limit=6.0 2023-10-12 21:17:29,910 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.76 vs. limit=15.0 2023-10-12 21:17:37,220 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1183238.0, ans=0.125 2023-10-12 21:17:44,846 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1183238.0, ans=0.0 2023-10-12 21:17:53,476 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.60 vs. limit=12.0 2023-10-12 21:18:02,149 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1183331.3333333333, ans=10.0 2023-10-12 21:18:18,965 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1183424.6666666667, ans=0.125 2023-10-12 21:18:20,331 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.03 vs. limit=15.0 2023-10-12 21:18:23,427 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1183424.6666666667, ans=0.0 2023-10-12 21:18:25,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1183424.6666666667, ans=0.1 2023-10-12 21:18:25,740 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1183424.6666666667, ans=0.125 2023-10-12 21:18:29,255 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.10 vs. 
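
The bypass.* entries (skip_rate ≈ 0.0495, scale_min = 0.2, plus bypass_mid.* for a mid-layer tap) belong to the residual bypass around each encoder layer. One consistent reading, sketched below with no claim to match zipformer's implementation exactly: input and output are mixed through a learned per-channel scale clamped to at least scale_min, and during training the layer is occasionally skipped outright with probability skip_rate:

    import torch

    def bypass(x_in: torch.Tensor, x_out: torch.Tensor,
               scale: torch.Tensor, scale_min: float = 0.2,
               skip_rate: float = 0.049, training: bool = True) -> torch.Tensor:
        """Residual bypass as suggested by the bypass.* log entries
        (an illustrative reading, not zipformer's actual code)."""
        if training and torch.rand(()) < skip_rate:
            return x_in                        # skip the layer this step
        s = scale.clamp(min=scale_min, max=1.0)
        return x_in + s * (x_out - x_in)       # s=1 keeps the layer output

    channels = 384
    scale = torch.full((channels,), 0.5)       # learned in practice
    x_in, x_out = torch.randn(10, channels), torch.randn(10, channels)
    y = bypass(x_in, x_out, scale)
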
limit=22.5 2023-10-12 21:18:49,327 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.459e+02 1.877e+02 2.039e+02 2.225e+02 2.841e+02, threshold=4.078e+02, percent-clipped=0.0 2023-10-12 21:18:58,043 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=5.99 vs. limit=15.0 2023-10-12 21:19:08,935 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1183564.6666666667, ans=0.0 2023-10-12 21:19:29,900 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1183658.0, ans=0.0 2023-10-12 21:19:32,177 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.15 vs. limit=15.0 2023-10-12 21:19:45,875 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1183704.6666666667, ans=0.125 2023-10-12 21:19:58,211 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.38 vs. limit=15.0 2023-10-12 21:20:00,204 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.84 vs. limit=12.0 2023-10-12 21:20:08,303 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.10 vs. limit=22.5 2023-10-12 21:20:14,137 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1183844.6666666667, ans=0.0 2023-10-12 21:20:23,399 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1183891.3333333333, ans=0.0 2023-10-12 21:20:41,834 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.45 vs. limit=6.0 2023-10-12 21:20:43,415 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.50 vs. 
limit=15.0 2023-10-12 21:20:49,712 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.648e+02 1.800e+02 1.959e+02 2.635e+02, threshold=3.600e+02, percent-clipped=0.0 2023-10-12 21:20:58,974 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1184031.3333333333, ans=0.1 2023-10-12 21:21:03,985 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1184031.3333333333, ans=0.125 2023-10-12 21:21:13,365 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1184078.0, ans=10.0 2023-10-12 21:21:20,743 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=1184078.0, ans=0.1 2023-10-12 21:21:24,313 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1184124.6666666667, ans=0.035 2023-10-12 21:21:31,278 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1184124.6666666667, ans=0.125 2023-10-12 21:22:11,780 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1184311.3333333333, ans=0.125 2023-10-12 21:22:15,585 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1184311.3333333333, ans=0.1 2023-10-12 21:22:18,066 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1184311.3333333333, ans=0.125 2023-10-12 21:22:21,583 INFO [train.py:1031] (0/4) Epoch 19, batch 8000, loss[loss=0.1876, simple_loss=0.2791, pruned_loss=0.04802, over 16893.00 frames. ], tot_loss[loss=0.1908, simple_loss=0.282, pruned_loss=0.04978, over 32174276.39 frames. ], batch size: 130, lr: 1.82e-03, grad_scale: 32.0 2023-10-12 21:22:23,103 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.37 vs. limit=6.0 2023-10-12 21:22:39,475 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1184404.6666666667, ans=0.0 2023-10-12 21:22:46,296 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.333e+02 1.624e+02 1.821e+02 2.030e+02 3.131e+02, threshold=3.642e+02, percent-clipped=0.0 2023-10-12 21:22:49,566 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1184451.3333333333, ans=0.125 2023-10-12 21:22:54,055 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1184498.0, ans=0.1 2023-10-12 21:22:55,328 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.96 vs. limit=15.0 2023-10-12 21:22:58,086 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1184498.0, ans=0.125 2023-10-12 21:23:50,330 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.79 vs. 
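
grad_scale in the progress lines is the dynamic fp16 loss-scaling factor, and it moves between reports: 16.0 at batches 7000 and 7500, 32.0 at batch 8000 just above. A scaler of this kind grows the factor while gradients stay finite and halves it on overflow, which is standard torch.cuda.amp behaviour; the model and optimizer below are placeholders, not the training script's:

    import torch

    model = torch.nn.Linear(80, 500)                 # placeholder model
    opt = torch.optim.AdamW(model.parameters(), lr=1.8e-3)
    scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())

    for _ in range(10):
        x = torch.randn(8, 80)
        with torch.cuda.amp.autocast(enabled=torch.cuda.is_available()):
            loss = model(x).pow(2).mean()
        opt.zero_grad()
        scaler.scale(loss).backward()   # gradients computed at loss * scale
        scaler.step(opt)                # unscales; skips the step on inf/nan
        scaler.update()                 # halves scale on overflow, else grows
        print("grad_scale:", scaler.get_scale())
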
limit=15.0 2023-10-12 21:23:51,152 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1184731.3333333333, ans=15.0 2023-10-12 21:23:55,325 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 21:24:10,280 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1184824.6666666667, ans=0.2 2023-10-12 21:24:11,222 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1184824.6666666667, ans=0.125 2023-10-12 21:24:13,001 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1184824.6666666667, ans=0.05 2023-10-12 21:24:13,068 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-12 21:24:19,143 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.58 vs. limit=22.5 2023-10-12 21:24:19,575 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1184871.3333333333, ans=0.1 2023-10-12 21:24:31,664 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1184918.0, ans=0.0 2023-10-12 21:24:32,798 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.404e+02 1.700e+02 1.983e+02 2.371e+02 3.151e+02, threshold=3.965e+02, percent-clipped=0.0 2023-10-12 21:24:36,994 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1184918.0, ans=0.1 2023-10-12 21:25:29,719 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1185104.6666666667, ans=0.0 2023-10-12 21:26:00,214 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.37 vs. 
limit=22.5 2023-10-12 21:26:06,535 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1185244.6666666667, ans=0.125 2023-10-12 21:26:15,754 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1185244.6666666667, ans=0.125 2023-10-12 21:26:41,140 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1185338.0, ans=0.2 2023-10-12 21:26:43,969 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1185384.6666666667, ans=0.0 2023-10-12 21:26:47,017 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.481e+02 1.666e+02 1.873e+02 2.016e+02 2.756e+02, threshold=3.747e+02, percent-clipped=0.0 2023-10-12 21:26:50,944 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1185384.6666666667, ans=0.0 2023-10-12 21:27:16,629 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1185478.0, ans=0.0 2023-10-12 21:27:45,302 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1185618.0, ans=0.0 2023-10-12 21:27:52,797 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.39 vs. limit=12.0 2023-10-12 21:27:57,206 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1185664.6666666667, ans=0.0 2023-10-12 21:28:10,875 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1185711.3333333333, ans=0.04949747468305833 2023-10-12 21:28:39,247 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 21:28:40,306 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1185804.6666666667, ans=0.95 2023-10-12 21:28:43,535 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1185851.3333333333, ans=0.5 2023-10-12 21:28:44,515 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1185851.3333333333, ans=0.125 2023-10-12 21:28:45,693 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 21:28:46,378 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.740e+02 1.900e+02 2.254e+02 3.034e+02, threshold=3.800e+02, percent-clipped=0.0 2023-10-12 21:28:50,386 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1185851.3333333333, ans=0.125 2023-10-12 21:28:54,930 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1185898.0, ans=0.07 2023-10-12 21:28:59,352 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.34 vs. 
limit=15.0 2023-10-12 21:29:01,998 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1185898.0, ans=0.125 2023-10-12 21:29:06,729 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1185944.6666666667, ans=0.0 2023-10-12 21:29:22,724 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten.whitening_limit, batch_count=1185991.3333333333, ans=15.0 2023-10-12 21:29:28,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1186038.0, ans=0.0 2023-10-12 21:29:29,367 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1186038.0, ans=0.0 2023-10-12 21:29:34,015 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.78 vs. limit=15.0 2023-10-12 21:29:49,595 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1186131.3333333333, ans=0.125 2023-10-12 21:29:49,710 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1186131.3333333333, ans=0.125 2023-10-12 21:29:56,581 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1186131.3333333333, ans=0.125 2023-10-12 21:30:01,721 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1186178.0, ans=0.2 2023-10-12 21:30:07,234 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1186178.0, ans=0.025 2023-10-12 21:30:41,747 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.481e+02 1.739e+02 1.926e+02 2.125e+02 2.852e+02, threshold=3.852e+02, percent-clipped=0.0 2023-10-12 21:30:47,300 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.46 vs. limit=10.0 2023-10-12 21:31:46,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1186598.0, ans=0.0 2023-10-12 21:31:48,320 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.03 vs. limit=15.0 2023-10-12 21:31:48,829 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1186598.0, ans=0.0 2023-10-12 21:32:12,934 INFO [train.py:1031] (0/4) Epoch 19, batch 8500, loss[loss=0.2039, simple_loss=0.2904, pruned_loss=0.05867, over 16945.00 frames. ], tot_loss[loss=0.191, simple_loss=0.2824, pruned_loss=0.04982, over 32301639.75 frames. 
], batch size: 123, lr: 1.82e-03, grad_scale: 16.0 2023-10-12 21:32:15,482 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1186691.3333333333, ans=0.125 2023-10-12 21:32:17,438 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1186691.3333333333, ans=0.125 2023-10-12 21:32:32,325 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1186738.0, ans=0.125 2023-10-12 21:32:32,379 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1186738.0, ans=0.0 2023-10-12 21:32:38,425 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.478e+02 1.814e+02 1.957e+02 2.179e+02 2.910e+02, threshold=3.913e+02, percent-clipped=0.0 2023-10-12 21:32:58,815 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1186878.0, ans=0.125 2023-10-12 21:33:23,236 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1186971.3333333333, ans=0.1 2023-10-12 21:33:32,373 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1186971.3333333333, ans=0.125 2023-10-12 21:33:36,687 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.69 vs. limit=15.0 2023-10-12 21:33:39,714 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1187018.0, ans=0.125 2023-10-12 21:33:43,274 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1187018.0, ans=0.2 2023-10-12 21:33:45,172 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1187064.6666666667, ans=0.2 2023-10-12 21:33:48,139 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1187064.6666666667, ans=0.0 2023-10-12 21:34:43,627 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 1.771e+02 1.970e+02 2.465e+02 3.662e+02, threshold=3.941e+02, percent-clipped=0.0 2023-10-12 21:34:57,121 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1187298.0, ans=0.0 2023-10-12 21:34:58,566 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1187298.0, ans=0.125 2023-10-12 21:35:00,316 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1187298.0, ans=0.125 2023-10-12 21:35:12,730 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1187391.3333333333, ans=0.0 2023-10-12 21:35:28,680 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.07 vs. limit=12.0 2023-10-12 21:35:33,278 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.63 vs. 
limit=6.0 2023-10-12 21:35:45,430 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1187484.6666666667, ans=0.125 2023-10-12 21:35:46,796 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1187484.6666666667, ans=0.125 2023-10-12 21:35:57,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1187531.3333333333, ans=0.125 2023-10-12 21:36:00,912 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.79 vs. limit=12.0 2023-10-12 21:36:06,156 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.00 vs. limit=15.0 2023-10-12 21:36:18,765 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1187624.6666666667, ans=0.0 2023-10-12 21:36:20,894 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1187624.6666666667, ans=0.125 2023-10-12 21:36:33,189 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1187671.3333333333, ans=0.125 2023-10-12 21:36:47,103 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.704e+02 1.883e+02 2.095e+02 2.945e+02, threshold=3.767e+02, percent-clipped=0.0 2023-10-12 21:36:59,553 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.57 vs. limit=22.5 2023-10-12 21:37:03,814 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.41 vs. limit=15.0 2023-10-12 21:37:29,459 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1187904.6666666667, ans=0.0 2023-10-12 21:37:31,707 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1187904.6666666667, ans=0.1 2023-10-12 21:37:31,805 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1187904.6666666667, ans=0.1 2023-10-12 21:38:10,555 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.21 vs. 
limit=15.0 2023-10-12 21:38:16,660 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1188044.6666666667, ans=0.125 2023-10-12 21:38:38,704 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1188138.0, ans=0.125 2023-10-12 21:38:41,310 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1188184.6666666667, ans=0.1 2023-10-12 21:38:42,712 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1188184.6666666667, ans=0.125 2023-10-12 21:38:45,528 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.685e+02 1.965e+02 2.209e+02 3.271e+02, threshold=3.930e+02, percent-clipped=0.0 2023-10-12 21:38:50,242 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1188231.3333333333, ans=0.05 2023-10-12 21:38:58,392 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 21:39:03,108 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1188278.0, ans=0.125 2023-10-12 21:39:10,847 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1188278.0, ans=0.0 2023-10-12 21:39:12,786 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1188324.6666666667, ans=0.2 2023-10-12 21:39:20,771 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1188324.6666666667, ans=0.1 2023-10-12 21:39:22,670 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1188371.3333333333, ans=0.0 2023-10-12 21:39:24,557 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.59 vs. limit=15.0 2023-10-12 21:39:30,230 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1188371.3333333333, ans=0.0 2023-10-12 21:39:53,390 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1188464.6666666667, ans=0.0 2023-10-12 21:40:18,507 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1188604.6666666667, ans=0.0 2023-10-12 21:40:36,061 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.512e+02 1.747e+02 1.952e+02 2.143e+02 2.989e+02, threshold=3.905e+02, percent-clipped=0.0 2023-10-12 21:40:50,865 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1188698.0, ans=0.125 2023-10-12 21:40:54,448 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.27 vs. 
limit=10.0 2023-10-12 21:41:29,889 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1188884.6666666667, ans=0.125 2023-10-12 21:41:43,216 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1188931.3333333333, ans=0.2 2023-10-12 21:42:05,746 INFO [train.py:1031] (0/4) Epoch 19, batch 9000, loss[loss=0.2229, simple_loss=0.3142, pruned_loss=0.06584, over 16043.00 frames. ], tot_loss[loss=0.1907, simple_loss=0.282, pruned_loss=0.04969, over 32421710.55 frames. ], batch size: 296, lr: 1.82e-03, grad_scale: 8.0 2023-10-12 21:42:17,597 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1189071.3333333333, ans=0.125 2023-10-12 21:42:30,804 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.72 vs. limit=15.0 2023-10-12 21:42:33,955 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.448e+02 1.814e+02 1.993e+02 2.342e+02 3.365e+02, threshold=3.986e+02, percent-clipped=0.0 2023-10-12 21:42:40,620 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1189164.6666666667, ans=0.125 2023-10-12 21:42:52,208 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1189211.3333333333, ans=0.125 2023-10-12 21:43:02,643 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.06 vs. limit=15.0 2023-10-12 21:43:05,653 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1189258.0, ans=0.125 2023-10-12 21:43:06,691 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1189258.0, ans=0.2 2023-10-12 21:43:13,318 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1189304.6666666667, ans=10.0 2023-10-12 21:43:18,793 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.24 vs. limit=15.0 2023-10-12 21:44:03,114 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1189491.3333333333, ans=0.2 2023-10-12 21:44:13,065 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1189538.0, ans=0.0 2023-10-12 21:44:21,591 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.403e+02 1.734e+02 1.874e+02 2.121e+02 2.518e+02, threshold=3.749e+02, percent-clipped=0.0 2023-10-12 21:44:35,624 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.71 vs. limit=15.0 2023-10-12 21:44:37,494 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.96 vs. 
limit=12.0 2023-10-12 21:44:40,705 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1189678.0, ans=0.125 2023-10-12 21:44:41,599 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1189678.0, ans=0.1 2023-10-12 21:44:59,728 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1189771.3333333333, ans=0.125 2023-10-12 21:45:17,769 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1189818.0, ans=0.125 2023-10-12 21:45:37,409 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1189911.3333333333, ans=0.0 2023-10-12 21:45:41,880 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1189911.3333333333, ans=0.125 2023-10-12 21:46:05,773 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1190051.3333333333, ans=0.0 2023-10-12 21:46:10,585 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 1.782e+02 1.982e+02 2.351e+02 3.767e+02, threshold=3.965e+02, percent-clipped=1.0 2023-10-12 21:46:15,500 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1190098.0, ans=0.125 2023-10-12 21:46:38,137 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1190191.3333333333, ans=0.125 2023-10-12 21:46:40,112 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1190191.3333333333, ans=0.125 2023-10-12 21:46:50,319 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1190238.0, ans=0.125 2023-10-12 21:46:51,316 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1190238.0, ans=0.0 2023-10-12 21:46:51,318 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1190238.0, ans=0.1 2023-10-12 21:47:01,861 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1190284.6666666667, ans=0.2 2023-10-12 21:47:13,867 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.45 vs. limit=10.0 2023-10-12 21:47:40,032 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=1190471.3333333333, ans=0.5 2023-10-12 21:47:43,579 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1190471.3333333333, ans=0.125 2023-10-12 21:47:55,676 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.00 vs. 
limit=15.0 2023-10-12 21:47:58,968 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.803e+02 1.973e+02 2.178e+02 3.103e+02, threshold=3.946e+02, percent-clipped=0.0 2023-10-12 21:48:05,327 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1190564.6666666667, ans=0.05 2023-10-12 21:48:39,308 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1190658.0, ans=0.0 2023-10-12 21:48:42,214 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1190658.0, ans=0.125 2023-10-12 21:48:45,271 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1190704.6666666667, ans=0.0 2023-10-12 21:48:57,350 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.39 vs. limit=15.0 2023-10-12 21:48:58,360 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1190704.6666666667, ans=0.0 2023-10-12 21:49:04,115 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1190751.3333333333, ans=0.1 2023-10-12 21:49:21,409 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1190798.0, ans=0.125 2023-10-12 21:49:48,805 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1190891.3333333333, ans=0.1 2023-10-12 21:49:51,028 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1190938.0, ans=0.1 2023-10-12 21:50:02,332 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1190984.6666666667, ans=0.0 2023-10-12 21:50:08,339 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.524e+02 1.786e+02 1.969e+02 2.151e+02 3.515e+02, threshold=3.938e+02, percent-clipped=0.0 2023-10-12 21:50:34,713 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1191078.0, ans=0.0 2023-10-12 21:50:37,234 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1191124.6666666667, ans=0.0 2023-10-12 21:50:44,760 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1191124.6666666667, ans=0.0 2023-10-12 21:50:56,112 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1191171.3333333333, ans=0.0 2023-10-12 21:51:08,130 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1191218.0, ans=0.125 2023-10-12 21:51:14,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1191264.6666666667, ans=0.0 2023-10-12 21:51:37,005 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1191311.3333333333, ans=0.125 2023-10-12 21:51:39,873 INFO [train.py:1031] (0/4) Epoch 19, batch 9500, 
loss[loss=0.202, simple_loss=0.2947, pruned_loss=0.05465, over 15978.00 frames. ], tot_loss[loss=0.1916, simple_loss=0.283, pruned_loss=0.05012, over 32511044.86 frames. ], batch size: 296, lr: 1.82e-03, grad_scale: 8.0 2023-10-12 21:51:45,262 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1191358.0, ans=0.125 2023-10-12 21:51:47,051 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1191358.0, ans=0.125 2023-10-12 21:51:49,940 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1191358.0, ans=0.1 2023-10-12 21:52:09,468 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.561e+02 1.745e+02 1.884e+02 2.005e+02 2.558e+02, threshold=3.768e+02, percent-clipped=0.0 2023-10-12 21:52:17,791 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=1191498.0, ans=0.05 2023-10-12 21:52:39,371 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.39 vs. limit=15.0 2023-10-12 21:52:49,578 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1191638.0, ans=0.1 2023-10-12 21:52:49,608 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1191638.0, ans=0.125 2023-10-12 21:53:01,532 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.70 vs. limit=15.0 2023-10-12 21:53:34,712 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1191824.6666666667, ans=0.125 2023-10-12 21:53:55,512 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.82 vs. limit=15.0 2023-10-12 21:54:01,945 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1191918.0, ans=0.125 2023-10-12 21:54:07,281 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.543e+02 1.812e+02 1.991e+02 2.284e+02 3.020e+02, threshold=3.982e+02, percent-clipped=0.0 2023-10-12 21:54:17,517 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1191964.6666666667, ans=0.0 2023-10-12 21:54:25,199 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1192011.3333333333, ans=0.125 2023-10-12 21:54:44,412 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1192104.6666666667, ans=0.0 2023-10-12 21:54:57,872 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1192151.3333333333, ans=0.0 2023-10-12 21:55:02,273 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.72 vs. 
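
The learning rate decays very slowly at this depth of training (lr: 1.83e-03 at batch 7000, 1.82e-03 here, 1.81e-03 by batch 10000). Assuming icefall's Eden schedule produced these values, the rate is an inverse-quartic decay in both optimizer steps and epochs; with illustrative settings it lands on the logged value:

    def eden_lr(base_lr: float, batch: int, epoch: int,
                lr_batches: float = 7500.0, lr_epochs: float = 1.0) -> float:
        """Eden-style schedule (assumed, not read from the code): smooth
        inverse-quartic decay in both batches and epochs."""
        batch_factor = ((batch / lr_batches) ** 2 + 1) ** -0.25
        epoch_factor = ((epoch / lr_epochs) ** 2 + 1) ** -0.25
        return base_lr * batch_factor * epoch_factor

    # With illustrative settings (base_lr=0.045, lr_batches=7500,
    # lr_epochs=1), 18 completed epochs and the ~256k optimizer steps
    # implied by the checkpoint name below give the logged value:
    print(f"{eden_lr(0.045, 256_000, 18):.2e}")   # ~1.81e-03
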
limit=22.5 2023-10-12 21:55:08,324 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=1192198.0, ans=0.025 2023-10-12 21:55:15,703 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1192198.0, ans=0.1 2023-10-12 21:55:25,830 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1192244.6666666667, ans=0.035 2023-10-12 21:55:34,521 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1192291.3333333333, ans=0.0 2023-10-12 21:55:54,977 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1192338.0, ans=0.125 2023-10-12 21:55:57,559 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1192384.6666666667, ans=0.1 2023-10-12 21:56:04,682 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1192384.6666666667, ans=0.125 2023-10-12 21:56:05,141 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.376e+02 1.832e+02 1.943e+02 2.178e+02 2.882e+02, threshold=3.886e+02, percent-clipped=0.0 2023-10-12 21:56:12,313 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1192431.3333333333, ans=0.0 2023-10-12 21:56:53,848 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1192618.0, ans=0.0 2023-10-12 21:57:11,079 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1192664.6666666667, ans=0.125 2023-10-12 21:57:16,881 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1192664.6666666667, ans=0.0 2023-10-12 21:57:36,205 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1192758.0, ans=0.125 2023-10-12 21:57:49,502 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1192804.6666666667, ans=0.125 2023-10-12 21:58:03,547 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.759e+02 1.949e+02 2.211e+02 3.322e+02, threshold=3.899e+02, percent-clipped=0.0 2023-10-12 21:58:05,303 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1192851.3333333333, ans=0.125 2023-10-12 21:58:26,916 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1192944.6666666667, ans=0.125 2023-10-12 21:58:28,660 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1192944.6666666667, ans=0.125 2023-10-12 21:58:30,942 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1192944.6666666667, ans=0.0 2023-10-12 21:58:34,544 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.54 vs. 
limit=15.0 2023-10-12 21:58:35,866 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1192991.3333333333, ans=0.1 2023-10-12 21:58:56,843 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1193084.6666666667, ans=0.0 2023-10-12 21:58:57,082 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.21 vs. limit=12.0 2023-10-12 21:59:17,069 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1193131.3333333333, ans=0.125 2023-10-12 21:59:18,010 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1193131.3333333333, ans=0.0 2023-10-12 21:59:19,776 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1193131.3333333333, ans=0.2 2023-10-12 21:59:46,419 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1193271.3333333333, ans=0.1 2023-10-12 21:59:54,971 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=1193318.0, ans=0.05 2023-10-12 21:59:58,250 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.24 vs. limit=12.0 2023-10-12 22:00:03,377 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.563e+02 1.787e+02 2.024e+02 2.272e+02 3.095e+02, threshold=4.047e+02, percent-clipped=0.0 2023-10-12 22:00:19,935 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 22:00:27,394 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1193458.0, ans=0.0 2023-10-12 22:00:38,177 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1193504.6666666667, ans=0.125 2023-10-12 22:00:45,520 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1193504.6666666667, ans=0.0 2023-10-12 22:00:53,645 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1193551.3333333333, ans=0.0 2023-10-12 22:00:55,511 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1193551.3333333333, ans=0.0 2023-10-12 22:01:20,210 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1193644.6666666667, ans=0.1 2023-10-12 22:01:23,721 INFO [train.py:1031] (0/4) Epoch 19, batch 10000, loss[loss=0.2122, simple_loss=0.3009, pruned_loss=0.06172, over 16889.00 frames. ], tot_loss[loss=0.1912, simple_loss=0.2824, pruned_loss=0.05001, over 32573564.52 frames. 
], batch size: 130, lr: 1.81e-03, grad_scale: 32.0 2023-10-12 22:01:27,340 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1193691.3333333333, ans=0.1 2023-10-12 22:01:43,553 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.85 vs. limit=22.5 2023-10-12 22:01:51,269 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.23 vs. limit=10.0 2023-10-12 22:01:57,711 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.468e+02 1.714e+02 1.863e+02 2.060e+02 2.907e+02, threshold=3.726e+02, percent-clipped=0.0 2023-10-12 22:01:58,547 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.57 vs. limit=22.5 2023-10-12 22:01:59,293 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.40 vs. limit=6.0 2023-10-12 22:02:03,507 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1193831.3333333333, ans=0.125 2023-10-12 22:02:07,948 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1193831.3333333333, ans=0.1 2023-10-12 22:02:12,352 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1193878.0, ans=0.2 2023-10-12 22:02:16,503 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.70 vs. limit=15.0 2023-10-12 22:02:24,877 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1193924.6666666667, ans=0.0 2023-10-12 22:02:27,429 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1193924.6666666667, ans=0.125 2023-10-12 22:02:30,146 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1193924.6666666667, ans=0.125 2023-10-12 22:02:33,708 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1193971.3333333333, ans=0.0 2023-10-12 22:02:35,666 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1193971.3333333333, ans=0.125 2023-10-12 22:02:38,221 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.02 vs. limit=22.5 2023-10-12 22:02:41,948 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1193971.3333333333, ans=0.2 2023-10-12 22:03:19,081 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1194111.3333333333, ans=0.0 2023-10-12 22:03:55,632 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.81 vs. 
limit=6.0 2023-10-12 22:03:58,563 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1194251.3333333333, ans=0.0 2023-10-12 22:04:05,780 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.730e+02 1.894e+02 2.125e+02 4.373e+02, threshold=3.787e+02, percent-clipped=1.0 2023-10-12 22:04:08,161 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 22:04:14,743 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1194298.0, ans=0.0 2023-10-12 22:04:24,580 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1194344.6666666667, ans=0.125 2023-10-12 22:04:34,210 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1194391.3333333333, ans=0.0 2023-10-12 22:04:34,258 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1194391.3333333333, ans=0.07 2023-10-12 22:05:00,113 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.25 vs. limit=10.0 2023-10-12 22:05:19,374 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1194578.0, ans=0.125 2023-10-12 22:05:21,231 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1194578.0, ans=0.125 2023-10-12 22:05:30,406 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1194624.6666666667, ans=0.125 2023-10-12 22:05:40,517 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.62 vs. limit=15.0 2023-10-12 22:05:41,367 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-256000.pt 2023-10-12 22:05:58,588 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=1194671.3333333333, ans=10.0 2023-10-12 22:06:11,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1194718.0, ans=0.0 2023-10-12 22:06:12,472 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.473e+02 1.750e+02 1.894e+02 2.090e+02 2.837e+02, threshold=3.788e+02, percent-clipped=0.0 2023-10-12 22:06:33,313 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1194811.3333333333, ans=0.125 2023-10-12 22:07:00,995 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1194904.6666666667, ans=0.2 2023-10-12 22:07:08,361 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.72 vs. limit=15.0 2023-10-12 22:07:47,433 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.83 vs. 
limit=15.0 2023-10-12 22:08:03,069 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1195138.0, ans=0.025 2023-10-12 22:08:20,941 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.490e+02 1.772e+02 1.928e+02 2.175e+02 2.736e+02, threshold=3.855e+02, percent-clipped=0.0 2023-10-12 22:08:22,395 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1195231.3333333333, ans=0.5 2023-10-12 22:08:25,244 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1195231.3333333333, ans=0.125 2023-10-12 22:08:28,067 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1195231.3333333333, ans=0.0 2023-10-12 22:08:47,163 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.31 vs. limit=22.5 2023-10-12 22:08:56,151 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1195324.6666666667, ans=0.125 2023-10-12 22:09:57,126 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1195558.0, ans=0.125 2023-10-12 22:10:08,873 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1195604.6666666667, ans=0.0 2023-10-12 22:10:09,903 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1195604.6666666667, ans=10.0 2023-10-12 22:10:11,155 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1195651.3333333333, ans=0.125 2023-10-12 22:10:23,856 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.759e+02 1.875e+02 2.077e+02 2.636e+02, threshold=3.750e+02, percent-clipped=0.0 2023-10-12 22:10:44,181 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1195744.6666666667, ans=0.125 2023-10-12 22:10:44,333 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1195744.6666666667, ans=0.125 2023-10-12 22:10:46,371 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1195744.6666666667, ans=0.125 2023-10-12 22:10:59,782 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=1195791.3333333333, ans=0.02 2023-10-12 22:11:02,082 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 22:11:21,345 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1195884.6666666667, ans=0.0 2023-10-12 22:11:36,778 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1195931.3333333333, ans=0.125 2023-10-12 22:11:38,833 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.14 vs. 
limit=15.0 2023-10-12 22:11:52,247 INFO [train.py:1031] (0/4) Epoch 19, batch 10500, loss[loss=0.201, simple_loss=0.292, pruned_loss=0.05496, over 16872.00 frames. ], tot_loss[loss=0.1913, simple_loss=0.2826, pruned_loss=0.04999, over 32595658.45 frames. ], batch size: 110, lr: 1.81e-03, grad_scale: 16.0 2023-10-12 22:11:54,681 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1196024.6666666667, ans=0.2 2023-10-12 22:12:37,057 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.507e+02 1.776e+02 1.918e+02 2.187e+02 2.697e+02, threshold=3.836e+02, percent-clipped=0.0 2023-10-12 22:12:37,340 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1196164.6666666667, ans=0.0 2023-10-12 22:12:38,407 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1196164.6666666667, ans=0.09899494936611666 2023-10-12 22:12:46,211 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.23 vs. limit=15.0 2023-10-12 22:12:53,105 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1196211.3333333333, ans=0.125 2023-10-12 22:12:57,112 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1196211.3333333333, ans=0.1 2023-10-12 22:13:14,812 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1196304.6666666667, ans=0.1 2023-10-12 22:13:15,610 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1196304.6666666667, ans=0.0 2023-10-12 22:13:34,723 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 22:13:35,170 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.53 vs. 
limit=15.0 2023-10-12 22:13:56,345 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1196398.0, ans=0.125 2023-10-12 22:13:59,052 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1196398.0, ans=0.1 2023-10-12 22:14:02,949 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1196444.6666666667, ans=0.2 2023-10-12 22:14:11,004 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1196444.6666666667, ans=0.0 2023-10-12 22:14:12,123 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1196444.6666666667, ans=0.025 2023-10-12 22:14:28,125 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1196538.0, ans=0.2 2023-10-12 22:14:34,681 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1196538.0, ans=0.1 2023-10-12 22:14:51,872 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.787e+02 1.984e+02 2.107e+02 3.024e+02, threshold=3.968e+02, percent-clipped=0.0 2023-10-12 22:14:55,476 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1196631.3333333333, ans=0.0 2023-10-12 22:15:16,242 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.82 vs. limit=22.5 2023-10-12 22:15:38,209 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1196818.0, ans=0.125 2023-10-12 22:15:51,511 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1196818.0, ans=0.125 2023-10-12 22:16:15,665 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1196911.3333333333, ans=0.125 2023-10-12 22:16:24,236 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1196958.0, ans=0.1 2023-10-12 22:16:33,782 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1197004.6666666667, ans=0.125 2023-10-12 22:16:53,803 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.888e+02 2.130e+02 2.470e+02 3.787e+02, threshold=4.261e+02, percent-clipped=0.0 2023-10-12 22:17:03,062 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.77 vs. 
limit=15.0 2023-10-12 22:17:04,347 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1197098.0, ans=0.125 2023-10-12 22:17:32,937 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1197238.0, ans=0.0 2023-10-12 22:17:34,217 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1197238.0, ans=0.125 2023-10-12 22:17:52,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1197284.6666666667, ans=0.125 2023-10-12 22:18:12,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1197378.0, ans=0.125 2023-10-12 22:18:16,564 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.78 vs. limit=10.0 2023-10-12 22:18:44,771 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1197518.0, ans=0.1 2023-10-12 22:18:49,797 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.474e+02 1.842e+02 1.998e+02 2.167e+02 2.763e+02, threshold=3.996e+02, percent-clipped=0.0 2023-10-12 22:19:06,032 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.45 vs. limit=22.5 2023-10-12 22:19:18,977 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1197658.0, ans=0.125 2023-10-12 22:19:32,431 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1197704.6666666667, ans=0.125 2023-10-12 22:19:37,440 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=1197704.6666666667, ans=0.025 2023-10-12 22:19:49,682 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1197751.3333333333, ans=0.0 2023-10-12 22:19:58,404 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1197798.0, ans=0.0 2023-10-12 22:20:05,378 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.98 vs. limit=15.0 2023-10-12 22:20:31,643 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1197938.0, ans=0.125 2023-10-12 22:20:44,552 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1197984.6666666667, ans=10.0 2023-10-12 22:20:54,690 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.639e+02 1.787e+02 1.929e+02 2.792e+02, threshold=3.573e+02, percent-clipped=0.0 2023-10-12 22:20:58,121 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1198031.3333333333, ans=0.1 2023-10-12 22:21:02,860 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.70 vs. 
limit=15.0 2023-10-12 22:21:12,528 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1198078.0, ans=0.2 2023-10-12 22:21:28,790 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=1198124.6666666667, ans=10.0 2023-10-12 22:21:45,917 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1198218.0, ans=0.125 2023-10-12 22:21:53,845 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-10-12 22:22:04,807 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1198311.3333333333, ans=0.125 2023-10-12 22:22:15,132 INFO [train.py:1031] (0/4) Epoch 19, batch 11000, loss[loss=0.2117, simple_loss=0.3056, pruned_loss=0.05885, over 16821.00 frames. ], tot_loss[loss=0.1914, simple_loss=0.2827, pruned_loss=0.05009, over 32638013.65 frames. ], batch size: 146, lr: 1.81e-03, grad_scale: 16.0 2023-10-12 22:22:19,281 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1198358.0, ans=0.125 2023-10-12 22:22:24,904 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.90 vs. limit=15.0 2023-10-12 22:22:27,047 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1198404.6666666667, ans=0.0 2023-10-12 22:22:36,788 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1198451.3333333333, ans=0.0 2023-10-12 22:22:50,426 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.817e+02 1.986e+02 2.225e+02 2.965e+02, threshold=3.971e+02, percent-clipped=0.0 2023-10-12 22:22:53,713 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1198498.0, ans=0.125 2023-10-12 22:22:54,591 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1198498.0, ans=0.125 2023-10-12 22:22:57,488 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1198498.0, ans=0.125 2023-10-12 22:22:58,398 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1198498.0, ans=0.125 2023-10-12 22:22:59,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1198498.0, ans=0.125 2023-10-12 22:23:03,346 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.41 vs. limit=10.0 2023-10-12 22:23:21,219 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1198591.3333333333, ans=0.1 2023-10-12 22:23:37,189 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=4.88 vs. 
limit=15.0 2023-10-12 22:23:38,300 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1198638.0, ans=0.125 2023-10-12 22:23:40,855 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1198684.6666666667, ans=0.125 2023-10-12 22:23:47,955 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.05 vs. limit=15.0 2023-10-12 22:23:58,268 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1198731.3333333333, ans=0.125 2023-10-12 22:24:24,492 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1198824.6666666667, ans=0.2 2023-10-12 22:24:49,050 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1198918.0, ans=0.09899494936611666 2023-10-12 22:24:54,311 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1198918.0, ans=0.125 2023-10-12 22:24:58,142 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.670e+02 1.835e+02 2.100e+02 2.698e+02, threshold=3.670e+02, percent-clipped=0.0 2023-10-12 22:25:08,151 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1198964.6666666667, ans=0.2 2023-10-12 22:25:10,123 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.08 vs. limit=6.0 2023-10-12 22:26:25,757 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1199198.0, ans=0.125 2023-10-12 22:26:57,409 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1199244.6666666667, ans=0.125 2023-10-12 22:27:09,059 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1199291.3333333333, ans=0.2 2023-10-12 22:27:10,197 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1199291.3333333333, ans=0.125 2023-10-12 22:27:10,381 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.10 vs. limit=12.0 2023-10-12 22:27:23,452 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1199384.6666666667, ans=0.0 2023-10-12 22:27:32,773 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.22 vs. limit=12.0 2023-10-12 22:27:38,290 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.738e+02 1.859e+02 1.998e+02 2.829e+02, threshold=3.717e+02, percent-clipped=0.0 2023-10-12 22:27:47,859 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.30 vs. 
limit=15.0 2023-10-12 22:27:58,950 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1199524.6666666667, ans=0.07 2023-10-12 22:28:06,715 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1199524.6666666667, ans=0.125 2023-10-12 22:28:14,933 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1199571.3333333333, ans=0.2 2023-10-12 22:28:21,362 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1199571.3333333333, ans=0.125 2023-10-12 22:28:26,181 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.96 vs. limit=15.0 2023-10-12 22:28:38,028 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1199618.0, ans=0.0 2023-10-12 22:28:43,797 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1199664.6666666667, ans=0.0 2023-10-12 22:28:51,817 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1199664.6666666667, ans=0.125 2023-10-12 22:29:38,979 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.36 vs. limit=10.0 2023-10-12 22:29:42,539 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1199758.0, ans=0.2 2023-10-12 22:29:48,333 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1199804.6666666667, ans=0.0 2023-10-12 22:29:51,421 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1199804.6666666667, ans=0.1 2023-10-12 22:29:54,494 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1199804.6666666667, ans=0.2 2023-10-12 22:30:09,768 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1199851.3333333333, ans=0.125 2023-10-12 22:30:15,185 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.455e+02 1.680e+02 1.903e+02 2.210e+02 3.202e+02, threshold=3.807e+02, percent-clipped=0.0 2023-10-12 22:30:39,178 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1199944.6666666667, ans=0.0 2023-10-12 22:30:40,591 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1199991.3333333333, ans=0.125 2023-10-12 22:30:45,243 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1199991.3333333333, ans=0.1 2023-10-12 22:30:54,634 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.24 vs. 
limit=15.0 2023-10-12 22:30:56,616 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1200038.0, ans=0.1 2023-10-12 22:31:09,454 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1200084.6666666667, ans=0.125 2023-10-12 22:31:32,827 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1200178.0, ans=0.0 2023-10-12 22:31:38,998 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1200178.0, ans=0.125 2023-10-12 22:31:48,492 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1200224.6666666667, ans=0.125 2023-10-12 22:31:56,793 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1200271.3333333333, ans=0.125 2023-10-12 22:31:59,895 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.93 vs. limit=10.0 2023-10-12 22:32:04,604 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1200271.3333333333, ans=0.0 2023-10-12 22:32:11,268 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1200318.0, ans=0.1 2023-10-12 22:32:16,499 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1200318.0, ans=0.0 2023-10-12 22:32:17,950 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1200318.0, ans=0.125 2023-10-12 22:32:21,959 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.82 vs. limit=15.0 2023-10-12 22:32:22,383 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.847e+02 2.052e+02 2.155e+02 3.101e+02, threshold=4.103e+02, percent-clipped=0.0 2023-10-12 22:32:45,158 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=12.02 vs. limit=15.0 2023-10-12 22:33:05,954 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1200504.6666666667, ans=0.0 2023-10-12 22:33:10,114 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.04 vs. limit=15.0 2023-10-12 22:33:39,964 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.52 vs. limit=15.0 2023-10-12 22:33:42,623 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1200644.6666666667, ans=0.1 2023-10-12 22:33:45,930 INFO [train.py:1031] (0/4) Epoch 19, batch 11500, loss[loss=0.1812, simple_loss=0.2833, pruned_loss=0.03959, over 16865.00 frames. ], tot_loss[loss=0.1912, simple_loss=0.2824, pruned_loss=0.04994, over 32655645.85 frames. 
], batch size: 146, lr: 1.81e-03, grad_scale: 16.0 2023-10-12 22:34:01,714 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.58 vs. limit=15.0 2023-10-12 22:34:06,347 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1200784.6666666667, ans=0.125 2023-10-12 22:34:15,570 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.74 vs. limit=12.0 2023-10-12 22:34:20,175 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.486e+02 1.794e+02 1.951e+02 2.213e+02 4.157e+02, threshold=3.902e+02, percent-clipped=1.0 2023-10-12 22:34:20,572 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1200831.3333333333, ans=10.0 2023-10-12 22:34:30,556 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1200878.0, ans=0.0 2023-10-12 22:34:33,104 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.43 vs. limit=15.0 2023-10-12 22:34:46,945 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1200924.6666666667, ans=0.1 2023-10-12 22:34:56,342 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1200971.3333333333, ans=0.2 2023-10-12 22:35:27,891 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1201064.6666666667, ans=0.0 2023-10-12 22:35:36,368 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1201111.3333333333, ans=10.0 2023-10-12 22:35:38,970 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1201111.3333333333, ans=0.125 2023-10-12 22:35:59,068 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1201158.0, ans=0.1 2023-10-12 22:36:19,874 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.79 vs. limit=15.0 2023-10-12 22:36:21,715 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1201251.3333333333, ans=0.2 2023-10-12 22:36:22,641 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1201298.0, ans=0.2 2023-10-12 22:36:24,890 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.357e+02 1.738e+02 1.841e+02 2.038e+02 2.691e+02, threshold=3.681e+02, percent-clipped=0.0 2023-10-12 22:36:34,155 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1201344.6666666667, ans=0.125 2023-10-12 22:36:48,671 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.54 vs. 
limit=15.0 2023-10-12 22:36:52,505 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1201391.3333333333, ans=0.125 2023-10-12 22:37:09,468 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1201484.6666666667, ans=0.125 2023-10-12 22:37:14,636 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=1201484.6666666667, ans=10.0 2023-10-12 22:37:18,283 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1201484.6666666667, ans=0.1 2023-10-12 22:37:18,517 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.38 vs. limit=15.0 2023-10-12 22:37:34,160 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1201578.0, ans=0.0 2023-10-12 22:38:16,241 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 1.717e+02 1.884e+02 2.129e+02 3.304e+02, threshold=3.769e+02, percent-clipped=0.0 2023-10-12 22:38:18,643 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1201764.6666666667, ans=0.0 2023-10-12 22:38:20,733 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1201764.6666666667, ans=0.125 2023-10-12 22:38:44,620 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1201858.0, ans=0.125 2023-10-12 22:38:46,037 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1201858.0, ans=0.0 2023-10-12 22:39:21,965 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1201951.3333333333, ans=0.0 2023-10-12 22:39:34,767 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1201998.0, ans=0.125 2023-10-12 22:39:42,878 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1202044.6666666667, ans=0.125 2023-10-12 22:39:44,830 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=1202044.6666666667, ans=0.025 2023-10-12 22:39:45,783 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1202044.6666666667, ans=0.2 2023-10-12 22:40:15,919 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.94 vs. 
limit=10.0 2023-10-12 22:40:16,601 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1202184.6666666667, ans=0.2 2023-10-12 22:40:17,702 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1202184.6666666667, ans=0.0 2023-10-12 22:40:21,875 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1202184.6666666667, ans=0.0 2023-10-12 22:40:29,294 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.27 vs. limit=15.0 2023-10-12 22:40:32,954 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.432e+02 1.694e+02 1.841e+02 2.030e+02 2.551e+02, threshold=3.681e+02, percent-clipped=0.0 2023-10-12 22:40:35,190 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.87 vs. limit=22.5 2023-10-12 22:40:40,823 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1202231.3333333333, ans=0.035 2023-10-12 22:40:53,582 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1202278.0, ans=0.0 2023-10-12 22:41:00,863 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1202324.6666666667, ans=0.125 2023-10-12 22:41:42,474 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1202464.6666666667, ans=0.07 2023-10-12 22:41:56,890 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.46 vs. limit=6.0 2023-10-12 22:41:58,808 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1202558.0, ans=0.2 2023-10-12 22:42:09,580 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.29 vs. limit=15.0 2023-10-12 22:42:21,670 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1202651.3333333333, ans=0.0 2023-10-12 22:42:33,714 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.471e+02 1.759e+02 1.919e+02 2.116e+02 3.023e+02, threshold=3.838e+02, percent-clipped=0.0 2023-10-12 22:42:48,424 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.63 vs. 
limit=5.0 2023-10-12 22:42:50,915 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1202744.6666666667, ans=0.1 2023-10-12 22:42:54,816 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1202744.6666666667, ans=0.125 2023-10-12 22:43:13,658 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 22:43:17,619 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1202838.0, ans=0.125 2023-10-12 22:43:20,719 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1202884.6666666667, ans=0.125 2023-10-12 22:43:24,039 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1202884.6666666667, ans=0.125 2023-10-12 22:43:28,746 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 22:43:55,768 INFO [train.py:1031] (0/4) Epoch 19, batch 12000, loss[loss=0.1919, simple_loss=0.2814, pruned_loss=0.05116, over 16886.00 frames. ], tot_loss[loss=0.191, simple_loss=0.2825, pruned_loss=0.0497, over 32693392.18 frames. ], batch size: 72, lr: 1.81e-03, grad_scale: 32.0 2023-10-12 22:43:56,327 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1203024.6666666667, ans=0.125 2023-10-12 22:44:15,755 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1203071.3333333333, ans=0.125 2023-10-12 22:44:34,269 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.380e+02 1.783e+02 1.956e+02 2.155e+02 3.074e+02, threshold=3.911e+02, percent-clipped=0.0 2023-10-12 22:45:04,093 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 22:45:17,870 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1203304.6666666667, ans=0.125 2023-10-12 22:45:19,154 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.68 vs. limit=15.0 2023-10-12 22:45:26,445 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.43 vs. limit=22.5 2023-10-12 22:45:39,834 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1203398.0, ans=0.0 2023-10-12 22:45:39,905 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1203398.0, ans=0.0 2023-10-12 22:45:41,051 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.10 vs. 
limit=15.0 2023-10-12 22:45:45,353 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1203444.6666666667, ans=0.125 2023-10-12 22:45:55,123 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1203491.3333333333, ans=10.0 2023-10-12 22:46:08,355 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1203538.0, ans=0.125 2023-10-12 22:46:08,519 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.00 vs. limit=22.5 2023-10-12 22:46:15,226 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1203584.6666666667, ans=0.0 2023-10-12 22:46:18,418 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.58 vs. limit=15.0 2023-10-12 22:46:23,575 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1203631.3333333333, ans=0.1 2023-10-12 22:46:26,532 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.678e+02 1.808e+02 2.090e+02 3.027e+02, threshold=3.617e+02, percent-clipped=0.0 2023-10-12 22:46:34,247 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 22:46:37,253 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.09 vs. limit=10.0 2023-10-12 22:46:45,409 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1203724.6666666667, ans=0.125 2023-10-12 22:46:51,194 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1203724.6666666667, ans=0.1 2023-10-12 22:47:14,607 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.69 vs. limit=15.0 2023-10-12 22:47:36,312 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1203864.6666666667, ans=0.125 2023-10-12 22:47:45,694 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.57 vs. 
limit=15.0 2023-10-12 22:47:48,429 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1203911.3333333333, ans=10.0 2023-10-12 22:47:49,324 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1203958.0, ans=0.125 2023-10-12 22:47:50,399 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1203958.0, ans=0.125 2023-10-12 22:47:59,214 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1203958.0, ans=0.125 2023-10-12 22:48:00,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1203958.0, ans=0.0 2023-10-12 22:48:27,527 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.806e+02 1.996e+02 2.205e+02 3.319e+02, threshold=3.992e+02, percent-clipped=0.0 2023-10-12 22:48:31,501 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=1204098.0, ans=0.5 2023-10-12 22:48:33,582 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1204098.0, ans=0.1 2023-10-12 22:48:50,630 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1204191.3333333333, ans=0.125 2023-10-12 22:49:11,334 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1204284.6666666667, ans=0.125 2023-10-12 22:49:12,138 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1204284.6666666667, ans=0.0 2023-10-12 22:50:08,242 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1204471.3333333333, ans=0.2 2023-10-12 22:50:09,601 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.25 vs. limit=22.5 2023-10-12 22:50:17,256 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1204518.0, ans=0.1 2023-10-12 22:50:24,430 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.748e+02 1.987e+02 2.168e+02 2.757e+02, threshold=3.974e+02, percent-clipped=0.0 2023-10-12 22:51:47,744 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1204891.3333333333, ans=0.0 2023-10-12 22:51:59,338 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.99 vs. limit=22.5 2023-10-12 22:52:07,512 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1204938.0, ans=0.0 2023-10-12 22:52:11,879 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.67 vs. 
limit=15.0 2023-10-12 22:52:26,635 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1205031.3333333333, ans=0.0 2023-10-12 22:52:28,873 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.567e+02 1.765e+02 1.914e+02 2.120e+02 2.910e+02, threshold=3.828e+02, percent-clipped=0.0 2023-10-12 22:52:30,687 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=5.19 vs. limit=15.0 2023-10-12 22:52:41,092 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1205078.0, ans=0.2 2023-10-12 22:52:56,819 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2.whitening_limit, batch_count=1205124.6666666667, ans=15.0 2023-10-12 22:53:21,572 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1205218.0, ans=0.035 2023-10-12 22:53:34,951 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.35 vs. limit=12.0 2023-10-12 22:53:51,984 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1205358.0, ans=0.1 2023-10-12 22:53:53,411 INFO [train.py:1031] (0/4) Epoch 19, batch 12500, loss[loss=0.1801, simple_loss=0.2777, pruned_loss=0.04127, over 16871.00 frames. ], tot_loss[loss=0.1906, simple_loss=0.282, pruned_loss=0.0496, over 32692131.95 frames. ], batch size: 87, lr: 1.81e-03, grad_scale: 32.0 2023-10-12 22:54:04,358 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1205404.6666666667, ans=0.125 2023-10-12 22:54:14,735 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1205404.6666666667, ans=0.125 2023-10-12 22:54:32,722 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.447e+02 1.730e+02 1.873e+02 2.080e+02 3.168e+02, threshold=3.745e+02, percent-clipped=0.0 2023-10-12 22:54:41,232 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1205544.6666666667, ans=0.1 2023-10-12 22:55:04,929 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=9.84 vs. limit=22.5 2023-10-12 22:55:06,978 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.89 vs. 
limit=6.0 2023-10-12 22:55:30,677 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1205731.3333333333, ans=0.125 2023-10-12 22:55:46,656 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1205778.0, ans=0.035 2023-10-12 22:55:55,772 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1205824.6666666667, ans=0.0 2023-10-12 22:56:00,036 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1205824.6666666667, ans=0.1 2023-10-12 22:56:01,724 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.93 vs. limit=22.5 2023-10-12 22:56:02,315 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1205871.3333333333, ans=0.1 2023-10-12 22:56:04,642 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.32 vs. limit=15.0 2023-10-12 22:56:11,259 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1205871.3333333333, ans=0.0 2023-10-12 22:56:26,289 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1205918.0, ans=0.1 2023-10-12 22:56:31,636 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1205964.6666666667, ans=0.125 2023-10-12 22:56:32,223 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.488e+02 1.755e+02 1.874e+02 2.175e+02 2.894e+02, threshold=3.749e+02, percent-clipped=0.0 2023-10-12 22:56:47,640 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=1206011.3333333333, ans=0.95 2023-10-12 22:57:02,428 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1206058.0, ans=0.0 2023-10-12 22:57:15,128 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1206104.6666666667, ans=0.125 2023-10-12 22:57:43,198 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1206244.6666666667, ans=0.125 2023-10-12 22:57:44,291 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 22:57:59,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1206291.3333333333, ans=0.2 2023-10-12 22:58:17,042 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1206384.6666666667, ans=0.05 2023-10-12 22:58:23,041 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1206384.6666666667, ans=0.035 2023-10-12 22:58:27,026 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff2.min_abs, batch_count=1206384.6666666667, ans=0.1 2023-10-12 22:58:29,890 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.4.out_combiner.scale_min, batch_count=1206431.3333333333, ans=0.2 2023-10-12 22:58:32,466 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.476e+02 1.751e+02 1.923e+02 2.097e+02 3.428e+02, threshold=3.845e+02, percent-clipped=0.0 2023-10-12 22:58:36,886 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1206431.3333333333, ans=0.1 2023-10-12 22:58:43,137 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.59 vs. limit=15.0 2023-10-12 22:58:58,485 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.93 vs. limit=15.0 2023-10-12 22:59:38,094 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1206711.3333333333, ans=0.2 2023-10-12 23:00:27,601 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.893e+02 2.185e+02 2.410e+02 3.248e+02, threshold=4.370e+02, percent-clipped=0.0 2023-10-12 23:00:32,447 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1206898.0, ans=0.125 2023-10-12 23:01:01,388 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1207038.0, ans=0.2 2023-10-12 23:01:01,582 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.55 vs. limit=12.0 2023-10-12 23:01:10,917 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1207038.0, ans=0.2 2023-10-12 23:01:33,947 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1207131.3333333333, ans=0.1 2023-10-12 23:01:52,044 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.89 vs. limit=15.0 2023-10-12 23:01:56,025 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1207224.6666666667, ans=0.0 2023-10-12 23:02:09,159 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.42 vs. limit=15.0 2023-10-12 23:02:23,374 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.95 vs. 
limit=15.0 2023-10-12 23:02:27,316 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.652e+02 1.853e+02 2.126e+02 3.167e+02, threshold=3.705e+02, percent-clipped=0.0 2023-10-12 23:02:31,563 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1207364.6666666667, ans=0.0 2023-10-12 23:03:15,246 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1207551.3333333333, ans=0.125 2023-10-12 23:03:26,782 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1207598.0, ans=0.0 2023-10-12 23:03:47,177 INFO [train.py:1031] (0/4) Epoch 19, batch 13000, loss[loss=0.1883, simple_loss=0.2764, pruned_loss=0.05009, over 15758.00 frames. ], tot_loss[loss=0.1914, simple_loss=0.2829, pruned_loss=0.04995, over 32720270.60 frames. ], batch size: 36, lr: 1.80e-03, grad_scale: 32.0 2023-10-12 23:03:49,444 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.63 vs. limit=15.0 2023-10-12 23:03:50,000 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1207691.3333333333, ans=0.0 2023-10-12 23:03:56,497 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.15 vs. limit=6.0 2023-10-12 23:04:02,600 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1207738.0, ans=0.125 2023-10-12 23:04:04,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1207738.0, ans=0.125 2023-10-12 23:04:05,375 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1207738.0, ans=0.2 2023-10-12 23:04:12,339 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1207738.0, ans=0.05 2023-10-12 23:04:23,516 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.99 vs. limit=22.5 2023-10-12 23:04:34,940 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.401e+02 1.765e+02 1.951e+02 2.198e+02 2.914e+02, threshold=3.902e+02, percent-clipped=0.0 2023-10-12 23:04:35,308 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=1207831.3333333333, ans=0.02 2023-10-12 23:04:47,528 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1207878.0, ans=0.125 2023-10-12 23:04:47,550 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1207878.0, ans=0.1 2023-10-12 23:04:50,473 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.56 vs. 
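
Each optim.py:471 entry prints five statistics of recent gradient norms followed by a clipping threshold; the five values read like min/25%/median/75%/max, and the threshold tracks Clipping_scale times the median (2.0 * 1.853e+02 ~= 3.705e+02 in the entry just above), with percent-clipped the fraction of recent batches whose norm exceeded it. A sketch of such a scheme, assuming a sliding window of global gradient norms; the class name and window size are illustrative, not icefall's actual API:

import collections
import torch

class MedianGradClipper:
    def __init__(self, clipping_scale: float = 2.0, window: int = 128):
        self.clipping_scale = clipping_scale
        self.norms = collections.deque(maxlen=window)  # recent global grad norms

    def clip_(self, params):
        params = [p for p in params if p.grad is not None]
        # Global 2-norm over all parameters, as in clip_grad_norm_.
        norm = torch.norm(torch.stack([p.grad.norm() for p in params]))
        self.norms.append(norm.item())
        q = torch.quantile(torch.tensor(list(self.norms)),
                           torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = self.clipping_scale * q[2]  # Clipping_scale * median
        if norm > threshold:
            for p in params:
                p.grad.mul_(threshold / norm)
        return q, threshold  # the quartiles and threshold the log reports
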
limit=15.0 2023-10-12 23:04:55,512 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1207878.0, ans=0.1 2023-10-12 23:05:17,864 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1207971.3333333333, ans=0.0 2023-10-12 23:05:31,919 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1208018.0, ans=0.125 2023-10-12 23:06:41,872 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.390e+02 1.614e+02 1.778e+02 1.975e+02 2.768e+02, threshold=3.557e+02, percent-clipped=0.0 2023-10-12 23:06:42,785 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.98 vs. limit=15.0 2023-10-12 23:06:53,112 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1208344.6666666667, ans=0.125 2023-10-12 23:06:56,513 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1208344.6666666667, ans=0.125 2023-10-12 23:07:08,049 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1208391.3333333333, ans=0.125 2023-10-12 23:07:11,527 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1208391.3333333333, ans=0.125 2023-10-12 23:07:23,153 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1208438.0, ans=0.0 2023-10-12 23:07:32,255 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1208484.6666666667, ans=0.1 2023-10-12 23:07:37,211 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 23:07:44,441 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=1208531.3333333333, ans=10.0 2023-10-12 23:07:52,778 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1208531.3333333333, ans=0.125 2023-10-12 23:08:06,864 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1208578.0, ans=0.1 2023-10-12 23:08:13,189 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.03 vs. limit=12.0 2023-10-12 23:08:21,412 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1208671.3333333333, ans=0.0 2023-10-12 23:08:30,644 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1208671.3333333333, ans=0.125 2023-10-12 23:08:41,627 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1208718.0, ans=0.0 2023-10-12 23:08:41,895 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.53 vs. 
limit=15.0 2023-10-12 23:08:48,246 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.440e+02 1.698e+02 1.854e+02 2.054e+02 2.727e+02, threshold=3.708e+02, percent-clipped=0.0 2023-10-12 23:09:09,193 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1208858.0, ans=0.125 2023-10-12 23:09:09,232 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1208858.0, ans=0.2 2023-10-12 23:09:21,048 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1208904.6666666667, ans=0.125 2023-10-12 23:09:22,113 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1208904.6666666667, ans=0.125 2023-10-12 23:09:23,027 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1208904.6666666667, ans=0.2 2023-10-12 23:09:23,067 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1208904.6666666667, ans=0.125 2023-10-12 23:09:23,074 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1208904.6666666667, ans=0.5 2023-10-12 23:09:27,027 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1208904.6666666667, ans=0.2 2023-10-12 23:09:28,976 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1208904.6666666667, ans=0.125 2023-10-12 23:10:08,798 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-12 23:10:43,041 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1209184.6666666667, ans=0.0 2023-10-12 23:10:44,008 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1209184.6666666667, ans=0.035 2023-10-12 23:10:47,149 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1209231.3333333333, ans=0.125 2023-10-12 23:10:49,450 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.751e+02 1.882e+02 2.089e+02 2.854e+02, threshold=3.763e+02, percent-clipped=0.0 2023-10-12 23:10:51,940 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1209231.3333333333, ans=0.2 2023-10-12 23:11:09,315 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1209324.6666666667, ans=0.125 2023-10-12 23:11:12,679 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1209324.6666666667, ans=0.0 2023-10-12 23:11:16,525 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.45 vs. 
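
The scaling.py:199 entries are ScheduledFloat reports: each named scalar (dropout probabilities, *_skip_rate values, balancer bounds, whitening limits) is not fixed but scheduled against batch_count, and "ans" is its current value; by batch_count ~ 1.2e6 most of the skip rates above have settled at their final values, often 0.0. A piecewise-linear schedule in that spirit, as a simplified sketch rather than a copy of icefall's scaling.py (the breakpoints in the docstring are made up):

def scheduled_float(batch_count: float, points) -> float:
    """points: [(batch_count, value), ...] sorted by batch_count,
    e.g. [(0.0, 0.2), (4000.0, 0.0)] for a skip rate that starts at
    0.2 and decays linearly to 0.0."""
    if batch_count <= points[0][0]:
        return points[0][1]
    if batch_count >= points[-1][0]:
        return points[-1][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= batch_count <= x1:
            # Linear interpolation between adjacent breakpoints.
            return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)
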
limit=15.0 2023-10-12 23:11:43,783 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1209464.6666666667, ans=0.125 2023-10-12 23:11:45,051 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1209464.6666666667, ans=0.125 2023-10-12 23:11:51,879 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1209464.6666666667, ans=0.1 2023-10-12 23:11:58,200 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1209511.3333333333, ans=0.2 2023-10-12 23:12:19,472 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1209604.6666666667, ans=0.1 2023-10-12 23:12:34,121 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.59 vs. limit=6.0 2023-10-12 23:12:37,951 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=15.62 vs. limit=22.5 2023-10-12 23:12:42,236 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1209698.0, ans=0.125 2023-10-12 23:12:47,160 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.406e+02 1.764e+02 1.919e+02 2.102e+02 2.912e+02, threshold=3.837e+02, percent-clipped=0.0 2023-10-12 23:12:53,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1209698.0, ans=0.125 2023-10-12 23:12:55,559 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1209744.6666666667, ans=0.125 2023-10-12 23:13:02,358 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1209744.6666666667, ans=0.125 2023-10-12 23:13:13,712 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1209791.3333333333, ans=0.0 2023-10-12 23:13:13,723 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1209791.3333333333, ans=0.125 2023-10-12 23:13:22,306 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1209838.0, ans=0.0 2023-10-12 23:13:46,400 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1209931.3333333333, ans=0.125 2023-10-12 23:13:46,690 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.91 vs. limit=15.0 2023-10-12 23:13:52,109 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.09 vs. limit=6.0 2023-10-12 23:14:02,014 INFO [train.py:1031] (0/4) Epoch 19, batch 13500, loss[loss=0.1982, simple_loss=0.289, pruned_loss=0.05364, over 16811.00 frames. ], tot_loss[loss=0.191, simple_loss=0.2823, pruned_loss=0.04989, over 32697371.80 frames. 
], batch size: 175, lr: 1.80e-03, grad_scale: 16.0 2023-10-12 23:14:12,741 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.67 vs. limit=15.0 2023-10-12 23:14:21,803 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1210071.3333333333, ans=0.125 2023-10-12 23:14:35,819 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.86 vs. limit=22.5 2023-10-12 23:14:43,727 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.770e+02 1.932e+02 2.121e+02 2.751e+02, threshold=3.864e+02, percent-clipped=0.0 2023-10-12 23:14:48,696 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1210164.6666666667, ans=0.125 2023-10-12 23:15:16,082 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=1210304.6666666667, ans=0.05 2023-10-12 23:15:17,388 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.10 vs. limit=15.0 2023-10-12 23:15:31,608 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.24 vs. limit=15.0 2023-10-12 23:15:38,905 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1210398.0, ans=0.0 2023-10-12 23:15:40,673 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1210398.0, ans=0.125 2023-10-12 23:16:09,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1210491.3333333333, ans=0.125 2023-10-12 23:16:22,183 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1210538.0, ans=0.0 2023-10-12 23:16:25,776 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1210584.6666666667, ans=0.1 2023-10-12 23:16:30,872 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1210584.6666666667, ans=0.125 2023-10-12 23:16:36,693 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.91 vs. limit=15.0 2023-10-12 23:16:37,058 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.772e+02 1.955e+02 2.136e+02 2.905e+02, threshold=3.909e+02, percent-clipped=0.0 2023-10-12 23:16:53,233 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.max_positive, batch_count=1210724.6666666667, ans=0.95 2023-10-12 23:16:53,252 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1210724.6666666667, ans=0.125 2023-10-12 23:16:59,047 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/epoch-19.pt 2023-10-12 23:17:41,170 INFO [train.py:1031] (0/4) Epoch 20, batch 0, loss[loss=0.1674, simple_loss=0.2601, pruned_loss=0.03732, over 16879.00 frames. 
], tot_loss[loss=0.1674, simple_loss=0.2601, pruned_loss=0.03732, over 16879.00 frames. ], batch size: 82, lr: 1.75e-03, grad_scale: 32.0 2023-10-12 23:17:41,171 INFO [train.py:1054] (0/4) Computing validation loss 2023-10-12 23:17:50,662 INFO [train.py:1063] (0/4) Epoch 20, validation: loss=0.2148, simple_loss=0.3012, pruned_loss=0.06418, over 1020973.00 frames. 2023-10-12 23:17:50,663 INFO [train.py:1064] (0/4) Maximum memory allocated so far is 17165MB 2023-10-12 23:17:57,891 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1210748.0, ans=0.125 2023-10-12 23:18:06,994 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1210794.6666666667, ans=0.0 2023-10-12 23:18:08,823 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.65 vs. limit=15.0 2023-10-12 23:18:41,227 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1210888.0, ans=0.0 2023-10-12 23:18:45,338 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1210934.6666666667, ans=0.04949747468305833 2023-10-12 23:18:48,484 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1210934.6666666667, ans=0.125 2023-10-12 23:18:51,051 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1210934.6666666667, ans=0.125 2023-10-12 23:19:06,000 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1210981.3333333333, ans=0.125 2023-10-12 23:19:12,756 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1211028.0, ans=0.125 2023-10-12 23:19:22,398 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.78 vs. limit=10.0 2023-10-12 23:19:32,522 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 1.696e+02 1.854e+02 2.053e+02 3.223e+02, threshold=3.708e+02, percent-clipped=0.0 2023-10-12 23:19:46,314 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1211168.0, ans=0.125 2023-10-12 23:19:48,103 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1211168.0, ans=0.1 2023-10-12 23:20:01,998 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1211214.6666666667, ans=0.2 2023-10-12 23:20:12,554 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.58 vs. 
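
At the epoch boundary above, train.py saves epoch-19.pt and then opens Epoch 20 by computing a full validation loss at batch 0 and logging peak CUDA memory (17165MB here). The tot_loss fields are frame-weighted averages, which is why each summary also carries a cumulative frame count. A minimal sketch of such a validation pass, assuming a per-batch interface that returns a loss and a frame count (both the function and that interface are illustrative):

import torch

@torch.no_grad()
def compute_validation_loss(model, valid_loader) -> float:
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    for batch in valid_loader:
        loss, num_frames = model(batch)       # assumed per-batch interface
        tot_loss += loss.item() * num_frames  # frame-weighted, as in tot_loss
        tot_frames += num_frames
    model.train()
    return tot_loss / tot_frames

if torch.cuda.is_available():
    # Source of the "Maximum memory allocated so far" figure.
    peak_mb = torch.cuda.max_memory_allocated() // (1024 * 1024)
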
limit=15.0 2023-10-12 23:20:14,616 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1211261.3333333333, ans=0.07 2023-10-12 23:20:20,420 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1211308.0, ans=0.2 2023-10-12 23:20:26,605 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1211308.0, ans=0.0 2023-10-12 23:20:41,454 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1211354.6666666667, ans=0.125 2023-10-12 23:20:45,092 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1211401.3333333333, ans=0.125 2023-10-12 23:20:47,487 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1211401.3333333333, ans=0.125 2023-10-12 23:20:57,849 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1211448.0, ans=0.125 2023-10-12 23:21:04,025 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1211448.0, ans=0.125 2023-10-12 23:21:31,208 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.700e+02 1.867e+02 2.075e+02 2.646e+02, threshold=3.733e+02, percent-clipped=0.0 2023-10-12 23:21:52,287 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 23:22:29,571 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1211821.3333333333, ans=0.09899494936611666 2023-10-12 23:22:31,576 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1211821.3333333333, ans=0.1 2023-10-12 23:22:36,912 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1211868.0, ans=0.125 2023-10-12 23:22:50,638 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.90 vs. limit=8.0 2023-10-12 23:22:52,105 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.16 vs. limit=22.5 2023-10-12 23:22:58,748 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.22 vs. 
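
The scaling.py:979 Whitening entries compare a per-module "metric" against a (scheduled) limit; the metric measures how far the covariance of a layer's activations, optionally split into num_groups groups, is from isotropic, and gradients are nudged when it drifts too high. The exact formula lives in scaling.py; the stand-in below only captures the qualitative behaviour of being 1.0 for a perfectly white covariance and up to the group dimension when one direction dominates, so its details are assumptions:

import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
    """x: (num_frames, num_channels); returns a scalar >= 1, with 1.0
    meaning each group's covariance is proportional to the identity."""
    n, c = x.shape
    x = x.reshape(n, num_groups, c // num_groups)
    x = x - x.mean(dim=0, keepdim=True)
    cov = torch.einsum('ngi,ngj->gij', x, x) / n   # per-group covariance C
    d = cov.shape[-1]
    tr = cov.diagonal(dim1=-2, dim2=-1).sum(-1)    # tr(C)
    tr_sq = (cov * cov).sum(dim=(-2, -1))          # tr(C @ C), C symmetric
    return (d * tr_sq / tr.pow(2)).mean()

Under this reading, an entry like "metric=4.22 vs. limit=15.0" means the activations are comfortably within the limit and no correction is applied.
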
limit=15.0 2023-10-12 23:23:28,553 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.729e+02 1.884e+02 2.111e+02 2.882e+02, threshold=3.767e+02, percent-clipped=0.0 2023-10-12 23:23:28,923 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1212054.6666666667, ans=0.125 2023-10-12 23:23:33,575 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1212054.6666666667, ans=0.0 2023-10-12 23:23:38,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1212054.6666666667, ans=0.2 2023-10-12 23:23:57,810 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1212148.0, ans=0.125 2023-10-12 23:24:37,659 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.94 vs. limit=15.0 2023-10-12 23:25:23,146 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1212521.3333333333, ans=0.125 2023-10-12 23:25:23,708 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.454e+02 1.770e+02 1.959e+02 2.288e+02 3.198e+02, threshold=3.918e+02, percent-clipped=0.0 2023-10-12 23:25:36,689 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.32 vs. limit=15.0 2023-10-12 23:25:37,775 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.68 vs. limit=15.0 2023-10-12 23:25:39,973 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1212568.0, ans=0.2 2023-10-12 23:26:16,934 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1212708.0, ans=0.0 2023-10-12 23:26:22,164 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1212708.0, ans=0.125 2023-10-12 23:26:22,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1212708.0, ans=0.95 2023-10-12 23:26:54,815 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.28 vs. limit=6.0 2023-10-12 23:27:05,975 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.06 vs. limit=12.0 2023-10-12 23:27:06,179 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.94 vs. 
limit=15.0 2023-10-12 23:27:13,612 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1212941.3333333333, ans=0.0 2023-10-12 23:27:18,154 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1212941.3333333333, ans=0.125 2023-10-12 23:27:25,356 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1212988.0, ans=0.0 2023-10-12 23:27:26,110 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.480e+02 1.779e+02 1.950e+02 2.164e+02 2.885e+02, threshold=3.899e+02, percent-clipped=0.0 2023-10-12 23:27:28,020 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.12 vs. limit=12.0 2023-10-12 23:27:53,876 INFO [train.py:1031] (0/4) Epoch 20, batch 500, loss[loss=0.1966, simple_loss=0.2868, pruned_loss=0.05322, over 16889.00 frames. ], tot_loss[loss=0.1924, simple_loss=0.2834, pruned_loss=0.05067, over 7296129.76 frames. ], batch size: 110, lr: 1.75e-03, grad_scale: 16.0 2023-10-12 23:27:58,048 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1213081.3333333333, ans=0.125 2023-10-12 23:28:08,392 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.17 vs. limit=15.0 2023-10-12 23:28:25,272 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.67 vs. limit=15.0 2023-10-12 23:28:54,820 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1213268.0, ans=0.2 2023-10-12 23:29:52,494 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.526e+02 1.838e+02 1.952e+02 2.214e+02 2.730e+02, threshold=3.905e+02, percent-clipped=0.0 2023-10-12 23:30:04,856 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.87 vs. limit=15.0 2023-10-12 23:30:07,884 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.65 vs. limit=6.0 2023-10-12 23:30:13,920 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1213548.0, ans=0.1 2023-10-12 23:30:14,216 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.05 vs. 
limit=6.0 2023-10-12 23:30:16,788 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1213548.0, ans=0.2 2023-10-12 23:30:38,089 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1213641.3333333333, ans=0.125 2023-10-12 23:30:39,616 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn1.whiten.whitening_limit, batch_count=1213641.3333333333, ans=22.5 2023-10-12 23:31:04,698 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1213734.6666666667, ans=0.125 2023-10-12 23:31:05,150 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.80 vs. limit=15.0 2023-10-12 23:31:10,602 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1213781.3333333333, ans=0.07 2023-10-12 23:31:15,654 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1213781.3333333333, ans=0.1 2023-10-12 23:31:16,558 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1213781.3333333333, ans=0.0 2023-10-12 23:31:49,454 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.523e+02 1.858e+02 2.054e+02 2.299e+02 3.420e+02, threshold=4.108e+02, percent-clipped=0.0 2023-10-12 23:32:11,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1213968.0, ans=0.1 2023-10-12 23:32:23,770 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1214014.6666666667, ans=0.125 2023-10-12 23:32:24,129 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.27 vs. limit=15.0 2023-10-12 23:33:01,934 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1214154.6666666667, ans=0.1 2023-10-12 23:33:01,937 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1214154.6666666667, ans=0.0 2023-10-12 23:33:03,100 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1214154.6666666667, ans=0.125 2023-10-12 23:33:09,367 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1214154.6666666667, ans=0.1 2023-10-12 23:33:25,150 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1214248.0, ans=0.125 2023-10-12 23:33:25,532 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.30 vs. 
limit=12.0 2023-10-12 23:33:37,697 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1214294.6666666667, ans=0.125 2023-10-12 23:33:57,666 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1214341.3333333333, ans=0.125 2023-10-12 23:34:03,119 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.434e+02 1.690e+02 1.910e+02 2.143e+02 2.694e+02, threshold=3.819e+02, percent-clipped=0.0 2023-10-12 23:34:10,016 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=1214388.0, ans=0.5 2023-10-12 23:34:23,985 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1214434.6666666667, ans=0.125 2023-10-12 23:34:31,769 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1214481.3333333333, ans=0.125 2023-10-12 23:34:47,153 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.80 vs. limit=12.0 2023-10-12 23:34:51,764 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1214574.6666666667, ans=15.0 2023-10-12 23:35:00,984 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1214574.6666666667, ans=0.0 2023-10-12 23:35:26,313 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.54 vs. limit=12.0 2023-10-12 23:35:26,335 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.20 vs. limit=15.0 2023-10-12 23:35:31,244 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1214668.0, ans=0.0 2023-10-12 23:35:51,418 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1214761.3333333333, ans=0.125 2023-10-12 23:36:07,994 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1214808.0, ans=0.125 2023-10-12 23:36:12,433 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.502e+02 1.781e+02 2.034e+02 2.321e+02 3.164e+02, threshold=4.069e+02, percent-clipped=0.0 2023-10-12 23:36:15,907 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.07 vs. limit=15.0 2023-10-12 23:36:15,977 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.whiten.whitening_limit, batch_count=1214854.6666666667, ans=12.0 2023-10-12 23:36:15,977 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.22 vs. 
limit=12.0 2023-10-12 23:36:17,894 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1214854.6666666667, ans=0.125 2023-10-12 23:37:14,735 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1215088.0, ans=0.125 2023-10-12 23:37:45,491 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.39 vs. limit=15.0 2023-10-12 23:37:51,478 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.32 vs. limit=15.0 2023-10-12 23:38:08,493 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.52 vs. limit=15.0 2023-10-12 23:38:18,147 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.424e+02 1.729e+02 1.930e+02 2.077e+02 2.899e+02, threshold=3.860e+02, percent-clipped=0.0 2023-10-12 23:38:18,582 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1215321.3333333333, ans=0.1 2023-10-12 23:38:29,090 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1215368.0, ans=0.0 2023-10-12 23:38:34,355 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1215368.0, ans=0.125 2023-10-12 23:38:39,492 INFO [train.py:1031] (0/4) Epoch 20, batch 1000, loss[loss=0.1888, simple_loss=0.2761, pruned_loss=0.05081, over 16543.00 frames. ], tot_loss[loss=0.1926, simple_loss=0.2837, pruned_loss=0.0507, over 12926137.27 frames. ], batch size: 50, lr: 1.75e-03, grad_scale: 16.0 2023-10-12 23:38:49,937 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1215461.3333333333, ans=0.0 2023-10-12 23:38:49,981 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1215461.3333333333, ans=0.125 2023-10-12 23:38:51,901 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1215461.3333333333, ans=0.2 2023-10-12 23:38:54,500 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1215461.3333333333, ans=0.1 2023-10-12 23:39:05,192 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.78 vs. limit=15.0 2023-10-12 23:39:14,635 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1215554.6666666667, ans=0.0 2023-10-12 23:39:30,676 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1215601.3333333333, ans=0.05 2023-10-12 23:40:09,979 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.32 vs. 
limit=15.0 2023-10-12 23:40:14,575 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.766e+02 1.904e+02 2.150e+02 3.324e+02, threshold=3.808e+02, percent-clipped=0.0 2023-10-12 23:40:33,708 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1215881.3333333333, ans=0.95 2023-10-12 23:40:41,965 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1215881.3333333333, ans=0.125 2023-10-12 23:40:56,331 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.40 vs. limit=15.0 2023-10-12 23:40:58,752 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-12 23:40:59,488 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1215928.0, ans=0.125 2023-10-12 23:41:22,282 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1216021.3333333333, ans=0.125 2023-10-12 23:42:14,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1216208.0, ans=0.2 2023-10-12 23:42:24,138 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.515e+02 1.693e+02 1.838e+02 2.115e+02 3.092e+02, threshold=3.676e+02, percent-clipped=0.0 2023-10-12 23:42:42,012 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.02 vs. limit=15.0 2023-10-12 23:42:52,214 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1216348.0, ans=0.2 2023-10-12 23:42:54,083 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1216348.0, ans=0.125 2023-10-12 23:43:20,872 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 23:43:21,796 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1216441.3333333333, ans=0.125 2023-10-12 23:43:53,273 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.38 vs. 
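
The scaling.py:1069 WithLoss entries attach an auxiliary loss to a layer's attention weights and report its sum (0.000e+00 in the entries here, i.e. the penalty is currently inactive). The usual trick is an autograd function that is the identity in the forward pass and routes a gradient of 1 to the auxiliary term in the backward pass, so the penalty joins the training objective without altering activations. A sketch of that pattern; icefall's actual implementation may differ in detail:

import torch

class WithLoss(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x: torch.Tensor, aux_loss: torch.Tensor):
        ctx.aux_shape = aux_loss.shape
        return x  # activations pass through unchanged

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor):
        # A gradient of 1 w.r.t. aux_loss effectively adds the
        # auxiliary term to whatever objective produced grad_output.
        return grad_output, torch.ones(ctx.aux_shape,
                                       device=grad_output.device,
                                       dtype=grad_output.dtype)

# Hypothetical usage: attn = WithLoss.apply(attn, penalty(attn))
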
limit=15.0 2023-10-12 23:44:24,713 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.406e+02 1.652e+02 1.826e+02 2.090e+02 2.868e+02, threshold=3.652e+02, percent-clipped=0.0 2023-10-12 23:44:39,179 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1216768.0, ans=0.125 2023-10-12 23:44:41,816 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1216768.0, ans=0.0 2023-10-12 23:44:49,529 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1216814.6666666667, ans=0.1 2023-10-12 23:45:14,279 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1216908.0, ans=0.125 2023-10-12 23:45:21,928 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1216954.6666666667, ans=0.125 2023-10-12 23:45:38,431 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1217048.0, ans=0.05 2023-10-12 23:45:45,783 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1217048.0, ans=0.0 2023-10-12 23:45:48,047 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.13 vs. limit=22.5 2023-10-12 23:45:50,940 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1217094.6666666667, ans=0.0 2023-10-12 23:46:05,109 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1217141.3333333333, ans=0.125 2023-10-12 23:46:15,789 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.376e+02 1.690e+02 1.904e+02 2.129e+02 3.001e+02, threshold=3.809e+02, percent-clipped=0.0 2023-10-12 23:46:28,251 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1217234.6666666667, ans=0.2 2023-10-12 23:46:31,030 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1217234.6666666667, ans=0.2 2023-10-12 23:46:32,936 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1217234.6666666667, ans=0.125 2023-10-12 23:46:59,169 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1217328.0, ans=0.125 2023-10-12 23:47:17,436 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1217421.3333333333, ans=0.125 2023-10-12 23:47:26,015 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 23:47:42,210 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1217514.6666666667, ans=0.1 2023-10-12 23:47:43,243 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1217514.6666666667, ans=0.0 2023-10-12 23:47:56,860 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1217561.3333333333, ans=0.125 2023-10-12 23:48:00,194 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.63 vs. limit=15.0 2023-10-12 23:48:14,523 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1217608.0, ans=0.0 2023-10-12 23:48:25,557 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.753e+02 1.975e+02 2.148e+02 3.605e+02, threshold=3.949e+02, percent-clipped=0.0 2023-10-12 23:48:48,883 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.90 vs. limit=12.0 2023-10-12 23:48:49,167 INFO [train.py:1031] (0/4) Epoch 20, batch 1500, loss[loss=0.1701, simple_loss=0.2727, pruned_loss=0.0338, over 16951.00 frames. ], tot_loss[loss=0.1902, simple_loss=0.2815, pruned_loss=0.04941, over 17328720.05 frames. ], batch size: 93, lr: 1.75e-03, grad_scale: 32.0 2023-10-12 23:49:00,109 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1217794.6666666667, ans=0.1 2023-10-12 23:49:01,007 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1217794.6666666667, ans=0.07 2023-10-12 23:49:13,645 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1217841.3333333333, ans=0.125 2023-10-12 23:49:24,415 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1217888.0, ans=0.125 2023-10-12 23:49:28,732 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1217888.0, ans=0.125 2023-10-12 23:50:04,757 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1218028.0, ans=0.125 2023-10-12 23:50:19,708 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1218074.6666666667, ans=0.05 2023-10-12 23:50:19,835 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.25 vs. limit=15.0 2023-10-12 23:50:33,356 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.785e+02 1.961e+02 2.168e+02 3.154e+02, threshold=3.922e+02, percent-clipped=0.0 2023-10-12 23:50:34,278 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.17 vs. 
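
The grad_scale field in the batch summaries (32.0 in the Epoch 20, batch 1500 entry above, 16.0 elsewhere in this epoch) is the dynamic loss scale of mixed-precision training: it grows while steps succeed and halves when inf/nan gradients force a skipped step. A generic torch.cuda.amp step showing where that number comes from, assuming a standard GradScaler setup rather than icefall's exact optimizer wrapper:

import torch

scaler = torch.cuda.amp.GradScaler()

def amp_step(model, batch, optimizer, compute_loss):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = compute_loss(model, batch)
    scaler.scale(loss).backward()  # backward on the scaled loss
    scaler.step(optimizer)         # unscales; skips the step on inf/nan grads
    scaler.update()                # adjusts the scale, e.g. 32.0 -> 16.0
    return loss.detach()
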
limit=15.0 2023-10-12 23:50:58,796 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-12 23:51:02,000 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1218214.6666666667, ans=0.125 2023-10-12 23:51:04,910 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1218214.6666666667, ans=0.125 2023-10-12 23:51:12,285 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1218261.3333333333, ans=0.125 2023-10-12 23:51:45,589 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1218354.6666666667, ans=0.0 2023-10-12 23:51:48,834 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=1218401.3333333333, ans=0.5 2023-10-12 23:52:29,431 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1218541.3333333333, ans=0.0 2023-10-12 23:52:38,007 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1218588.0, ans=0.0 2023-10-12 23:52:43,677 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.380e+02 1.720e+02 1.856e+02 2.026e+02 3.023e+02, threshold=3.712e+02, percent-clipped=0.0 2023-10-12 23:52:51,138 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1218634.6666666667, ans=0.0 2023-10-12 23:52:57,990 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1218634.6666666667, ans=0.1 2023-10-12 23:53:07,128 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1218681.3333333333, ans=0.125 2023-10-12 23:53:11,039 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1218681.3333333333, ans=0.1 2023-10-12 23:53:33,228 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.08 vs. limit=10.0 2023-10-12 23:53:53,060 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1218868.0, ans=0.1 2023-10-12 23:53:58,103 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.23 vs. 
limit=12.0 2023-10-12 23:53:58,784 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1218868.0, ans=0.0 2023-10-12 23:53:59,833 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1218914.6666666667, ans=0.1 2023-10-12 23:54:01,141 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1218914.6666666667, ans=0.0 2023-10-12 23:54:05,172 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1218914.6666666667, ans=0.125 2023-10-12 23:54:14,251 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1218961.3333333333, ans=0.04949747468305833 2023-10-12 23:54:19,084 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1218961.3333333333, ans=0.125 2023-10-12 23:54:21,109 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1218961.3333333333, ans=0.125 2023-10-12 23:54:24,457 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1219008.0, ans=0.125 2023-10-12 23:54:44,915 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1219054.6666666667, ans=0.125 2023-10-12 23:54:45,578 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 1.785e+02 1.968e+02 2.240e+02 2.942e+02, threshold=3.936e+02, percent-clipped=0.0 2023-10-12 23:54:56,887 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1219101.3333333333, ans=0.05 2023-10-12 23:55:17,144 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.32 vs. 
limit=15.0 2023-10-12 23:55:43,277 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1219288.0, ans=0.125 2023-10-12 23:55:53,168 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1219288.0, ans=0.0 2023-10-12 23:55:56,104 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1219334.6666666667, ans=0.125 2023-10-12 23:56:07,153 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1219381.3333333333, ans=0.0 2023-10-12 23:56:08,246 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1219381.3333333333, ans=0.1 2023-10-12 23:56:26,632 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1219428.0, ans=0.125 2023-10-12 23:56:34,836 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1219474.6666666667, ans=0.0 2023-10-12 23:56:35,685 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1219474.6666666667, ans=0.2 2023-10-12 23:56:50,188 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1219521.3333333333, ans=0.0 2023-10-12 23:56:50,881 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.434e+02 1.770e+02 1.967e+02 2.269e+02 2.840e+02, threshold=3.935e+02, percent-clipped=0.0 2023-10-12 23:57:01,496 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1219568.0, ans=0.125 2023-10-12 23:57:02,439 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1219568.0, ans=0.0 2023-10-12 23:57:10,823 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-12 23:57:15,926 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1219614.6666666667, ans=0.07 2023-10-12 23:57:16,855 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1219614.6666666667, ans=0.0 2023-10-12 23:57:34,718 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1219708.0, ans=0.1 2023-10-12 23:57:46,844 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.80 vs. 
limit=6.0 2023-10-12 23:57:54,474 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1219754.6666666667, ans=0.1 2023-10-12 23:57:54,547 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1219754.6666666667, ans=0.125 2023-10-12 23:58:31,551 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1219848.0, ans=0.1 2023-10-12 23:58:35,764 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1219894.6666666667, ans=0.0 2023-10-12 23:58:38,208 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1219894.6666666667, ans=0.125 2023-10-12 23:58:49,586 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1219941.3333333333, ans=0.1 2023-10-12 23:58:52,823 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1219941.3333333333, ans=0.125 2023-10-12 23:59:05,604 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.455e+02 1.765e+02 1.933e+02 2.173e+02 2.941e+02, threshold=3.865e+02, percent-clipped=0.0 2023-10-12 23:59:20,820 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.15 vs. limit=15.0 2023-10-12 23:59:26,506 INFO [train.py:1031] (0/4) Epoch 20, batch 2000, loss[loss=0.1946, simple_loss=0.2895, pruned_loss=0.04979, over 16998.00 frames. ], tot_loss[loss=0.1907, simple_loss=0.2822, pruned_loss=0.04957, over 20773297.38 frames. ], batch size: 82, lr: 1.75e-03, grad_scale: 16.0 2023-10-12 23:59:26,944 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=1220081.3333333333, ans=0.02 2023-10-12 23:59:38,596 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.39 vs. limit=15.0 2023-10-12 23:59:40,757 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1220128.0, ans=0.0 2023-10-12 23:59:58,672 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.99 vs. limit=15.0 2023-10-13 00:00:05,819 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1220221.3333333333, ans=0.125 2023-10-13 00:00:17,628 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1220221.3333333333, ans=0.125 2023-10-13 00:00:33,121 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1220314.6666666667, ans=0.125 2023-10-13 00:00:37,537 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1220314.6666666667, ans=0.125 2023-10-13 00:00:41,160 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.84 vs. 
limit=15.0 2023-10-13 00:00:41,644 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1220314.6666666667, ans=0.0 2023-10-13 00:01:17,215 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.365e+02 1.682e+02 1.903e+02 2.210e+02 2.934e+02, threshold=3.805e+02, percent-clipped=0.0 2023-10-13 00:01:29,503 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 00:01:37,107 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1220548.0, ans=0.125 2023-10-13 00:01:59,762 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.38 vs. limit=15.0 2023-10-13 00:02:01,814 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1220594.6666666667, ans=0.125 2023-10-13 00:02:23,672 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.81 vs. limit=22.5 2023-10-13 00:02:37,478 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=9.51 vs. limit=22.5 2023-10-13 00:03:12,114 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1220734.6666666667, ans=0.125 2023-10-13 00:03:34,046 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1220828.0, ans=0.95 2023-10-13 00:04:07,024 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.800e+02 1.996e+02 2.161e+02 3.768e+02, threshold=3.992e+02, percent-clipped=0.0 2023-10-13 00:04:22,697 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=4.93 vs. 
limit=15.0 2023-10-13 00:04:24,902 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 00:04:31,314 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1221014.6666666667, ans=0.125 2023-10-13 00:04:44,504 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1221061.3333333333, ans=0.0 2023-10-13 00:04:52,136 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1221108.0, ans=0.2 2023-10-13 00:05:19,310 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1221201.3333333333, ans=0.2 2023-10-13 00:05:38,460 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1221294.6666666667, ans=0.125 2023-10-13 00:05:40,501 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1221294.6666666667, ans=0.0 2023-10-13 00:05:41,679 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1221294.6666666667, ans=0.0 2023-10-13 00:05:42,657 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1221294.6666666667, ans=0.125 2023-10-13 00:05:48,085 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1221341.3333333333, ans=0.0 2023-10-13 00:06:04,734 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 1.864e+02 2.050e+02 2.346e+02 4.191e+02, threshold=4.100e+02, percent-clipped=1.0 2023-10-13 00:06:05,061 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1221388.0, ans=0.125 2023-10-13 00:06:29,734 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.57 vs. limit=15.0 2023-10-13 00:06:41,923 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1221528.0, ans=0.0 2023-10-13 00:06:42,314 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.15 vs. limit=12.0 2023-10-13 00:07:16,529 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1221668.0, ans=0.1 2023-10-13 00:07:16,715 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1221668.0, ans=0.125 2023-10-13 00:07:34,921 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1221714.6666666667, ans=0.125 2023-10-13 00:07:40,917 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1221761.3333333333, ans=0.125 2023-10-13 00:07:53,726 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.13 vs. 
limit=15.0 2023-10-13 00:07:58,854 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.50 vs. limit=22.5 2023-10-13 00:08:08,374 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.797e+02 1.937e+02 2.103e+02 2.617e+02, threshold=3.874e+02, percent-clipped=0.0 2023-10-13 00:08:10,397 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1221854.6666666667, ans=0.07 2023-10-13 00:08:21,923 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1221901.3333333333, ans=0.0 2023-10-13 00:08:23,882 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1221948.0, ans=0.0 2023-10-13 00:08:27,809 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.92 vs. limit=15.0 2023-10-13 00:08:32,712 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1221948.0, ans=0.1 2023-10-13 00:08:38,197 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1221994.6666666667, ans=0.125 2023-10-13 00:09:37,988 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.32 vs. limit=15.0 2023-10-13 00:09:42,396 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1222274.6666666667, ans=0.0 2023-10-13 00:09:57,749 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1222321.3333333333, ans=0.125 2023-10-13 00:10:00,838 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1222321.3333333333, ans=0.2 2023-10-13 00:10:01,433 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.721e+02 1.949e+02 2.146e+02 2.819e+02, threshold=3.897e+02, percent-clipped=0.0 2023-10-13 00:10:01,775 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1222321.3333333333, ans=0.0 2023-10-13 00:10:12,509 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1222368.0, ans=0.125 2023-10-13 00:10:15,635 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1222368.0, ans=0.0 2023-10-13 00:10:17,634 INFO [train.py:1031] (0/4) Epoch 20, batch 2500, loss[loss=0.198, simple_loss=0.3005, pruned_loss=0.04774, over 16912.00 frames. ], tot_loss[loss=0.191, simple_loss=0.2826, pruned_loss=0.0497, over 23452816.99 frames. 
], batch size: 104, lr: 1.75e-03, grad_scale: 16.0 2023-10-13 00:10:30,827 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1222461.3333333333, ans=0.125 2023-10-13 00:10:35,586 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1222461.3333333333, ans=0.0 2023-10-13 00:11:04,414 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.68 vs. limit=15.0 2023-10-13 00:11:06,871 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1222601.3333333333, ans=0.2 2023-10-13 00:11:10,676 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1222601.3333333333, ans=0.0 2023-10-13 00:11:17,184 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1222648.0, ans=0.0 2023-10-13 00:11:35,993 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1222694.6666666667, ans=0.1 2023-10-13 00:11:37,348 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.34 vs. limit=22.5 2023-10-13 00:11:58,740 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.521e+02 1.816e+02 2.001e+02 2.170e+02 3.087e+02, threshold=4.003e+02, percent-clipped=0.0 2023-10-13 00:12:09,848 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1222834.6666666667, ans=0.2 2023-10-13 00:12:15,001 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1222881.3333333333, ans=0.125 2023-10-13 00:12:17,791 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1222881.3333333333, ans=0.0 2023-10-13 00:12:22,879 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1222881.3333333333, ans=0.125 2023-10-13 00:12:30,713 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 00:12:35,829 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1222928.0, ans=0.0 2023-10-13 00:12:41,629 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1222974.6666666667, ans=0.125 2023-10-13 00:12:49,677 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1222974.6666666667, ans=0.125 2023-10-13 00:12:55,269 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1223021.3333333333, ans=0.125 2023-10-13 00:13:07,042 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.92 vs. limit=12.0 2023-10-13 00:13:30,421 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=5.71 vs. 
limit=15.0 2023-10-13 00:13:46,097 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.87 vs. limit=15.0 2023-10-13 00:13:57,857 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.784e+02 1.942e+02 2.299e+02 3.428e+02, threshold=3.883e+02, percent-clipped=0.0 2023-10-13 00:14:27,447 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.58 vs. limit=15.0 2023-10-13 00:14:39,413 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1223394.6666666667, ans=0.0 2023-10-13 00:14:40,851 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1223441.3333333333, ans=0.125 2023-10-13 00:14:47,216 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1223441.3333333333, ans=0.0 2023-10-13 00:14:50,497 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.42 vs. limit=15.0 2023-10-13 00:15:48,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1223628.0, ans=0.2 2023-10-13 00:15:54,214 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.17 vs. limit=15.0 2023-10-13 00:15:57,253 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1223674.6666666667, ans=0.0 2023-10-13 00:16:00,268 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1223674.6666666667, ans=0.0 2023-10-13 00:16:06,730 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1223721.3333333333, ans=0.125 2023-10-13 00:16:09,871 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.721e+02 1.878e+02 2.077e+02 2.891e+02, threshold=3.757e+02, percent-clipped=0.0 2023-10-13 00:16:34,171 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1223814.6666666667, ans=0.125 2023-10-13 00:16:39,489 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1223861.3333333333, ans=0.125 2023-10-13 00:17:02,516 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.87 vs. 
limit=15.0 2023-10-13 00:17:09,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1223954.6666666667, ans=0.2 2023-10-13 00:17:42,578 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1224048.0, ans=0.125 2023-10-13 00:17:50,444 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1224048.0, ans=0.1 2023-10-13 00:17:52,811 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1224048.0, ans=0.2 2023-10-13 00:17:55,242 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1224094.6666666667, ans=0.2 2023-10-13 00:17:56,634 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1224094.6666666667, ans=0.125 2023-10-13 00:18:21,001 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1224141.3333333333, ans=0.0 2023-10-13 00:18:23,250 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1224141.3333333333, ans=0.125 2023-10-13 00:18:24,295 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1224188.0, ans=0.125 2023-10-13 00:18:32,922 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.468e+02 1.789e+02 1.988e+02 2.204e+02 2.957e+02, threshold=3.976e+02, percent-clipped=0.0 2023-10-13 00:18:39,686 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1224234.6666666667, ans=0.0 2023-10-13 00:19:20,836 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1224374.6666666667, ans=0.125 2023-10-13 00:19:41,414 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1224421.3333333333, ans=0.1 2023-10-13 00:19:48,059 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.56 vs. limit=22.5 2023-10-13 00:19:54,556 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.57 vs. limit=22.5 2023-10-13 00:19:58,670 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1224514.6666666667, ans=0.0 2023-10-13 00:19:59,640 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1224514.6666666667, ans=0.125 2023-10-13 00:20:13,130 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.35 vs. limit=15.0 2023-10-13 00:20:15,689 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1224561.3333333333, ans=0.09899494936611666 2023-10-13 00:20:16,076 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.59 vs. 
limit=6.0 2023-10-13 00:20:17,141 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.93 vs. limit=10.0 2023-10-13 00:20:20,842 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.42 vs. limit=22.5 2023-10-13 00:20:21,401 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1224561.3333333333, ans=0.125 2023-10-13 00:20:24,615 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1224608.0, ans=0.0 2023-10-13 00:20:25,701 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1224608.0, ans=0.0 2023-10-13 00:20:37,755 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1224654.6666666667, ans=0.0 2023-10-13 00:20:44,177 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.759e+02 1.874e+02 2.097e+02 2.796e+02, threshold=3.748e+02, percent-clipped=0.0 2023-10-13 00:20:47,552 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.59 vs. limit=15.0 2023-10-13 00:20:48,691 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1224701.3333333333, ans=0.0 2023-10-13 00:21:01,721 INFO [train.py:1031] (0/4) Epoch 20, batch 3000, loss[loss=0.1997, simple_loss=0.286, pruned_loss=0.05672, over 16065.00 frames. ], tot_loss[loss=0.1905, simple_loss=0.2817, pruned_loss=0.0496, over 25532368.81 frames. ], batch size: 43, lr: 1.74e-03, grad_scale: 16.0 2023-10-13 00:21:08,079 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1224748.0, ans=0.125 2023-10-13 00:22:17,924 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1225028.0, ans=0.125 2023-10-13 00:22:35,689 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1225074.6666666667, ans=0.125 2023-10-13 00:22:42,819 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.74 vs. 
limit=6.0 2023-10-13 00:22:48,859 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.569e+02 1.798e+02 1.998e+02 2.305e+02 3.052e+02, threshold=3.995e+02, percent-clipped=0.0 2023-10-13 00:22:49,317 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1225121.3333333333, ans=0.125 2023-10-13 00:22:49,331 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1225121.3333333333, ans=0.125 2023-10-13 00:23:38,215 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1225308.0, ans=0.2 2023-10-13 00:23:44,047 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1225354.6666666667, ans=0.1 2023-10-13 00:23:47,044 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1225354.6666666667, ans=0.1 2023-10-13 00:23:54,661 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1225354.6666666667, ans=0.2 2023-10-13 00:24:00,197 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_ff3.min_abs, batch_count=1225401.3333333333, ans=0.2 2023-10-13 00:24:02,498 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1225401.3333333333, ans=0.1 2023-10-13 00:24:05,369 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1225401.3333333333, ans=0.05 2023-10-13 00:24:20,288 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.10 vs. limit=12.0 2023-10-13 00:24:27,547 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1225494.6666666667, ans=0.125 2023-10-13 00:24:30,788 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1225541.3333333333, ans=0.125 2023-10-13 00:24:35,326 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.90 vs. 
limit=6.0 2023-10-13 00:24:45,482 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 00:24:45,546 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1225588.0, ans=0.125 2023-10-13 00:24:49,777 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.434e+02 1.745e+02 1.899e+02 2.145e+02 2.873e+02, threshold=3.797e+02, percent-clipped=0.0 2023-10-13 00:24:50,189 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1225588.0, ans=0.125 2023-10-13 00:24:50,976 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1225588.0, ans=0.125 2023-10-13 00:25:04,316 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1225681.3333333333, ans=0.1 2023-10-13 00:25:34,980 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1225774.6666666667, ans=0.2 2023-10-13 00:25:36,213 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1225774.6666666667, ans=0.125 2023-10-13 00:25:52,520 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1225821.3333333333, ans=0.0 2023-10-13 00:25:53,394 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1225821.3333333333, ans=0.0 2023-10-13 00:25:59,401 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1225868.0, ans=0.1 2023-10-13 00:26:25,991 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1225914.6666666667, ans=0.125 2023-10-13 00:26:34,888 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1225961.3333333333, ans=0.2 2023-10-13 00:26:37,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1225961.3333333333, ans=0.0 2023-10-13 00:27:04,859 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.585e+02 1.835e+02 1.998e+02 2.141e+02 3.029e+02, threshold=3.995e+02, percent-clipped=0.0 2023-10-13 00:27:23,704 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.97 vs. limit=15.0 2023-10-13 00:27:42,604 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1226194.6666666667, ans=0.0 2023-10-13 00:27:58,844 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.91 vs. limit=15.0 2023-10-13 00:28:15,831 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1226334.6666666667, ans=0.2 2023-10-13 00:28:18,020 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.76 vs. 
limit=6.0 2023-10-13 00:28:36,478 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1226428.0, ans=0.125 2023-10-13 00:28:42,982 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1226428.0, ans=0.2 2023-10-13 00:29:01,058 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1226521.3333333333, ans=0.125 2023-10-13 00:29:13,017 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.373e+02 1.772e+02 1.913e+02 2.123e+02 2.620e+02, threshold=3.827e+02, percent-clipped=0.0 2023-10-13 00:29:33,598 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.60 vs. limit=22.5 2023-10-13 00:29:56,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1226708.0, ans=0.1 2023-10-13 00:30:04,678 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1226754.6666666667, ans=0.2 2023-10-13 00:30:25,892 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1226848.0, ans=0.1 2023-10-13 00:30:34,997 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1226848.0, ans=0.2 2023-10-13 00:30:40,588 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 00:30:42,852 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1226894.6666666667, ans=0.125 2023-10-13 00:30:48,319 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1226894.6666666667, ans=0.2 2023-10-13 00:31:14,617 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.541e+02 1.816e+02 1.966e+02 2.170e+02 2.740e+02, threshold=3.932e+02, percent-clipped=0.0 2023-10-13 00:31:29,543 INFO [train.py:1031] (0/4) Epoch 20, batch 3500, loss[loss=0.1907, simple_loss=0.2902, pruned_loss=0.04562, over 16939.00 frames. ], tot_loss[loss=0.1905, simple_loss=0.2817, pruned_loss=0.04963, over 27155695.64 frames. 
], batch size: 138, lr: 1.74e-03, grad_scale: 16.0 2023-10-13 00:31:48,744 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1227128.0, ans=0.2 2023-10-13 00:32:16,464 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1227221.3333333333, ans=0.0 2023-10-13 00:32:17,572 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1227221.3333333333, ans=0.1 2023-10-13 00:32:45,350 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1227314.6666666667, ans=0.0 2023-10-13 00:33:24,331 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1227454.6666666667, ans=0.125 2023-10-13 00:33:32,156 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 00:33:34,844 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.819e+02 2.018e+02 2.440e+02 3.265e+02, threshold=4.037e+02, percent-clipped=0.0 2023-10-13 00:33:53,024 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1227548.0, ans=0.125 2023-10-13 00:34:13,732 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1227594.6666666667, ans=0.1 2023-10-13 00:34:15,802 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1227641.3333333333, ans=0.125 2023-10-13 00:34:15,864 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 00:34:27,259 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1227641.3333333333, ans=0.125 2023-10-13 00:34:27,697 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.34 vs. 
limit=15.0 2023-10-13 00:34:41,378 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1227688.0, ans=0.125 2023-10-13 00:35:01,025 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1227781.3333333333, ans=0.1 2023-10-13 00:35:11,105 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1227828.0, ans=0.07 2023-10-13 00:35:15,821 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1227828.0, ans=0.125 2023-10-13 00:35:24,070 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1227874.6666666667, ans=0.125 2023-10-13 00:35:25,025 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1227874.6666666667, ans=0.0 2023-10-13 00:35:47,984 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.424e+02 1.726e+02 1.854e+02 2.049e+02 3.424e+02, threshold=3.708e+02, percent-clipped=0.0 2023-10-13 00:35:57,127 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1227968.0, ans=0.0 2023-10-13 00:36:26,814 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 00:36:31,363 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.84 vs. limit=10.0 2023-10-13 00:36:43,452 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1228108.0, ans=0.2 2023-10-13 00:37:16,751 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1228248.0, ans=0.1 2023-10-13 00:37:54,068 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1228388.0, ans=0.125 2023-10-13 00:38:06,818 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.680e+02 1.875e+02 2.042e+02 2.976e+02, threshold=3.751e+02, percent-clipped=0.0 2023-10-13 00:38:07,289 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 00:38:20,918 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1228481.3333333333, ans=0.2 2023-10-13 00:38:21,852 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1228481.3333333333, ans=0.0 2023-10-13 00:38:22,281 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.16 vs. limit=22.5 2023-10-13 00:38:25,564 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1228481.3333333333, ans=0.0 2023-10-13 00:38:30,842 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.64 vs. 
limit=15.0 2023-10-13 00:38:49,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1228574.6666666667, ans=0.125 2023-10-13 00:38:51,724 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1228574.6666666667, ans=0.125 2023-10-13 00:38:56,570 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1228621.3333333333, ans=0.125 2023-10-13 00:38:59,796 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.99 vs. limit=15.0 2023-10-13 00:39:03,521 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.95 vs. limit=15.0 2023-10-13 00:39:12,090 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1228668.0, ans=0.125 2023-10-13 00:39:34,647 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1228761.3333333333, ans=0.2 2023-10-13 00:39:55,043 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.72 vs. limit=22.5 2023-10-13 00:40:08,993 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 1.717e+02 1.889e+02 2.084e+02 3.098e+02, threshold=3.779e+02, percent-clipped=0.0 2023-10-13 00:40:15,595 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1228901.3333333333, ans=0.0 2023-10-13 00:40:17,211 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.23 vs. limit=15.0 2023-10-13 00:40:41,678 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.88 vs. limit=15.0 2023-10-13 00:40:46,525 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1229041.3333333333, ans=0.07 2023-10-13 00:40:48,286 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1229041.3333333333, ans=0.04949747468305833 2023-10-13 00:40:50,685 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1229041.3333333333, ans=0.125 2023-10-13 00:41:20,524 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.05 vs. limit=12.0 2023-10-13 00:41:51,736 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1229274.6666666667, ans=0.125 2023-10-13 00:42:13,296 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.357e+02 1.746e+02 1.899e+02 2.168e+02 2.940e+02, threshold=3.798e+02, percent-clipped=0.0 2023-10-13 00:42:27,167 INFO [train.py:1031] (0/4) Epoch 20, batch 4000, loss[loss=0.2004, simple_loss=0.2952, pruned_loss=0.05284, over 16812.00 frames. ], tot_loss[loss=0.1904, simple_loss=0.2813, pruned_loss=0.04976, over 28396453.41 frames. 
], batch size: 175, lr: 1.74e-03, grad_scale: 32.0 2023-10-13 00:42:32,423 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1229414.6666666667, ans=0.1 2023-10-13 00:42:38,806 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1229414.6666666667, ans=0.125 2023-10-13 00:43:04,180 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1229508.0, ans=15.0 2023-10-13 00:43:12,217 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1229554.6666666667, ans=0.125 2023-10-13 00:43:45,720 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1229694.6666666667, ans=0.125 2023-10-13 00:43:47,134 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.78 vs. limit=22.5 2023-10-13 00:43:47,882 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1229694.6666666667, ans=0.0 2023-10-13 00:43:58,387 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1229741.3333333333, ans=0.1 2023-10-13 00:44:01,694 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1229741.3333333333, ans=0.1 2023-10-13 00:44:08,497 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1229788.0, ans=0.1 2023-10-13 00:44:14,016 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1229788.0, ans=0.125 2023-10-13 00:44:14,743 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.510e+02 1.746e+02 1.878e+02 2.082e+02 2.617e+02, threshold=3.755e+02, percent-clipped=0.0 2023-10-13 00:44:24,216 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1229834.6666666667, ans=0.0 2023-10-13 00:44:30,874 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.35 vs. limit=22.5 2023-10-13 00:44:35,097 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.03 vs. limit=15.0 2023-10-13 00:44:46,970 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1229928.0, ans=0.125 2023-10-13 00:45:04,709 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1230021.3333333333, ans=0.125 2023-10-13 00:45:21,049 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.32 vs. 
limit=15.0 2023-10-13 00:45:37,146 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1230161.3333333333, ans=0.1 2023-10-13 00:45:40,658 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1230161.3333333333, ans=0.0 2023-10-13 00:46:19,437 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1230254.6666666667, ans=0.2 2023-10-13 00:46:25,308 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1230254.6666666667, ans=0.2 2023-10-13 00:46:25,873 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 1.800e+02 2.017e+02 2.267e+02 3.667e+02, threshold=4.034e+02, percent-clipped=0.0 2023-10-13 00:46:29,160 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1230301.3333333333, ans=0.0 2023-10-13 00:46:34,760 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1230301.3333333333, ans=0.0 2023-10-13 00:46:34,840 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.27 vs. limit=12.0 2023-10-13 00:46:39,500 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.80 vs. limit=12.0 2023-10-13 00:46:50,059 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1230394.6666666667, ans=0.0 2023-10-13 00:47:12,580 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1230441.3333333333, ans=0.1 2023-10-13 00:47:17,906 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1230488.0, ans=0.125 2023-10-13 00:47:21,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1230488.0, ans=0.125 2023-10-13 00:47:35,809 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1230534.6666666667, ans=0.125 2023-10-13 00:47:48,303 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1230581.3333333333, ans=0.07 2023-10-13 00:48:08,107 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.89 vs. 
limit=22.5 2023-10-13 00:48:10,794 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1230674.6666666667, ans=0.125 2023-10-13 00:48:19,909 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1230721.3333333333, ans=0.125 2023-10-13 00:48:25,629 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.770e+02 1.937e+02 2.264e+02 3.243e+02, threshold=3.873e+02, percent-clipped=0.0 2023-10-13 00:48:31,552 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1230768.0, ans=0.125 2023-10-13 00:48:34,298 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1230768.0, ans=0.1 2023-10-13 00:48:41,651 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1230814.6666666667, ans=0.125 2023-10-13 00:48:45,553 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 00:48:54,030 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1230861.3333333333, ans=0.2 2023-10-13 00:48:57,693 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1230861.3333333333, ans=0.125 2023-10-13 00:49:04,263 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1230908.0, ans=0.1 2023-10-13 00:49:06,214 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1230908.0, ans=0.0 2023-10-13 00:49:45,214 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1231048.0, ans=0.125 2023-10-13 00:49:45,223 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1231048.0, ans=0.125 2023-10-13 00:49:46,214 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1231048.0, ans=0.125 2023-10-13 00:49:46,225 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1231048.0, ans=0.0 2023-10-13 00:49:54,312 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.55 vs. 
limit=6.0 2023-10-13 00:49:55,363 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1231094.6666666667, ans=0.0 2023-10-13 00:50:16,330 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1231188.0, ans=0.125 2023-10-13 00:50:27,481 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.537e+02 1.864e+02 2.027e+02 2.305e+02 3.344e+02, threshold=4.055e+02, percent-clipped=0.0 2023-10-13 00:51:21,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1231374.6666666667, ans=0.2 2023-10-13 00:51:39,319 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1231421.3333333333, ans=0.125 2023-10-13 00:51:51,230 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=1231468.0, ans=0.1 2023-10-13 00:52:04,873 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1231514.6666666667, ans=0.125 2023-10-13 00:52:11,651 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1231514.6666666667, ans=0.0 2023-10-13 00:52:22,505 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1231561.3333333333, ans=0.125 2023-10-13 00:53:03,368 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.432e+02 1.789e+02 1.973e+02 2.179e+02 3.350e+02, threshold=3.945e+02, percent-clipped=0.0 2023-10-13 00:53:04,315 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=1231701.3333333333, ans=0.1 2023-10-13 00:53:06,294 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.44 vs. limit=22.5 2023-10-13 00:53:09,460 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1231701.3333333333, ans=0.125 2023-10-13 00:53:16,870 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1231748.0, ans=0.04949747468305833 2023-10-13 00:53:17,689 INFO [train.py:1031] (0/4) Epoch 20, batch 4500, loss[loss=0.1784, simple_loss=0.2749, pruned_loss=0.04094, over 16866.00 frames. ], tot_loss[loss=0.1905, simple_loss=0.2817, pruned_loss=0.04963, over 29361998.95 frames. ], batch size: 155, lr: 1.74e-03, grad_scale: 32.0 2023-10-13 00:53:17,890 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1231748.0, ans=0.125 2023-10-13 00:53:38,646 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.94 vs. 
limit=6.0 2023-10-13 00:54:03,461 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1231888.0, ans=0.1 2023-10-13 00:54:09,179 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1231888.0, ans=0.125 2023-10-13 00:54:27,144 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-264000.pt 2023-10-13 00:54:42,649 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1232028.0, ans=0.125 2023-10-13 00:54:45,263 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1232028.0, ans=0.0 2023-10-13 00:54:49,431 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.04 vs. limit=15.0 2023-10-13 00:54:58,147 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1232074.6666666667, ans=0.0 2023-10-13 00:55:01,471 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1232121.3333333333, ans=0.125 2023-10-13 00:55:04,206 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1232121.3333333333, ans=0.1 2023-10-13 00:55:11,554 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.19 vs. limit=22.5 2023-10-13 00:55:13,130 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.807e+02 1.952e+02 2.319e+02 3.341e+02, threshold=3.905e+02, percent-clipped=0.0 2023-10-13 00:55:20,485 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1232168.0, ans=0.1 2023-10-13 00:55:25,485 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1232214.6666666667, ans=0.125 2023-10-13 00:55:27,482 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1232214.6666666667, ans=0.0 2023-10-13 00:55:29,364 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1232214.6666666667, ans=0.2 2023-10-13 00:55:30,200 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1232214.6666666667, ans=0.125 2023-10-13 00:55:39,607 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.99 vs. limit=22.5 2023-10-13 00:55:43,560 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1232261.3333333333, ans=0.1 2023-10-13 00:55:51,328 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1232308.0, ans=0.0 2023-10-13 00:55:52,775 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.01 vs. 
limit=15.0 2023-10-13 00:55:58,701 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1232354.6666666667, ans=0.1 2023-10-13 00:56:08,583 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.25 vs. limit=15.0 2023-10-13 00:56:29,064 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0 2023-10-13 00:56:41,339 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1232541.3333333333, ans=0.125 2023-10-13 00:56:47,182 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1232541.3333333333, ans=0.0 2023-10-13 00:56:47,301 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1232541.3333333333, ans=0.0 2023-10-13 00:57:04,055 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1232588.0, ans=0.125 2023-10-13 00:57:04,287 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.21 vs. limit=10.0 2023-10-13 00:57:06,624 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.499e+02 1.797e+02 1.996e+02 2.225e+02 3.605e+02, threshold=3.992e+02, percent-clipped=0.0 2023-10-13 00:57:28,630 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.81 vs. limit=15.0 2023-10-13 00:57:35,276 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1232681.3333333333, ans=0.125 2023-10-13 00:57:36,525 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1232681.3333333333, ans=0.125 2023-10-13 00:58:01,422 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1232774.6666666667, ans=0.125 2023-10-13 00:58:05,511 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1232774.6666666667, ans=0.09899494936611666 2023-10-13 00:58:28,205 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.89 vs. limit=12.0 2023-10-13 00:58:41,507 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1232914.6666666667, ans=0.0 2023-10-13 00:58:50,233 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.35 vs. 
limit=15.0 2023-10-13 00:59:14,918 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1233054.6666666667, ans=0.0 2023-10-13 00:59:19,127 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.805e+02 1.954e+02 2.241e+02 3.244e+02, threshold=3.907e+02, percent-clipped=0.0 2023-10-13 00:59:28,622 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1233148.0, ans=0.1 2023-10-13 00:59:39,695 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1233148.0, ans=0.125 2023-10-13 00:59:54,056 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1233194.6666666667, ans=0.0 2023-10-13 01:00:14,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1233288.0, ans=0.125 2023-10-13 01:00:25,338 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.21 vs. limit=15.0 2023-10-13 01:00:47,811 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1233428.0, ans=0.0 2023-10-13 01:00:53,454 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1233428.0, ans=0.0 2023-10-13 01:01:10,131 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten.whitening_limit, batch_count=1233521.3333333333, ans=15.0 2023-10-13 01:01:17,161 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.47 vs. limit=15.0 2023-10-13 01:01:18,883 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1233568.0, ans=0.0 2023-10-13 01:01:19,565 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.455e+02 1.670e+02 1.834e+02 2.022e+02 2.686e+02, threshold=3.668e+02, percent-clipped=0.0 2023-10-13 01:01:41,526 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1233614.6666666667, ans=0.0 2023-10-13 01:01:48,587 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1233661.3333333333, ans=0.0 2023-10-13 01:01:49,018 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.09 vs. limit=15.0 2023-10-13 01:02:03,767 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1233708.0, ans=0.125 2023-10-13 01:02:34,731 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1233801.3333333333, ans=6.0 2023-10-13 01:02:49,606 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.82 vs. limit=12.0 2023-10-13 01:03:10,496 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.66 vs. 
limit=22.5 2023-10-13 01:03:11,378 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1233941.3333333333, ans=0.0 2023-10-13 01:03:14,570 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1233941.3333333333, ans=0.125 2023-10-13 01:03:26,094 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.25 vs. limit=15.0 2023-10-13 01:03:32,536 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.445e+02 1.760e+02 1.902e+02 2.132e+02 3.724e+02, threshold=3.804e+02, percent-clipped=1.0 2023-10-13 01:03:33,025 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1234034.6666666667, ans=0.125 2023-10-13 01:03:42,685 INFO [train.py:1031] (0/4) Epoch 20, batch 5000, loss[loss=0.1913, simple_loss=0.2837, pruned_loss=0.04946, over 16945.00 frames. ], tot_loss[loss=0.1903, simple_loss=0.2812, pruned_loss=0.04971, over 30095248.64 frames. ], batch size: 130, lr: 1.74e-03, grad_scale: 16.0 2023-10-13 01:03:53,531 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1234081.3333333333, ans=0.125 2023-10-13 01:04:06,810 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1234174.6666666667, ans=0.1 2023-10-13 01:04:21,317 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1234221.3333333333, ans=0.1 2023-10-13 01:04:25,289 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1234221.3333333333, ans=0.035 2023-10-13 01:04:33,137 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.31 vs. limit=15.0 2023-10-13 01:04:38,430 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1234268.0, ans=0.125 2023-10-13 01:04:44,302 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1234314.6666666667, ans=0.125 2023-10-13 01:04:48,825 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1234314.6666666667, ans=0.125 2023-10-13 01:05:00,602 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1234361.3333333333, ans=0.125 2023-10-13 01:05:02,676 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1234361.3333333333, ans=0.125 2023-10-13 01:05:18,632 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1234454.6666666667, ans=0.0 2023-10-13 01:05:32,518 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.08 vs. 
limit=12.0 2023-10-13 01:05:36,337 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.486e+02 1.826e+02 1.999e+02 2.240e+02 3.305e+02, threshold=3.997e+02, percent-clipped=0.0 2023-10-13 01:06:17,120 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1234641.3333333333, ans=0.0 2023-10-13 01:06:20,117 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.77 vs. limit=10.0 2023-10-13 01:06:28,756 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.13 vs. limit=15.0 2023-10-13 01:06:38,081 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1234734.6666666667, ans=0.125 2023-10-13 01:06:50,114 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1234781.3333333333, ans=0.125 2023-10-13 01:07:25,388 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1234921.3333333333, ans=0.0 2023-10-13 01:07:37,928 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.422e+02 1.733e+02 1.889e+02 2.151e+02 2.721e+02, threshold=3.779e+02, percent-clipped=0.0 2023-10-13 01:07:53,642 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1235014.6666666667, ans=0.1 2023-10-13 01:07:55,547 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1235014.6666666667, ans=0.0 2023-10-13 01:07:59,908 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1235061.3333333333, ans=0.0 2023-10-13 01:08:01,132 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1235061.3333333333, ans=0.0 2023-10-13 01:08:15,159 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1235108.0, ans=0.125 2023-10-13 01:08:21,827 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1235108.0, ans=0.125 2023-10-13 01:09:43,354 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1235388.0, ans=0.0 2023-10-13 01:09:45,501 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.01 vs. limit=15.0 2023-10-13 01:09:49,838 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.526e+02 1.816e+02 1.967e+02 2.221e+02 3.663e+02, threshold=3.934e+02, percent-clipped=0.0 2023-10-13 01:09:52,831 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.08 vs. 
limit=22.5 2023-10-13 01:10:16,492 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 01:10:20,777 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1235528.0, ans=0.0 2023-10-13 01:10:21,886 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1235528.0, ans=0.2 2023-10-13 01:10:25,422 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1235574.6666666667, ans=0.125 2023-10-13 01:10:31,698 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1235574.6666666667, ans=0.125 2023-10-13 01:11:10,719 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.86 vs. limit=22.5 2023-10-13 01:11:13,761 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.22 vs. limit=6.0 2023-10-13 01:11:33,048 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1235808.0, ans=0.125 2023-10-13 01:11:33,926 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1235808.0, ans=0.125 2023-10-13 01:11:40,522 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1235808.0, ans=0.1 2023-10-13 01:11:43,783 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1235854.6666666667, ans=0.1 2023-10-13 01:11:54,732 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1235901.3333333333, ans=0.0 2023-10-13 01:11:58,140 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.724e+02 1.890e+02 2.180e+02 3.031e+02, threshold=3.779e+02, percent-clipped=0.0 2023-10-13 01:12:25,866 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.38 vs. limit=15.0 2023-10-13 01:12:36,934 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.23 vs. limit=15.0 2023-10-13 01:12:42,215 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1236041.3333333333, ans=0.125 2023-10-13 01:13:09,673 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.78 vs. limit=15.0 2023-10-13 01:13:18,063 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1236228.0, ans=0.125 2023-10-13 01:13:19,352 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.49 vs. 
limit=15.0 2023-10-13 01:13:21,317 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1236228.0, ans=0.0 2023-10-13 01:13:32,174 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1236274.6666666667, ans=0.1 2023-10-13 01:13:53,136 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1236321.3333333333, ans=0.0 2023-10-13 01:13:57,847 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 1.768e+02 1.908e+02 2.135e+02 2.825e+02, threshold=3.816e+02, percent-clipped=0.0 2023-10-13 01:14:01,220 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1236368.0, ans=0.125 2023-10-13 01:14:04,714 INFO [train.py:1031] (0/4) Epoch 20, batch 5500, loss[loss=0.2312, simple_loss=0.3044, pruned_loss=0.07899, over 15710.00 frames. ], tot_loss[loss=0.1902, simple_loss=0.2813, pruned_loss=0.04953, over 30735903.33 frames. ], batch size: 350, lr: 1.74e-03, grad_scale: 8.0 2023-10-13 01:14:13,442 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1236414.6666666667, ans=0.1 2023-10-13 01:14:33,608 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1236508.0, ans=0.125 2023-10-13 01:14:49,914 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1236554.6666666667, ans=0.125 2023-10-13 01:14:53,571 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.44 vs. limit=6.0 2023-10-13 01:14:54,163 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1236601.3333333333, ans=0.125 2023-10-13 01:14:55,061 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1236601.3333333333, ans=0.1 2023-10-13 01:14:57,090 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1236601.3333333333, ans=0.125 2023-10-13 01:15:02,144 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=3.55 vs. 
limit=15.0 2023-10-13 01:15:51,080 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.472e+02 1.726e+02 1.926e+02 2.123e+02 3.044e+02, threshold=3.853e+02, percent-clipped=0.0 2023-10-13 01:16:00,385 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1236881.3333333333, ans=0.2 2023-10-13 01:16:02,077 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1236881.3333333333, ans=0.2 2023-10-13 01:16:18,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1236928.0, ans=0.0 2023-10-13 01:16:28,535 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1236974.6666666667, ans=0.1 2023-10-13 01:16:35,675 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.56 vs. limit=15.0 2023-10-13 01:16:52,039 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1237068.0, ans=0.125 2023-10-13 01:17:01,226 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1237114.6666666667, ans=0.0 2023-10-13 01:17:03,739 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1237114.6666666667, ans=15.0 2023-10-13 01:17:06,311 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1237114.6666666667, ans=0.2 2023-10-13 01:17:36,610 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1237254.6666666667, ans=0.125 2023-10-13 01:17:39,804 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1237254.6666666667, ans=0.125 2023-10-13 01:17:46,301 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1237301.3333333333, ans=0.125 2023-10-13 01:17:47,631 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.20 vs. limit=12.0 2023-10-13 01:17:51,753 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.558e+02 1.764e+02 2.022e+02 2.250e+02 3.316e+02, threshold=4.045e+02, percent-clipped=0.0 2023-10-13 01:17:53,864 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1237301.3333333333, ans=0.125 2023-10-13 01:18:18,611 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1237394.6666666667, ans=0.0 2023-10-13 01:18:26,480 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.23 vs. 
limit=15.0 2023-10-13 01:18:53,856 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1237534.6666666667, ans=0.04949747468305833 2023-10-13 01:19:30,640 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1237674.6666666667, ans=0.0 2023-10-13 01:19:37,685 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1237721.3333333333, ans=0.125 2023-10-13 01:19:43,454 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1237721.3333333333, ans=0.2 2023-10-13 01:19:53,276 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.552e+02 1.789e+02 1.973e+02 2.231e+02 4.908e+02, threshold=3.946e+02, percent-clipped=1.0 2023-10-13 01:19:58,779 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1237768.0, ans=0.0 2023-10-13 01:20:03,346 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1237814.6666666667, ans=0.2 2023-10-13 01:20:10,819 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=1237814.6666666667, ans=0.05 2023-10-13 01:20:21,905 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1237861.3333333333, ans=0.1 2023-10-13 01:20:29,408 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1237908.0, ans=0.125 2023-10-13 01:21:27,791 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.78 vs. 
limit=15.0 2023-10-13 01:21:35,830 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1238141.3333333333, ans=0.125 2023-10-13 01:21:54,941 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.486e+02 1.750e+02 1.904e+02 2.096e+02 3.025e+02, threshold=3.808e+02, percent-clipped=0.0 2023-10-13 01:21:58,455 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1238234.6666666667, ans=0.0 2023-10-13 01:22:12,778 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1238281.3333333333, ans=0.125 2023-10-13 01:22:12,852 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1238281.3333333333, ans=0.2 2023-10-13 01:22:32,731 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1238374.6666666667, ans=0.125 2023-10-13 01:22:37,713 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1238374.6666666667, ans=0.035 2023-10-13 01:22:39,103 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1238374.6666666667, ans=0.125 2023-10-13 01:22:40,591 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 01:22:43,588 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1238421.3333333333, ans=0.125 2023-10-13 01:22:50,284 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1238421.3333333333, ans=0.1 2023-10-13 01:23:03,589 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1238468.0, ans=0.2 2023-10-13 01:23:06,084 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1238514.6666666667, ans=0.125 2023-10-13 01:23:12,721 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1238514.6666666667, ans=0.0 2023-10-13 01:23:15,719 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1238561.3333333333, ans=0.125 2023-10-13 01:23:18,850 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1238561.3333333333, ans=0.125 2023-10-13 01:23:31,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1238608.0, ans=0.125 2023-10-13 01:23:38,111 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1238654.6666666667, ans=0.5 2023-10-13 01:23:55,588 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.467e+02 1.758e+02 1.940e+02 2.168e+02 2.907e+02, threshold=3.881e+02, percent-clipped=0.0 2023-10-13 01:23:55,934 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 01:23:56,951 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1238701.3333333333, ans=0.1 2023-10-13 01:24:02,802 INFO [train.py:1031] (0/4) Epoch 20, batch 6000, loss[loss=0.1919, simple_loss=0.2841, pruned_loss=0.04987, over 16679.00 frames. ], tot_loss[loss=0.1907, simple_loss=0.2816, pruned_loss=0.04991, over 31160449.43 frames. ], batch size: 202, lr: 1.73e-03, grad_scale: 16.0 2023-10-13 01:24:10,065 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.55 vs. limit=6.0 2023-10-13 01:24:25,291 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1238794.6666666667, ans=0.125 2023-10-13 01:24:26,871 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1238841.3333333333, ans=0.1 2023-10-13 01:24:29,240 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.62 vs. limit=15.0 2023-10-13 01:24:47,371 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1238888.0, ans=0.125 2023-10-13 01:25:04,457 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1238934.6666666667, ans=0.125 2023-10-13 01:25:17,563 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1239028.0, ans=0.125 2023-10-13 01:25:59,345 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.767e+02 1.932e+02 2.129e+02 3.413e+02, threshold=3.864e+02, percent-clipped=0.0 2023-10-13 01:26:06,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1239214.6666666667, ans=0.0 2023-10-13 01:26:07,765 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1239214.6666666667, ans=0.125 2023-10-13 01:26:08,680 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1239214.6666666667, ans=0.125 2023-10-13 01:26:19,089 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1239261.3333333333, ans=0.0 2023-10-13 01:26:31,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1239308.0, ans=0.0 2023-10-13 01:26:44,740 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1239354.6666666667, ans=0.2 2023-10-13 01:26:45,559 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1239354.6666666667, ans=0.025 2023-10-13 01:27:15,792 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1239448.0, ans=0.125 2023-10-13 01:27:21,813 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1239494.6666666667, ans=0.125 2023-10-13 01:27:29,630 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1239541.3333333333, ans=0.0 2023-10-13 
01:27:32,836 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.30 vs. limit=10.0 2023-10-13 01:28:01,954 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.623e+02 1.853e+02 2.011e+02 2.168e+02 2.642e+02, threshold=4.022e+02, percent-clipped=0.0 2023-10-13 01:28:06,703 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1239634.6666666667, ans=0.125 2023-10-13 01:28:34,395 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1239774.6666666667, ans=0.07 2023-10-13 01:28:55,471 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.73 vs. limit=15.0 2023-10-13 01:28:56,252 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 01:29:03,602 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1239868.0, ans=0.125 2023-10-13 01:29:14,690 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1239914.6666666667, ans=0.125 2023-10-13 01:29:19,833 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1239961.3333333333, ans=0.125 2023-10-13 01:29:21,998 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.83 vs. limit=22.5 2023-10-13 01:29:29,105 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1240008.0, ans=0.125 2023-10-13 01:29:55,479 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1240101.3333333333, ans=0.125 2023-10-13 01:30:01,284 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.766e+02 1.901e+02 2.094e+02 3.326e+02, threshold=3.802e+02, percent-clipped=0.0 2023-10-13 01:30:12,282 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1240148.0, ans=0.125 2023-10-13 01:30:29,241 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1240194.6666666667, ans=0.125 2023-10-13 01:31:09,359 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1240334.6666666667, ans=0.125 2023-10-13 01:31:15,884 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=1240381.3333333333, ans=0.5 2023-10-13 01:31:24,712 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1240381.3333333333, ans=0.125 2023-10-13 01:31:36,819 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.65 vs. 
limit=12.0 2023-10-13 01:31:43,538 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1240428.0, ans=0.125 2023-10-13 01:31:55,433 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.21 vs. limit=12.0 2023-10-13 01:32:06,485 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1240521.3333333333, ans=0.125 2023-10-13 01:32:14,588 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1240568.0, ans=0.025 2023-10-13 01:32:22,126 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.778e+02 2.024e+02 2.290e+02 3.250e+02, threshold=4.048e+02, percent-clipped=0.0 2023-10-13 01:33:02,642 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1240708.0, ans=0.125 2023-10-13 01:33:17,009 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1240801.3333333333, ans=0.125 2023-10-13 01:33:26,411 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.31 vs. limit=15.0 2023-10-13 01:33:27,971 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1240848.0, ans=0.2 2023-10-13 01:33:48,091 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.90 vs. limit=15.0 2023-10-13 01:34:21,947 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.477e+02 1.704e+02 1.949e+02 2.211e+02 3.432e+02, threshold=3.898e+02, percent-clipped=0.0 2023-10-13 01:34:22,684 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.38 vs. limit=15.0 2023-10-13 01:34:24,412 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1241034.6666666667, ans=0.0 2023-10-13 01:34:25,353 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1241034.6666666667, ans=0.125 2023-10-13 01:34:27,053 INFO [train.py:1031] (0/4) Epoch 20, batch 6500, loss[loss=0.1839, simple_loss=0.2637, pruned_loss=0.05203, over 15952.00 frames. ], tot_loss[loss=0.1911, simple_loss=0.2821, pruned_loss=0.05009, over 31526580.55 frames. ], batch size: 43, lr: 1.73e-03, grad_scale: 16.0 2023-10-13 01:34:42,870 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1241128.0, ans=0.0 2023-10-13 01:34:46,179 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.85 vs. 
limit=15.0 2023-10-13 01:34:52,593 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1241128.0, ans=0.125 2023-10-13 01:35:02,908 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1241174.6666666667, ans=0.2 2023-10-13 01:35:07,443 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1241174.6666666667, ans=0.1 2023-10-13 01:35:23,708 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.59 vs. limit=15.0 2023-10-13 01:35:28,633 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1241268.0, ans=0.0 2023-10-13 01:35:36,082 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1241268.0, ans=0.0 2023-10-13 01:35:36,955 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1241268.0, ans=0.2 2023-10-13 01:35:48,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1241314.6666666667, ans=0.125 2023-10-13 01:35:57,078 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.37 vs. limit=6.0 2023-10-13 01:36:14,003 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1241408.0, ans=0.0 2023-10-13 01:36:18,938 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1241454.6666666667, ans=0.125 2023-10-13 01:36:19,802 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 01:36:20,205 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.83 vs. limit=15.0 2023-10-13 01:36:35,090 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.507e+02 1.776e+02 1.978e+02 2.232e+02 2.664e+02, threshold=3.957e+02, percent-clipped=0.0 2023-10-13 01:36:42,076 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1241548.0, ans=0.0 2023-10-13 01:36:45,074 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.72 vs. limit=12.0 2023-10-13 01:38:12,591 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.89 vs. 
limit=10.0 2023-10-13 01:38:37,030 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 01:38:38,225 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1241968.0, ans=0.0 2023-10-13 01:38:44,326 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1241968.0, ans=0.125 2023-10-13 01:38:45,149 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1241968.0, ans=0.2 2023-10-13 01:38:45,840 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 1.751e+02 1.928e+02 2.102e+02 3.317e+02, threshold=3.857e+02, percent-clipped=0.0 2023-10-13 01:38:51,918 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1242014.6666666667, ans=0.09899494936611666 2023-10-13 01:39:25,057 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1242108.0, ans=0.125 2023-10-13 01:40:01,050 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1242248.0, ans=0.0 2023-10-13 01:40:06,357 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1242294.6666666667, ans=0.125 2023-10-13 01:40:11,123 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1242294.6666666667, ans=0.125 2023-10-13 01:40:38,546 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1242388.0, ans=0.2 2023-10-13 01:40:41,411 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.29 vs. limit=22.5 2023-10-13 01:40:42,408 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1242388.0, ans=0.2 2023-10-13 01:40:52,104 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.682e+02 1.832e+02 2.087e+02 3.465e+02, threshold=3.663e+02, percent-clipped=0.0 2023-10-13 01:41:03,375 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1242481.3333333333, ans=0.1 2023-10-13 01:41:06,922 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.57 vs. limit=15.0 2023-10-13 01:41:23,391 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.28 vs. 
limit=22.5 2023-10-13 01:42:05,782 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1242668.0, ans=0.125 2023-10-13 01:42:10,573 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1242714.6666666667, ans=0.0 2023-10-13 01:42:26,448 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1242761.3333333333, ans=0.1 2023-10-13 01:42:53,063 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1242854.6666666667, ans=0.125 2023-10-13 01:42:53,545 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.34 vs. limit=22.5 2023-10-13 01:43:06,075 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.18 vs. limit=22.5 2023-10-13 01:43:13,273 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.709e+02 1.916e+02 2.092e+02 3.062e+02, threshold=3.831e+02, percent-clipped=0.0 2023-10-13 01:43:19,571 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1242948.0, ans=0.125 2023-10-13 01:43:22,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1242948.0, ans=0.95 2023-10-13 01:43:25,829 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1242948.0, ans=0.0 2023-10-13 01:43:46,205 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.79 vs. limit=15.0 2023-10-13 01:44:40,863 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.whiten.whitening_limit, batch_count=1243228.0, ans=12.0 2023-10-13 01:44:47,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1243228.0, ans=0.0 2023-10-13 01:44:50,589 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.49 vs. limit=15.0 2023-10-13 01:45:00,675 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1243274.6666666667, ans=0.125 2023-10-13 01:45:15,340 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.80 vs. limit=10.0 2023-10-13 01:45:16,977 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1243368.0, ans=0.125 2023-10-13 01:45:19,213 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.561e+02 1.860e+02 2.126e+02 2.378e+02 2.890e+02, threshold=4.252e+02, percent-clipped=0.0 2023-10-13 01:45:24,832 INFO [train.py:1031] (0/4) Epoch 20, batch 7000, loss[loss=0.1861, simple_loss=0.275, pruned_loss=0.04858, over 16927.00 frames. ], tot_loss[loss=0.1911, simple_loss=0.2825, pruned_loss=0.04991, over 31837630.89 frames. 
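The train.py loss fields in the entry above fit a simple relation: the reported loss matches 0.5 * simple_loss + pruned_loss (for the batch-7000 totals, 0.5 * 0.2825 + 0.04991 = 0.19116, i.e. the logged 0.1911; likewise 0.5 * 0.2817 + 0.04963 = 0.19048 for the batch-4500 totals earlier in this stretch). Below is a minimal sketch of that combination, assuming a fixed simple-loss weight of 0.5 inferred from the logged numbers rather than taken from the training code:

    # Hypothetical reconstruction of how the logged loss fields combine;
    # the 0.5 weight on simple_loss is inferred from the numbers above,
    # not lifted from train.py.
    def combined_loss(simple_loss: float, pruned_loss: float,
                      simple_loss_scale: float = 0.5) -> float:
        return simple_loss_scale * simple_loss + pruned_loss

    # Totals and per-batch values from the "Epoch 20, batch 7000" entry:
    assert abs(combined_loss(0.2825, 0.04991) - 0.1911) < 5e-4  # tot_loss
    assert abs(combined_loss(0.2750, 0.04858) - 0.1861) < 5e-4  # batch loss

Under this reading the simple transducer loss enters at half weight while the pruned loss enters at full weight, which is consistent with every train.py:1031 entry in this part of the run.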
], batch size: 138, lr: 1.73e-03, grad_scale: 32.0 2023-10-13 01:45:46,270 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.36 vs. limit=6.0 2023-10-13 01:45:58,238 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1243508.0, ans=0.125 2023-10-13 01:46:20,344 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1243601.3333333333, ans=0.125 2023-10-13 01:46:48,579 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1243694.6666666667, ans=0.0 2023-10-13 01:47:18,213 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1243834.6666666667, ans=0.125 2023-10-13 01:47:22,405 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.773e+02 1.905e+02 2.110e+02 2.844e+02, threshold=3.810e+02, percent-clipped=0.0 2023-10-13 01:47:25,562 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff3.min_abs, batch_count=1243834.6666666667, ans=0.2 2023-10-13 01:47:43,364 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1243928.0, ans=0.2 2023-10-13 01:48:09,489 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1244021.3333333333, ans=0.0 2023-10-13 01:48:21,608 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1244068.0, ans=0.125 2023-10-13 01:48:30,254 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.75 vs. limit=6.0 2023-10-13 01:48:32,460 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1244114.6666666667, ans=0.125 2023-10-13 01:48:33,979 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=6.70 vs. limit=15.0 2023-10-13 01:48:36,460 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1244114.6666666667, ans=0.125 2023-10-13 01:48:38,474 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1244161.3333333333, ans=0.0 2023-10-13 01:48:39,677 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1244161.3333333333, ans=0.1 2023-10-13 01:48:57,847 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1244208.0, ans=0.125 2023-10-13 01:49:20,319 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.98 vs. 
limit=10.0 2023-10-13 01:49:22,327 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.365e+02 1.831e+02 2.004e+02 2.193e+02 3.259e+02, threshold=4.009e+02, percent-clipped=0.0 2023-10-13 01:49:34,335 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1244348.0, ans=0.0 2023-10-13 01:49:42,512 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 01:49:59,035 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1244394.6666666667, ans=0.125 2023-10-13 01:50:07,071 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_ff3.min_abs, batch_count=1244441.3333333333, ans=0.2 2023-10-13 01:50:26,894 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1244488.0, ans=0.0 2023-10-13 01:50:39,617 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1244534.6666666667, ans=0.05 2023-10-13 01:51:01,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1244628.0, ans=0.0 2023-10-13 01:51:13,057 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1244628.0, ans=0.125 2023-10-13 01:51:26,699 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1244721.3333333333, ans=0.125 2023-10-13 01:51:27,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1244721.3333333333, ans=0.0 2023-10-13 01:51:27,974 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.46 vs. limit=6.0 2023-10-13 01:51:28,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1244721.3333333333, ans=0.125 2023-10-13 01:51:46,069 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.351e+02 1.699e+02 1.864e+02 2.091e+02 3.274e+02, threshold=3.729e+02, percent-clipped=0.0 2023-10-13 01:51:50,356 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1244814.6666666667, ans=0.1 2023-10-13 01:52:06,695 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.22 vs. limit=15.0 2023-10-13 01:52:15,120 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1244861.3333333333, ans=0.0 2023-10-13 01:52:22,576 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1244908.0, ans=0.125 2023-10-13 01:52:27,394 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1244908.0, ans=0.2 2023-10-13 01:52:27,552 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.13 vs. 
limit=22.5 2023-10-13 01:52:36,132 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1244954.6666666667, ans=0.1 2023-10-13 01:52:47,890 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1245001.3333333333, ans=0.125 2023-10-13 01:53:30,559 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1245141.3333333333, ans=0.125 2023-10-13 01:53:49,176 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1245188.0, ans=0.1 2023-10-13 01:53:55,443 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1245234.6666666667, ans=0.0 2023-10-13 01:53:55,613 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1245234.6666666667, ans=0.125 2023-10-13 01:53:57,804 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.432e+02 1.737e+02 1.886e+02 2.129e+02 2.813e+02, threshold=3.771e+02, percent-clipped=0.0 2023-10-13 01:53:58,269 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1245234.6666666667, ans=0.1 2023-10-13 01:54:06,386 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.87 vs. limit=6.0 2023-10-13 01:54:42,780 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 01:54:45,795 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1245421.3333333333, ans=0.125 2023-10-13 01:54:46,022 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.90 vs. limit=15.0 2023-10-13 01:54:50,958 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.14 vs. limit=12.0 2023-10-13 01:55:01,712 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1245468.0, ans=0.1 2023-10-13 01:55:01,882 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.27 vs. limit=6.0 2023-10-13 01:55:52,283 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1245701.3333333333, ans=0.125 2023-10-13 01:55:59,704 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.522e+02 1.791e+02 1.965e+02 2.184e+02 2.907e+02, threshold=3.931e+02, percent-clipped=0.0 2023-10-13 01:56:04,523 INFO [train.py:1031] (0/4) Epoch 20, batch 7500, loss[loss=0.2606, simple_loss=0.3232, pruned_loss=0.099, over 15577.00 frames. ], tot_loss[loss=0.1911, simple_loss=0.2822, pruned_loss=0.05001, over 32026881.09 frames. 
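The optim.py:471 entries apparently log a five-number summary (min, Q1, median, Q3, max) of recent gradient norms, and in each one the threshold is, to rounding, Clipping_scale times the median: in the entry just above, 2.0 * 1.965e+02 ≈ 3.931e+02. percent-clipped then reports how often the gradient norm exceeded that moving threshold; it sits at 0.0 through most of this stretch and ticks up to 1.0 occasionally, e.g. in the earlier entry whose max reaches 4.908e+02. A minimal sketch of such a median-based clipping monitor follows, assuming a fixed-size history buffer; GradNormClipper and its bookkeeping are illustrative, not the optimizer's actual code:

    import torch
    from collections import deque

    class GradNormClipper:
        """Sketch: clip at clipping_scale * median of recent grad norms."""

        def __init__(self, clipping_scale: float = 2.0, history: int = 1000):
            self.clipping_scale = clipping_scale
            self.norms = deque(maxlen=history)  # recent total grad norms
            self.num_clipped = 0
            self.num_steps = 0

        def step(self, parameters) -> float:
            grads = [p.grad for p in parameters if p.grad is not None]
            total_norm = torch.norm(
                torch.stack([g.norm() for g in grads])).item()
            self.norms.append(total_norm)
            self.num_steps += 1
            # threshold = clipping_scale * median, matching the logged
            # relation threshold ~= 2.0 * the middle quartile.
            median = sorted(self.norms)[len(self.norms) // 2]
            threshold = self.clipping_scale * median
            if total_norm > threshold:
                self.num_clipped += 1
                for g in grads:
                    g.mul_(threshold / total_norm)  # scale grads down in place
            return threshold

        def percent_clipped(self) -> float:
            return 100.0 * self.num_clipped / max(1, self.num_steps)

On this reading the run looks healthy at this point: gradient norms hover around 1.9e+02 with quartiles tight enough that the 2x-median threshold is almost never hit.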
], batch size: 350, lr: 1.73e-03, grad_scale: 16.0 2023-10-13 01:56:15,717 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 01:56:30,519 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1245841.3333333333, ans=0.0 2023-10-13 01:56:35,186 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.87 vs. limit=10.0 2023-10-13 01:56:35,876 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 01:56:43,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_positive, batch_count=1245888.0, ans=0.05 2023-10-13 01:56:50,554 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1245888.0, ans=0.1 2023-10-13 01:57:13,318 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1245981.3333333333, ans=0.125 2023-10-13 01:57:15,153 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1245981.3333333333, ans=0.125 2023-10-13 01:57:17,352 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.42 vs. limit=6.0 2023-10-13 01:57:21,338 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1246028.0, ans=0.0 2023-10-13 01:57:32,570 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1246074.6666666667, ans=0.0 2023-10-13 01:57:35,251 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. limit=6.0 2023-10-13 01:57:40,938 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1246074.6666666667, ans=0.125 2023-10-13 01:57:45,319 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1246121.3333333333, ans=0.0 2023-10-13 01:58:02,115 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 1.672e+02 1.852e+02 2.071e+02 2.685e+02, threshold=3.704e+02, percent-clipped=0.0 2023-10-13 01:58:02,414 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1246168.0, ans=0.1 2023-10-13 01:58:16,224 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.62 vs. 
limit=15.0 2023-10-13 01:58:17,176 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1246261.3333333333, ans=0.125 2023-10-13 01:58:37,575 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1246308.0, ans=0.125 2023-10-13 01:58:45,146 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 01:58:49,688 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1246354.6666666667, ans=0.1 2023-10-13 01:59:38,559 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.62 vs. limit=15.0 2023-10-13 01:59:42,583 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 02:00:10,509 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1246634.6666666667, ans=0.125 2023-10-13 02:00:14,180 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.529e+02 1.762e+02 1.948e+02 2.201e+02 2.777e+02, threshold=3.895e+02, percent-clipped=0.0 2023-10-13 02:00:15,425 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1246634.6666666667, ans=0.125 2023-10-13 02:00:27,950 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1246681.3333333333, ans=0.125 2023-10-13 02:00:30,540 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1246728.0, ans=0.1 2023-10-13 02:00:48,542 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1246774.6666666667, ans=0.125 2023-10-13 02:01:05,403 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1246821.3333333333, ans=0.09899494936611666 2023-10-13 02:01:06,355 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1246821.3333333333, ans=0.125 2023-10-13 02:01:44,632 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1247008.0, ans=0.0 2023-10-13 02:02:05,160 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1247054.6666666667, ans=0.125 2023-10-13 02:02:05,181 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 02:02:15,604 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1247101.3333333333, ans=0.0 2023-10-13 02:02:18,127 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.526e+02 1.818e+02 1.994e+02 2.287e+02 3.379e+02, threshold=3.988e+02, percent-clipped=0.0 2023-10-13 02:02:29,865 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1247148.0, ans=0.0 2023-10-13 02:03:20,018 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.28 vs. 
limit=10.0 2023-10-13 02:03:26,582 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1247334.6666666667, ans=0.0 2023-10-13 02:03:41,973 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1247381.3333333333, ans=15.0 2023-10-13 02:04:33,248 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.389e+02 1.765e+02 1.894e+02 2.122e+02 3.118e+02, threshold=3.787e+02, percent-clipped=0.0 2023-10-13 02:04:53,059 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1247661.3333333333, ans=0.0 2023-10-13 02:04:54,139 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1247661.3333333333, ans=0.0 2023-10-13 02:05:00,812 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1247708.0, ans=0.0 2023-10-13 02:05:00,826 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1247708.0, ans=0.125 2023-10-13 02:05:15,695 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1247754.6666666667, ans=0.09899494936611666 2023-10-13 02:05:27,677 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1247801.3333333333, ans=0.125 2023-10-13 02:05:34,856 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.53 vs. limit=15.0 2023-10-13 02:05:37,368 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1247801.3333333333, ans=0.0 2023-10-13 02:05:46,509 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1247848.0, ans=0.1 2023-10-13 02:06:41,429 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.331e+02 1.664e+02 1.829e+02 2.080e+02 2.499e+02, threshold=3.659e+02, percent-clipped=0.0 2023-10-13 02:06:44,593 INFO [train.py:1031] (0/4) Epoch 20, batch 8000, loss[loss=0.2018, simple_loss=0.2981, pruned_loss=0.05273, over 16848.00 frames. ], tot_loss[loss=0.1905, simple_loss=0.2818, pruned_loss=0.04961, over 32203031.63 frames. ], batch size: 146, lr: 1.73e-03, grad_scale: 32.0 2023-10-13 02:06:45,953 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1248081.3333333333, ans=0.0 2023-10-13 02:06:55,180 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1248081.3333333333, ans=0.2 2023-10-13 02:07:18,713 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.93 vs. 
limit=22.5 2023-10-13 02:07:27,533 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1248221.3333333333, ans=0.125 2023-10-13 02:07:42,432 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1248314.6666666667, ans=0.125 2023-10-13 02:08:03,788 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1248408.0, ans=0.125 2023-10-13 02:08:33,470 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.781e+02 2.006e+02 2.355e+02 3.461e+02, threshold=4.012e+02, percent-clipped=0.0 2023-10-13 02:08:45,664 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.07 vs. limit=15.0 2023-10-13 02:08:54,440 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.23 vs. limit=15.0 2023-10-13 02:08:56,050 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1248594.6666666667, ans=0.0 2023-10-13 02:09:12,562 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1248688.0, ans=0.1 2023-10-13 02:09:18,129 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1248688.0, ans=0.125 2023-10-13 02:09:29,901 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1248734.6666666667, ans=0.125 2023-10-13 02:09:33,409 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1248734.6666666667, ans=0.125 2023-10-13 02:09:58,913 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1248828.0, ans=0.125 2023-10-13 02:10:00,703 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1248828.0, ans=0.1 2023-10-13 02:10:08,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1248828.0, ans=0.015 2023-10-13 02:10:16,461 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1248874.6666666667, ans=0.07 2023-10-13 02:10:27,538 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.60 vs. limit=15.0 2023-10-13 02:10:55,337 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.548e+02 1.706e+02 1.874e+02 2.107e+02 3.729e+02, threshold=3.747e+02, percent-clipped=0.0 2023-10-13 02:10:56,752 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.62 vs. limit=15.0 2023-10-13 02:11:19,674 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1249061.3333333333, ans=0.125 2023-10-13 02:11:24,273 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.89 vs. 
limit=22.5 2023-10-13 02:11:34,715 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1249154.6666666667, ans=0.1 2023-10-13 02:11:40,876 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.23 vs. limit=22.5 2023-10-13 02:12:04,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1249248.0, ans=0.5 2023-10-13 02:12:34,801 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1249341.3333333333, ans=0.1 2023-10-13 02:12:35,613 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1249341.3333333333, ans=0.0 2023-10-13 02:12:44,243 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1249388.0, ans=0.0 2023-10-13 02:13:01,056 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.388e+02 1.741e+02 1.945e+02 2.141e+02 3.420e+02, threshold=3.890e+02, percent-clipped=0.0 2023-10-13 02:13:19,237 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1249528.0, ans=0.125 2023-10-13 02:13:53,109 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-13 02:13:54,210 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1249668.0, ans=0.0 2023-10-13 02:14:24,818 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=7.88 vs. limit=15.0 2023-10-13 02:14:27,103 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.87 vs. limit=15.0 2023-10-13 02:14:29,921 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1249808.0, ans=0.2 2023-10-13 02:14:31,991 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1249808.0, ans=0.125 2023-10-13 02:14:35,152 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 02:14:50,125 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1249901.3333333333, ans=0.125 2023-10-13 02:14:50,644 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.48 vs. 
limit=15.0 2023-10-13 02:14:55,455 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1249901.3333333333, ans=0.0 2023-10-13 02:14:58,762 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.781e+02 1.902e+02 2.167e+02 3.193e+02, threshold=3.803e+02, percent-clipped=0.0 2023-10-13 02:15:02,134 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1249948.0, ans=0.0 2023-10-13 02:15:03,608 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1249948.0, ans=0.0 2023-10-13 02:15:05,200 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1249948.0, ans=0.5 2023-10-13 02:15:19,771 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1249994.6666666667, ans=0.0 2023-10-13 02:15:23,732 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1249994.6666666667, ans=0.0 2023-10-13 02:15:41,787 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1250088.0, ans=0.125 2023-10-13 02:15:49,548 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1250088.0, ans=0.1 2023-10-13 02:16:37,175 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=15.0 2023-10-13 02:16:51,948 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1250321.3333333333, ans=0.125 2023-10-13 02:16:54,255 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1250321.3333333333, ans=0.125 2023-10-13 02:16:57,246 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1250368.0, ans=0.125 2023-10-13 02:17:13,364 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.496e+02 1.729e+02 1.880e+02 2.122e+02 2.886e+02, threshold=3.760e+02, percent-clipped=0.0 2023-10-13 02:17:14,499 INFO [train.py:1031] (0/4) Epoch 20, batch 8500, loss[loss=0.1871, simple_loss=0.2867, pruned_loss=0.04378, over 16978.00 frames. ], tot_loss[loss=0.1909, simple_loss=0.2823, pruned_loss=0.04973, over 32317512.09 frames. ], batch size: 93, lr: 1.73e-03, grad_scale: 16.0 2023-10-13 02:17:45,795 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.32 vs. 
limit=10.0 2023-10-13 02:17:51,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1250508.0, ans=0.07 2023-10-13 02:17:54,934 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1250554.6666666667, ans=0.0 2023-10-13 02:17:59,234 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-13 02:18:13,854 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.42 vs. limit=15.0 2023-10-13 02:18:25,102 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1250648.0, ans=0.125 2023-10-13 02:18:31,464 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1250694.6666666667, ans=0.0 2023-10-13 02:18:39,105 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1250694.6666666667, ans=0.125 2023-10-13 02:18:55,690 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1250788.0, ans=0.0 2023-10-13 02:18:58,659 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.01 vs. limit=15.0 2023-10-13 02:19:03,393 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1250788.0, ans=0.0 2023-10-13 02:19:20,347 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1250834.6666666667, ans=0.125 2023-10-13 02:19:23,949 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.519e+02 1.818e+02 2.012e+02 2.364e+02 3.735e+02, threshold=4.024e+02, percent-clipped=0.0 2023-10-13 02:19:29,770 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.39 vs. 
limit=12.0 2023-10-13 02:19:47,369 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1250928.0, ans=0.2 2023-10-13 02:20:28,162 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1251114.6666666667, ans=0.125 2023-10-13 02:20:35,967 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1251114.6666666667, ans=0.125 2023-10-13 02:21:21,265 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1251254.6666666667, ans=0.125 2023-10-13 02:21:35,279 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1251301.3333333333, ans=0.1 2023-10-13 02:21:38,573 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.372e+02 1.737e+02 1.932e+02 2.242e+02 3.015e+02, threshold=3.864e+02, percent-clipped=0.0 2023-10-13 02:22:05,900 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1251441.3333333333, ans=0.0 2023-10-13 02:22:27,144 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.53 vs. limit=10.0 2023-10-13 02:22:31,455 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.35 vs. limit=15.0 2023-10-13 02:22:38,026 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 02:22:52,653 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.67 vs. limit=6.0 2023-10-13 02:23:02,829 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.89 vs. limit=15.0 2023-10-13 02:23:27,467 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.40 vs. limit=15.0 2023-10-13 02:23:55,575 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.458e+02 1.716e+02 1.886e+02 2.189e+02 3.700e+02, threshold=3.771e+02, percent-clipped=0.0 2023-10-13 02:24:03,474 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1251814.6666666667, ans=0.1 2023-10-13 02:24:56,747 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.21 vs. limit=15.0 2023-10-13 02:24:57,651 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.16 vs. limit=15.0 2023-10-13 02:25:14,098 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1252094.6666666667, ans=0.125 2023-10-13 02:25:24,586 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.80 vs. 
limit=12.0 2023-10-13 02:25:31,365 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1252188.0, ans=0.125 2023-10-13 02:25:31,368 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1252188.0, ans=0.0 2023-10-13 02:25:45,252 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1252234.6666666667, ans=0.2 2023-10-13 02:25:59,078 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.795e+02 1.967e+02 2.153e+02 3.302e+02, threshold=3.934e+02, percent-clipped=0.0 2023-10-13 02:26:14,055 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.23 vs. limit=22.5 2023-10-13 02:26:17,287 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1252328.0, ans=0.0 2023-10-13 02:26:54,052 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 02:26:56,129 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1252514.6666666667, ans=0.125 2023-10-13 02:26:56,144 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1252514.6666666667, ans=0.1 2023-10-13 02:27:07,810 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.98 vs. limit=22.5 2023-10-13 02:27:38,161 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1252654.6666666667, ans=0.1 2023-10-13 02:27:54,965 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.818e+02 1.957e+02 2.206e+02 3.756e+02, threshold=3.913e+02, percent-clipped=0.0 2023-10-13 02:27:55,018 INFO [train.py:1031] (0/4) Epoch 20, batch 9000, loss[loss=0.1901, simple_loss=0.2914, pruned_loss=0.04437, over 16859.00 frames. ], tot_loss[loss=0.1902, simple_loss=0.2816, pruned_loss=0.04941, over 32438309.67 frames. ], batch size: 98, lr: 1.72e-03, grad_scale: 16.0 2023-10-13 02:28:04,037 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1252748.0, ans=0.125 2023-10-13 02:28:48,221 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=5.49 vs. 
limit=15.0 2023-10-13 02:28:57,647 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1252981.3333333333, ans=0.1 2023-10-13 02:28:58,507 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1252981.3333333333, ans=0.125 2023-10-13 02:29:04,113 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1253028.0, ans=0.125 2023-10-13 02:29:05,201 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1253028.0, ans=0.125 2023-10-13 02:29:22,562 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1253074.6666666667, ans=0.2 2023-10-13 02:29:27,851 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1253121.3333333333, ans=0.0 2023-10-13 02:29:52,516 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.755e+02 1.935e+02 2.151e+02 3.003e+02, threshold=3.870e+02, percent-clipped=0.0 2023-10-13 02:30:02,546 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1253214.6666666667, ans=0.125 2023-10-13 02:30:14,149 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1253261.3333333333, ans=0.0 2023-10-13 02:30:27,252 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1253308.0, ans=0.125 2023-10-13 02:30:36,851 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1253354.6666666667, ans=0.125 2023-10-13 02:30:36,927 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1253354.6666666667, ans=0.0 2023-10-13 02:30:41,482 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1253401.3333333333, ans=0.0 2023-10-13 02:30:48,648 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1253401.3333333333, ans=0.2 2023-10-13 02:31:02,251 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=9.82 vs. 
limit=12.0 2023-10-13 02:31:09,826 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1253494.6666666667, ans=0.04949747468305833 2023-10-13 02:31:12,718 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1253494.6666666667, ans=0.0 2023-10-13 02:31:19,656 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1253541.3333333333, ans=0.0 2023-10-13 02:31:33,125 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1253588.0, ans=0.0 2023-10-13 02:31:41,870 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1253634.6666666667, ans=0.125 2023-10-13 02:31:55,291 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.441e+02 1.809e+02 2.009e+02 2.263e+02 3.190e+02, threshold=4.017e+02, percent-clipped=0.0 2023-10-13 02:32:32,198 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.27 vs. limit=15.0 2023-10-13 02:32:33,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1253821.3333333333, ans=0.125 2023-10-13 02:32:37,617 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1253821.3333333333, ans=0.035 2023-10-13 02:32:38,024 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.01 vs. limit=15.0 2023-10-13 02:33:02,030 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.33 vs. limit=15.0 2023-10-13 02:33:21,715 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1254008.0, ans=0.125 2023-10-13 02:33:29,479 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 02:33:31,491 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.62 vs. limit=15.0 2023-10-13 02:33:36,252 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1254054.6666666667, ans=0.125 2023-10-13 02:33:37,168 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1254054.6666666667, ans=0.0 2023-10-13 02:33:45,434 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1254101.3333333333, ans=0.125 2023-10-13 02:33:56,178 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.454e+02 1.877e+02 2.108e+02 2.430e+02 3.193e+02, threshold=4.215e+02, percent-clipped=0.0 2023-10-13 02:34:04,627 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1254194.6666666667, ans=0.2 2023-10-13 02:34:23,666 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.10 vs. 
limit=15.0 2023-10-13 02:35:06,533 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1254381.3333333333, ans=0.0 2023-10-13 02:35:27,461 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.87 vs. limit=15.0 2023-10-13 02:35:37,850 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 02:35:55,309 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1254568.0, ans=0.125 2023-10-13 02:35:59,354 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1254568.0, ans=0.0 2023-10-13 02:36:00,564 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=4.202e-02 2023-10-13 02:36:11,660 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.542e+02 1.796e+02 1.954e+02 2.171e+02 2.877e+02, threshold=3.907e+02, percent-clipped=0.0 2023-10-13 02:37:03,757 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.12 vs. limit=15.0 2023-10-13 02:37:55,964 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1254941.3333333333, ans=0.0 2023-10-13 02:38:02,367 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.12 vs. limit=15.0 2023-10-13 02:38:25,518 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1255034.6666666667, ans=0.125 2023-10-13 02:38:33,073 INFO [train.py:1031] (0/4) Epoch 20, batch 9500, loss[loss=0.1812, simple_loss=0.2831, pruned_loss=0.03963, over 16861.00 frames. ], tot_loss[loss=0.1908, simple_loss=0.2824, pruned_loss=0.0496, over 32516170.96 frames. 
], batch size: 98, lr: 1.72e-03, grad_scale: 16.0 2023-10-13 02:38:34,739 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.448e+02 1.781e+02 1.951e+02 2.224e+02 2.931e+02, threshold=3.902e+02, percent-clipped=0.0 2023-10-13 02:38:43,573 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1255081.3333333333, ans=0.1 2023-10-13 02:38:53,296 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1255128.0, ans=0.2 2023-10-13 02:39:39,361 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1255314.6666666667, ans=0.125 2023-10-13 02:39:41,296 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1255314.6666666667, ans=0.0 2023-10-13 02:39:55,574 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1255361.3333333333, ans=0.125 2023-10-13 02:39:57,821 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1255361.3333333333, ans=0.07 2023-10-13 02:40:27,153 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1255454.6666666667, ans=0.125 2023-10-13 02:40:43,956 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.440e+02 1.736e+02 1.942e+02 2.299e+02 5.234e+02, threshold=3.884e+02, percent-clipped=2.0 2023-10-13 02:41:18,327 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1255641.3333333333, ans=0.0 2023-10-13 02:41:23,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1255688.0, ans=0.0 2023-10-13 02:41:28,081 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1255688.0, ans=0.07 2023-10-13 02:41:33,100 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.42 vs. limit=22.5 2023-10-13 02:41:43,722 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 02:41:46,020 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1255781.3333333333, ans=0.1 2023-10-13 02:41:46,030 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1255781.3333333333, ans=0.125 2023-10-13 02:41:47,334 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=10.28 vs. 
limit=12.0 2023-10-13 02:41:54,387 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1255828.0, ans=0.0 2023-10-13 02:41:57,820 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1255828.0, ans=0.2 2023-10-13 02:42:04,884 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1255828.0, ans=0.125 2023-10-13 02:42:12,227 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1255874.6666666667, ans=0.125 2023-10-13 02:42:24,366 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1255921.3333333333, ans=0.2 2023-10-13 02:42:37,620 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=1255968.0, ans=0.05 2023-10-13 02:42:46,406 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.459e+02 1.735e+02 1.873e+02 2.068e+02 2.626e+02, threshold=3.746e+02, percent-clipped=0.0 2023-10-13 02:42:50,669 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1256014.6666666667, ans=0.125 2023-10-13 02:42:57,871 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.92 vs. limit=10.0 2023-10-13 02:43:12,467 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.69 vs. limit=12.0 2023-10-13 02:43:24,749 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.92 vs. limit=12.0 2023-10-13 02:43:49,429 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1256248.0, ans=0.0 2023-10-13 02:44:02,457 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1256294.6666666667, ans=0.125 2023-10-13 02:44:07,822 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1256294.6666666667, ans=0.0 2023-10-13 02:44:07,887 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1256294.6666666667, ans=0.125 2023-10-13 02:44:16,422 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1256341.3333333333, ans=0.2 2023-10-13 02:44:21,959 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1256341.3333333333, ans=0.0 2023-10-13 02:44:33,444 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1256388.0, ans=0.125 2023-10-13 02:44:47,026 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.26 vs. 
limit=10.0 2023-10-13 02:44:51,772 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1256434.6666666667, ans=0.125 2023-10-13 02:44:59,379 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.464e+02 1.721e+02 1.903e+02 2.089e+02 2.973e+02, threshold=3.806e+02, percent-clipped=0.0 2023-10-13 02:45:14,397 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1256528.0, ans=0.0 2023-10-13 02:45:21,071 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.78 vs. limit=22.5 2023-10-13 02:45:21,711 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1256528.0, ans=0.0 2023-10-13 02:45:24,699 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1256574.6666666667, ans=0.125 2023-10-13 02:45:34,809 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.97 vs. limit=22.5 2023-10-13 02:45:53,528 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1256668.0, ans=0.125 2023-10-13 02:46:09,428 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1256714.6666666667, ans=0.0 2023-10-13 02:46:23,063 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1256761.3333333333, ans=0.0 2023-10-13 02:46:30,332 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1256761.3333333333, ans=0.0 2023-10-13 02:47:08,843 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1256901.3333333333, ans=0.125 2023-10-13 02:47:14,109 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.701e+02 1.818e+02 1.988e+02 2.797e+02, threshold=3.636e+02, percent-clipped=0.0 2023-10-13 02:47:22,325 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1256948.0, ans=0.0 2023-10-13 02:47:27,478 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1256994.6666666667, ans=0.125 2023-10-13 02:47:35,279 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1256994.6666666667, ans=0.1 2023-10-13 02:48:02,481 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1257088.0, ans=0.1 2023-10-13 02:48:07,283 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1257134.6666666667, ans=0.5 2023-10-13 02:48:25,974 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1257181.3333333333, ans=15.0 2023-10-13 02:48:30,787 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1257228.0, ans=0.0 2023-10-13 02:48:45,987 INFO [scaling.py:979] (0/4) Whitening: 
name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.11 vs. limit=22.5 2023-10-13 02:48:49,996 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1257274.6666666667, ans=0.0 2023-10-13 02:48:50,036 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1257274.6666666667, ans=0.125 2023-10-13 02:48:57,343 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1257321.3333333333, ans=0.125 2023-10-13 02:49:07,403 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1257321.3333333333, ans=0.2 2023-10-13 02:49:20,661 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1257368.0, ans=0.125 2023-10-13 02:49:21,680 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1257414.6666666667, ans=0.1 2023-10-13 02:49:22,227 INFO [train.py:1031] (0/4) Epoch 20, batch 10000, loss[loss=0.1876, simple_loss=0.2729, pruned_loss=0.05114, over 16376.00 frames. ], tot_loss[loss=0.19, simple_loss=0.2815, pruned_loss=0.04927, over 32585823.83 frames. ], batch size: 50, lr: 1.72e-03, grad_scale: 32.0 2023-10-13 02:49:24,287 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.758e+02 2.009e+02 2.264e+02 3.112e+02, threshold=4.017e+02, percent-clipped=0.0 2023-10-13 02:49:30,695 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1257414.6666666667, ans=0.125 2023-10-13 02:50:15,169 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1257554.6666666667, ans=0.125 2023-10-13 02:50:41,587 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1257648.0, ans=0.0 2023-10-13 02:50:53,546 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1257694.6666666667, ans=0.125 2023-10-13 02:51:00,854 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.99 vs. limit=10.0 2023-10-13 02:51:03,031 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=4.97 vs. limit=15.0 2023-10-13 02:51:11,175 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1257741.3333333333, ans=0.07 2023-10-13 02:51:21,420 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1257788.0, ans=0.125 2023-10-13 02:51:37,866 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.11 vs. 
limit=10.0 2023-10-13 02:51:44,439 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.775e+02 1.903e+02 2.144e+02 4.038e+02, threshold=3.807e+02, percent-clipped=1.0 2023-10-13 02:51:48,928 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1257881.3333333333, ans=0.125 2023-10-13 02:52:17,888 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1257974.6666666667, ans=0.125 2023-10-13 02:52:17,952 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1257974.6666666667, ans=0.125 2023-10-13 02:52:49,853 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1258068.0, ans=0.035 2023-10-13 02:52:57,390 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1258068.0, ans=0.125 2023-10-13 02:53:07,668 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1258114.6666666667, ans=0.0 2023-10-13 02:53:07,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1258114.6666666667, ans=0.0 2023-10-13 02:53:17,492 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1258114.6666666667, ans=0.125 2023-10-13 02:53:57,193 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1258254.6666666667, ans=0.125 2023-10-13 02:54:00,411 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1258254.6666666667, ans=0.125 2023-10-13 02:54:10,928 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1258301.3333333333, ans=0.125 2023-10-13 02:54:19,738 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.356e+02 1.740e+02 1.872e+02 2.130e+02 2.834e+02, threshold=3.744e+02, percent-clipped=0.0 2023-10-13 02:54:22,509 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1258348.0, ans=0.2 2023-10-13 02:54:29,623 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1258348.0, ans=0.125 2023-10-13 02:54:54,835 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1258441.3333333333, ans=0.0 2023-10-13 02:54:58,565 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1258441.3333333333, ans=0.125 2023-10-13 02:55:25,910 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1258534.6666666667, ans=0.0 2023-10-13 02:56:38,002 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1258768.0, ans=0.0 2023-10-13 02:56:42,545 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.807e+02 1.991e+02 2.225e+02 3.102e+02, threshold=3.983e+02, percent-clipped=0.0 2023-10-13 02:56:44,390 INFO [scaling.py:199] (0/4) 
ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1258814.6666666667, ans=0.125 2023-10-13 02:56:59,317 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1258861.3333333333, ans=0.2 2023-10-13 02:57:19,629 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 02:57:30,702 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1258954.6666666667, ans=0.125 2023-10-13 02:57:46,995 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1259001.3333333333, ans=0.0 2023-10-13 02:57:48,300 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1259001.3333333333, ans=0.125 2023-10-13 02:57:52,352 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.05 vs. limit=15.0 2023-10-13 02:58:37,019 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1259188.0, ans=0.05 2023-10-13 02:59:01,631 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1259234.6666666667, ans=0.1 2023-10-13 02:59:07,101 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.452e+02 1.780e+02 1.945e+02 2.203e+02 3.487e+02, threshold=3.890e+02, percent-clipped=0.0 2023-10-13 02:59:22,934 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.45 vs. limit=15.0 2023-10-13 02:59:28,747 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.42 vs. limit=15.0 2023-10-13 02:59:30,042 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.89 vs. limit=15.0 2023-10-13 02:59:35,419 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.16 vs. limit=15.0 2023-10-13 03:00:18,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1259468.0, ans=0.125 2023-10-13 03:00:27,092 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1259514.6666666667, ans=0.1 2023-10-13 03:00:29,020 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1259514.6666666667, ans=0.125 2023-10-13 03:00:36,444 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.90 vs. 
limit=12.0 2023-10-13 03:00:41,987 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 03:00:48,799 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1259608.0, ans=0.0 2023-10-13 03:01:27,529 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1259701.3333333333, ans=0.125 2023-10-13 03:01:34,784 INFO [train.py:1031] (0/4) Epoch 20, batch 10500, loss[loss=0.179, simple_loss=0.28, pruned_loss=0.03901, over 16881.00 frames. ], tot_loss[loss=0.1904, simple_loss=0.282, pruned_loss=0.04935, over 32658726.26 frames. ], batch size: 104, lr: 1.72e-03, grad_scale: 16.0 2023-10-13 03:01:37,849 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.379e+02 1.735e+02 1.885e+02 2.110e+02 3.035e+02, threshold=3.769e+02, percent-clipped=0.0 2023-10-13 03:02:20,275 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.10 vs. limit=15.0 2023-10-13 03:02:34,034 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1259934.6666666667, ans=0.2 2023-10-13 03:02:43,630 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1259981.3333333333, ans=0.125 2023-10-13 03:02:58,666 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1260028.0, ans=0.125 2023-10-13 03:03:30,987 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 03:03:59,756 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.76 vs. limit=15.0 2023-10-13 03:04:08,002 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.458e+02 1.717e+02 1.874e+02 2.102e+02 2.853e+02, threshold=3.749e+02, percent-clipped=0.0 2023-10-13 03:04:19,905 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1260214.6666666667, ans=0.0 2023-10-13 03:04:50,745 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1260308.0, ans=0.125 2023-10-13 03:04:51,869 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1260308.0, ans=0.2 2023-10-13 03:04:51,997 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.40 vs. 
limit=10.0 2023-10-13 03:05:26,657 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1260448.0, ans=0.0 2023-10-13 03:06:07,993 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1260541.3333333333, ans=0.125 2023-10-13 03:06:15,532 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1260588.0, ans=0.0 2023-10-13 03:06:53,859 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.709e+02 1.859e+02 2.065e+02 3.114e+02, threshold=3.717e+02, percent-clipped=0.0 2023-10-13 03:07:01,737 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1260681.3333333333, ans=0.09899494936611666 2023-10-13 03:07:19,383 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1260774.6666666667, ans=0.09899494936611666 2023-10-13 03:08:56,009 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1261054.6666666667, ans=0.125 2023-10-13 03:08:56,055 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1261054.6666666667, ans=0.09899494936611666 2023-10-13 03:09:21,345 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1261101.3333333333, ans=0.1 2023-10-13 03:09:27,332 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.356e+02 1.764e+02 1.911e+02 2.103e+02 2.798e+02, threshold=3.822e+02, percent-clipped=0.0 2023-10-13 03:09:31,897 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1261148.0, ans=0.0 2023-10-13 03:09:56,365 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1261241.3333333333, ans=0.125 2023-10-13 03:10:02,720 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.15 vs. limit=22.5 2023-10-13 03:10:29,422 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1261334.6666666667, ans=0.1 2023-10-13 03:10:46,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1261381.3333333333, ans=0.0 2023-10-13 03:10:58,828 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.21 vs. limit=15.0 2023-10-13 03:11:12,031 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1261474.6666666667, ans=0.025 2023-10-13 03:11:14,258 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1261474.6666666667, ans=0.2 2023-10-13 03:11:27,582 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.94 vs. 
limit=22.5 2023-10-13 03:11:29,604 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1261521.3333333333, ans=0.2 2023-10-13 03:11:54,904 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.278e+02 1.716e+02 1.869e+02 2.080e+02 2.725e+02, threshold=3.739e+02, percent-clipped=0.0 2023-10-13 03:12:37,828 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1261708.0, ans=0.125 2023-10-13 03:12:55,846 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1261801.3333333333, ans=0.125 2023-10-13 03:13:04,110 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1261801.3333333333, ans=0.05 2023-10-13 03:13:07,709 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.96 vs. limit=6.0 2023-10-13 03:13:10,772 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1261848.0, ans=0.1 2023-10-13 03:13:14,129 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1261848.0, ans=0.0 2023-10-13 03:13:19,708 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1261848.0, ans=0.0 2023-10-13 03:13:37,081 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1261941.3333333333, ans=0.125 2023-10-13 03:13:48,617 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1261941.3333333333, ans=0.125 2023-10-13 03:13:53,166 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1261988.0, ans=0.0 2023-10-13 03:14:07,734 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.39 vs. limit=15.0 2023-10-13 03:14:13,530 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1262034.6666666667, ans=0.0 2023-10-13 03:14:25,561 INFO [train.py:1031] (0/4) Epoch 20, batch 11000, loss[loss=0.1898, simple_loss=0.2913, pruned_loss=0.04418, over 16839.00 frames. ], tot_loss[loss=0.1905, simple_loss=0.282, pruned_loss=0.04951, over 32660124.79 frames. 
], batch size: 87, lr: 1.72e-03, grad_scale: 32.0 2023-10-13 03:14:28,578 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.773e+02 1.923e+02 2.227e+02 3.061e+02, threshold=3.846e+02, percent-clipped=0.0 2023-10-13 03:14:32,517 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1262081.3333333333, ans=0.1 2023-10-13 03:14:37,716 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1262081.3333333333, ans=0.0 2023-10-13 03:14:43,875 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1262128.0, ans=0.125 2023-10-13 03:15:07,729 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1262174.6666666667, ans=0.5 2023-10-13 03:15:07,830 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 03:15:13,657 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1262221.3333333333, ans=0.0 2023-10-13 03:15:19,045 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.27 vs. limit=22.5 2023-10-13 03:15:40,249 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1262314.6666666667, ans=0.125 2023-10-13 03:15:43,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1262314.6666666667, ans=0.2 2023-10-13 03:15:44,260 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1262314.6666666667, ans=0.125 2023-10-13 03:16:05,724 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1262361.3333333333, ans=0.1 2023-10-13 03:16:17,923 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1262408.0, ans=0.0 2023-10-13 03:16:35,144 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1262454.6666666667, ans=0.125 2023-10-13 03:16:54,804 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1262501.3333333333, ans=0.125 2023-10-13 03:17:04,878 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.01 vs. 
limit=10.0 2023-10-13 03:17:06,679 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.606e+02 1.789e+02 1.986e+02 2.277e+02 3.191e+02, threshold=3.972e+02, percent-clipped=0.0 2023-10-13 03:18:15,716 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1262734.6666666667, ans=0.125 2023-10-13 03:18:16,646 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1262734.6666666667, ans=0.025 2023-10-13 03:18:45,289 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.68 vs. limit=15.0 2023-10-13 03:19:02,176 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.77 vs. limit=6.0 2023-10-13 03:19:32,412 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1262968.0, ans=0.125 2023-10-13 03:19:35,323 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1262968.0, ans=0.1 2023-10-13 03:19:36,281 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1262968.0, ans=0.125 2023-10-13 03:19:56,095 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.410e+02 1.789e+02 1.996e+02 2.267e+02 3.207e+02, threshold=3.992e+02, percent-clipped=0.0 2023-10-13 03:20:27,321 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1263108.0, ans=0.125 2023-10-13 03:20:34,740 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1263154.6666666667, ans=0.0 2023-10-13 03:20:41,632 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1263154.6666666667, ans=0.125 2023-10-13 03:20:49,738 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.64 vs. limit=15.0 2023-10-13 03:20:50,866 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.53 vs. limit=15.0 2023-10-13 03:20:51,535 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1263201.3333333333, ans=0.125 2023-10-13 03:21:33,806 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.38 vs. limit=10.0 2023-10-13 03:21:33,850 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.76 vs. 
limit=15.0 2023-10-13 03:21:53,519 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1263388.0, ans=0.0 2023-10-13 03:21:57,253 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1263434.6666666667, ans=0.0 2023-10-13 03:22:09,085 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1263481.3333333333, ans=0.07 2023-10-13 03:22:13,937 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.417e+02 1.706e+02 1.905e+02 2.091e+02 2.802e+02, threshold=3.811e+02, percent-clipped=0.0 2023-10-13 03:22:38,376 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1263574.6666666667, ans=0.0 2023-10-13 03:22:40,381 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1263574.6666666667, ans=0.2 2023-10-13 03:22:47,086 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.47 vs. limit=15.0 2023-10-13 03:22:51,112 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1263621.3333333333, ans=0.1 2023-10-13 03:22:56,075 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1263668.0, ans=0.125 2023-10-13 03:23:16,511 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.60 vs. limit=15.0 2023-10-13 03:23:31,782 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1263761.3333333333, ans=0.0 2023-10-13 03:23:39,168 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=11.54 vs. limit=12.0 2023-10-13 03:23:41,998 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1263808.0, ans=0.5 2023-10-13 03:24:13,010 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.770e+02 1.930e+02 2.181e+02 3.103e+02, threshold=3.859e+02, percent-clipped=0.0 2023-10-13 03:24:27,315 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1263994.6666666667, ans=0.125 2023-10-13 03:24:27,656 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=6.29 vs. 
limit=15.0 2023-10-13 03:25:03,945 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1264134.6666666667, ans=0.0 2023-10-13 03:25:13,361 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1264134.6666666667, ans=0.2 2023-10-13 03:25:35,718 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1264228.0, ans=0.125 2023-10-13 03:25:39,645 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1264274.6666666667, ans=0.0 2023-10-13 03:26:01,791 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.80 vs. limit=15.0 2023-10-13 03:26:13,108 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1264368.0, ans=0.0 2023-10-13 03:26:18,935 INFO [train.py:1031] (0/4) Epoch 20, batch 11500, loss[loss=0.1879, simple_loss=0.2885, pruned_loss=0.0436, over 16862.00 frames. ], tot_loss[loss=0.19, simple_loss=0.2816, pruned_loss=0.04921, over 32710327.85 frames. ], batch size: 155, lr: 1.72e-03, grad_scale: 16.0 2023-10-13 03:26:23,575 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.516e+02 1.873e+02 2.038e+02 2.286e+02 3.197e+02, threshold=4.076e+02, percent-clipped=0.0 2023-10-13 03:26:45,137 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.59 vs. limit=10.0 2023-10-13 03:26:51,920 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.95 vs. limit=22.5 2023-10-13 03:27:01,427 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.28 vs. limit=15.0 2023-10-13 03:27:08,282 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.71 vs. limit=10.0 2023-10-13 03:27:42,788 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1264741.3333333333, ans=0.125 2023-10-13 03:27:42,963 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=6.25 vs. limit=15.0 2023-10-13 03:28:15,367 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1264834.6666666667, ans=0.0 2023-10-13 03:28:24,064 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.27 vs. limit=15.0 2023-10-13 03:28:26,373 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1264881.3333333333, ans=0.125 2023-10-13 03:28:28,461 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.479e+02 1.854e+02 2.159e+02 2.490e+02 6.222e+02, threshold=4.318e+02, percent-clipped=1.0 2023-10-13 03:28:29,053 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.49 vs. 
limit=15.0 2023-10-13 03:28:43,886 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1264974.6666666667, ans=0.0 2023-10-13 03:28:44,897 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1264974.6666666667, ans=0.125 2023-10-13 03:28:49,617 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=1264974.6666666667, ans=10.0 2023-10-13 03:28:53,760 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.07 vs. limit=15.0 2023-10-13 03:29:15,660 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.52 vs. limit=15.0 2023-10-13 03:29:39,825 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1265161.3333333333, ans=0.125 2023-10-13 03:29:43,615 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1265208.0, ans=0.125 2023-10-13 03:29:53,737 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.43 vs. limit=15.0 2023-10-13 03:30:18,105 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=15.63 vs. limit=22.5 2023-10-13 03:30:22,004 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.804e+02 2.084e+02 2.368e+02 3.797e+02, threshold=4.168e+02, percent-clipped=0.0 2023-10-13 03:30:36,835 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1265394.6666666667, ans=0.0 2023-10-13 03:30:39,342 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1265441.3333333333, ans=0.125 2023-10-13 03:30:42,054 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1265441.3333333333, ans=0.04949747468305833 2023-10-13 03:30:50,776 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1265488.0, ans=0.0 2023-10-13 03:30:53,749 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1265488.0, ans=0.125 2023-10-13 03:31:15,555 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.70 vs. 
limit=15.0 2023-10-13 03:31:47,329 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1265628.0, ans=0.125 2023-10-13 03:32:04,067 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1265721.3333333333, ans=0.1 2023-10-13 03:32:28,048 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1265814.6666666667, ans=0.2 2023-10-13 03:32:35,401 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.519e+02 1.814e+02 1.968e+02 2.114e+02 3.008e+02, threshold=3.936e+02, percent-clipped=0.0 2023-10-13 03:32:43,347 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1265861.3333333333, ans=0.2 2023-10-13 03:32:50,461 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1265861.3333333333, ans=0.125 2023-10-13 03:32:56,023 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1265908.0, ans=0.125 2023-10-13 03:33:36,513 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.15 vs. limit=22.5 2023-10-13 03:33:45,222 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1266094.6666666667, ans=0.1 2023-10-13 03:34:23,018 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.43 vs. limit=12.0 2023-10-13 03:34:29,244 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1266234.6666666667, ans=0.025 2023-10-13 03:34:42,545 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.756e+02 1.921e+02 2.146e+02 4.517e+02, threshold=3.842e+02, percent-clipped=1.0 2023-10-13 03:34:49,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1266328.0, ans=0.125 2023-10-13 03:35:06,475 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1266374.6666666667, ans=0.0 2023-10-13 03:35:09,169 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1266374.6666666667, ans=0.0 2023-10-13 03:35:17,240 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1266421.3333333333, ans=0.125 2023-10-13 03:35:37,159 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1266514.6666666667, ans=0.0 2023-10-13 03:35:49,291 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1266561.3333333333, ans=0.125 2023-10-13 03:35:57,571 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1266561.3333333333, ans=0.0 2023-10-13 03:36:00,520 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.01 vs. 
limit=22.5 2023-10-13 03:36:04,249 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1266608.0, ans=0.125 2023-10-13 03:36:12,935 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1266654.6666666667, ans=0.125 2023-10-13 03:36:16,326 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1266654.6666666667, ans=0.2 2023-10-13 03:36:33,960 INFO [train.py:1031] (0/4) Epoch 20, batch 12000, loss[loss=0.1706, simple_loss=0.268, pruned_loss=0.03656, over 16459.00 frames. ], tot_loss[loss=0.1897, simple_loss=0.2815, pruned_loss=0.04888, over 32748549.02 frames. ], batch size: 50, lr: 1.71e-03, grad_scale: 32.0 2023-10-13 03:36:34,765 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.91 vs. limit=6.0 2023-10-13 03:36:40,695 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.473e+02 1.820e+02 2.004e+02 2.272e+02 2.954e+02, threshold=4.008e+02, percent-clipped=0.0 2023-10-13 03:36:44,019 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1266748.0, ans=0.125 2023-10-13 03:36:52,156 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1266794.6666666667, ans=0.2 2023-10-13 03:36:56,116 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1266794.6666666667, ans=0.125 2023-10-13 03:36:57,988 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1266841.3333333333, ans=0.125 2023-10-13 03:36:59,387 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1266841.3333333333, ans=0.125 2023-10-13 03:37:02,138 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1266841.3333333333, ans=0.125 2023-10-13 03:37:12,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1266888.0, ans=0.125 2023-10-13 03:37:12,889 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.34 vs. limit=15.0 2023-10-13 03:37:36,646 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1266981.3333333333, ans=0.125 2023-10-13 03:37:43,309 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1267028.0, ans=0.0 2023-10-13 03:37:45,962 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1267028.0, ans=0.0 2023-10-13 03:38:05,895 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1267074.6666666667, ans=0.125 2023-10-13 03:38:20,536 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.73 vs. 
limit=15.0 2023-10-13 03:38:36,085 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.727e+02 2.014e+02 2.203e+02 3.512e+02, threshold=4.029e+02, percent-clipped=0.0 2023-10-13 03:38:52,735 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1267308.0, ans=0.125 2023-10-13 03:39:09,176 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1267354.6666666667, ans=0.125 2023-10-13 03:39:15,508 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1267401.3333333333, ans=0.125 2023-10-13 03:39:25,377 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1267448.0, ans=0.0 2023-10-13 03:39:50,309 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0 2023-10-13 03:39:56,512 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff3.min_abs, batch_count=1267541.3333333333, ans=0.2 2023-10-13 03:39:56,527 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1267541.3333333333, ans=0.0 2023-10-13 03:40:01,316 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1267588.0, ans=0.0 2023-10-13 03:40:02,634 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.39 vs. limit=15.0 2023-10-13 03:40:11,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1267588.0, ans=0.125 2023-10-13 03:40:11,702 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.43 vs. limit=5.0 2023-10-13 03:40:25,283 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1267681.3333333333, ans=0.5 2023-10-13 03:40:28,580 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.768e+02 1.921e+02 2.102e+02 2.809e+02, threshold=3.842e+02, percent-clipped=0.0 2023-10-13 03:40:44,468 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.39 vs. limit=15.0 2023-10-13 03:41:06,762 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 03:41:06,963 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1267821.3333333333, ans=0.04949747468305833 2023-10-13 03:41:34,662 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1267961.3333333333, ans=0.0 2023-10-13 03:41:40,636 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.00 vs. 
limit=12.0 2023-10-13 03:41:43,245 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1267961.3333333333, ans=0.0 2023-10-13 03:41:52,251 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.53 vs. limit=15.0 2023-10-13 03:41:58,526 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1268054.6666666667, ans=0.125 2023-10-13 03:42:10,772 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.88 vs. limit=15.0 2023-10-13 03:42:19,638 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1268101.3333333333, ans=0.0 2023-10-13 03:42:26,192 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1268148.0, ans=0.1 2023-10-13 03:42:27,735 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.452e+02 1.780e+02 1.925e+02 2.092e+02 3.220e+02, threshold=3.850e+02, percent-clipped=0.0 2023-10-13 03:42:28,008 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1268148.0, ans=0.0 2023-10-13 03:42:44,780 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1268241.3333333333, ans=0.125 2023-10-13 03:43:01,320 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1268288.0, ans=0.125 2023-10-13 03:43:06,016 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1268334.6666666667, ans=0.0 2023-10-13 03:43:06,031 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1268334.6666666667, ans=0.0 2023-10-13 03:43:44,163 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=1268474.6666666667, ans=0.2 2023-10-13 03:44:09,611 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 03:44:25,745 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.754e+02 1.912e+02 2.107e+02 3.731e+02, threshold=3.825e+02, percent-clipped=0.0 2023-10-13 03:44:28,115 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1268614.6666666667, ans=0.2 2023-10-13 03:44:34,296 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1268661.3333333333, ans=0.2 2023-10-13 03:44:48,459 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1268708.0, ans=0.125 2023-10-13 03:44:53,114 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1268754.6666666667, ans=0.09899494936611666 2023-10-13 03:45:16,339 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1268848.0, ans=0.125 2023-10-13 03:45:18,901 INFO [scaling.py:979] (0/4) Whitening: 
name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.82 vs. limit=15.0 2023-10-13 03:45:22,546 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1268848.0, ans=0.125 2023-10-13 03:45:54,659 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.15 vs. limit=6.0 2023-10-13 03:46:06,809 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1269034.6666666667, ans=0.125 2023-10-13 03:46:14,583 INFO [train.py:1031] (0/4) Epoch 20, batch 12500, loss[loss=0.1925, simple_loss=0.2879, pruned_loss=0.0486, over 16945.00 frames. ], tot_loss[loss=0.1895, simple_loss=0.2811, pruned_loss=0.0489, over 32750383.12 frames. ], batch size: 117, lr: 1.71e-03, grad_scale: 16.0 2023-10-13 03:46:16,857 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1269081.3333333333, ans=0.125 2023-10-13 03:46:23,314 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.777e+02 1.884e+02 2.089e+02 2.813e+02, threshold=3.768e+02, percent-clipped=0.0 2023-10-13 03:46:34,655 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1269128.0, ans=0.1 2023-10-13 03:46:48,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1269221.3333333333, ans=0.125 2023-10-13 03:46:56,369 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1269221.3333333333, ans=0.0 2023-10-13 03:47:04,726 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1269268.0, ans=0.2 2023-10-13 03:47:12,045 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1269314.6666666667, ans=0.07 2023-10-13 03:47:13,045 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-272000.pt 2023-10-13 03:47:17,680 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.37 vs. 
limit=15.0 2023-10-13 03:47:19,324 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1269314.6666666667, ans=0.125 2023-10-13 03:47:22,001 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1269314.6666666667, ans=0.0 2023-10-13 03:47:29,300 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1269361.3333333333, ans=0.125 2023-10-13 03:48:03,746 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1269501.3333333333, ans=0.125 2023-10-13 03:48:06,722 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1269501.3333333333, ans=0.125 2023-10-13 03:48:18,968 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.512e+02 1.776e+02 2.009e+02 2.261e+02 3.564e+02, threshold=4.018e+02, percent-clipped=0.0 2023-10-13 03:48:19,739 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.02 vs. limit=6.0 2023-10-13 03:48:24,284 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 03:48:27,446 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1269594.6666666667, ans=0.07 2023-10-13 03:48:33,609 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 03:48:57,364 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1269688.0, ans=0.1 2023-10-13 03:49:10,511 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1269734.6666666667, ans=0.1 2023-10-13 03:49:25,482 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.61 vs. limit=22.5 2023-10-13 03:49:28,104 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1269828.0, ans=0.0 2023-10-13 03:49:38,851 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1269874.6666666667, ans=0.07 2023-10-13 03:49:39,272 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.94 vs. limit=15.0 2023-10-13 03:50:19,418 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.780e+02 1.934e+02 2.131e+02 2.836e+02, threshold=3.868e+02, percent-clipped=0.0 2023-10-13 03:50:45,025 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1270154.6666666667, ans=0.09899494936611666 2023-10-13 03:51:01,294 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1270201.3333333333, ans=0.125 2023-10-13 03:51:16,281 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.91 vs. 
limit=22.5 2023-10-13 03:51:38,801 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1270341.3333333333, ans=0.125 2023-10-13 03:51:49,549 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1270388.0, ans=0.015 2023-10-13 03:52:08,404 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.29 vs. limit=15.0 2023-10-13 03:52:16,716 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.828e+02 2.012e+02 2.239e+02 3.274e+02, threshold=4.023e+02, percent-clipped=0.0 2023-10-13 03:52:20,993 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1270528.0, ans=0.1 2023-10-13 03:52:27,019 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1270528.0, ans=0.2 2023-10-13 03:52:32,284 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1270574.6666666667, ans=0.125 2023-10-13 03:52:41,327 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1270574.6666666667, ans=15.0 2023-10-13 03:52:42,210 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1270621.3333333333, ans=0.1 2023-10-13 03:52:46,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1270621.3333333333, ans=0.2 2023-10-13 03:52:46,983 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=7.71 vs. limit=22.5 2023-10-13 03:52:53,050 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1270621.3333333333, ans=0.0 2023-10-13 03:52:59,899 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.83 vs. limit=15.0 2023-10-13 03:53:04,238 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.00 vs. limit=15.0 2023-10-13 03:53:18,298 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=9.59 vs. 
limit=22.5 2023-10-13 03:53:21,122 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1270761.3333333333, ans=0.0 2023-10-13 03:54:16,153 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.752e+02 1.880e+02 2.181e+02 2.887e+02, threshold=3.760e+02, percent-clipped=0.0 2023-10-13 03:54:16,369 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1270948.0, ans=0.125 2023-10-13 03:54:17,291 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1270948.0, ans=0.0 2023-10-13 03:54:41,637 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1271041.3333333333, ans=0.1 2023-10-13 03:54:43,637 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1271088.0, ans=0.0 2023-10-13 03:54:51,616 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1271088.0, ans=0.1 2023-10-13 03:54:53,331 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1271088.0, ans=0.1 2023-10-13 03:55:00,796 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1271134.6666666667, ans=0.2 2023-10-13 03:55:03,459 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1271134.6666666667, ans=0.09899494936611666 2023-10-13 03:55:18,881 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1271181.3333333333, ans=0.1 2023-10-13 03:55:19,799 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1271181.3333333333, ans=0.125 2023-10-13 03:55:29,372 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1271228.0, ans=0.125 2023-10-13 03:55:31,421 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.90 vs. limit=15.0 2023-10-13 03:55:38,196 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1271274.6666666667, ans=0.125 2023-10-13 03:55:39,170 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1271274.6666666667, ans=0.5 2023-10-13 03:55:45,459 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.63 vs. limit=15.0 2023-10-13 03:55:54,372 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.86 vs. limit=15.0 2023-10-13 03:56:00,340 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1271368.0, ans=0.125 2023-10-13 03:56:07,262 INFO [train.py:1031] (0/4) Epoch 20, batch 13000, loss[loss=0.2014, simple_loss=0.2919, pruned_loss=0.05551, over 17011.00 frames. 
], tot_loss[loss=0.1898, simple_loss=0.2817, pruned_loss=0.04894, over 32768077.71 frames. ], batch size: 117, lr: 1.71e-03, grad_scale: 32.0 2023-10-13 03:56:14,000 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1271414.6666666667, ans=0.0 2023-10-13 03:56:14,784 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.786e+02 1.965e+02 2.254e+02 3.104e+02, threshold=3.929e+02, percent-clipped=0.0 2023-10-13 03:56:32,062 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1271461.3333333333, ans=0.125 2023-10-13 03:56:36,391 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1271508.0, ans=0.125 2023-10-13 03:56:49,768 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1271554.6666666667, ans=0.125 2023-10-13 03:56:52,313 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1271554.6666666667, ans=0.0 2023-10-13 03:56:54,248 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.11 vs. limit=12.0 2023-10-13 03:56:54,873 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1271554.6666666667, ans=0.0 2023-10-13 03:57:07,285 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.41 vs. limit=15.0 2023-10-13 03:57:11,565 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1271648.0, ans=0.125 2023-10-13 03:57:43,173 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1271741.3333333333, ans=0.09899494936611666 2023-10-13 03:57:58,589 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.38 vs. limit=12.0 2023-10-13 03:58:02,611 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.59 vs. limit=15.0 2023-10-13 03:58:22,814 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.414e+02 1.762e+02 1.942e+02 2.238e+02 3.038e+02, threshold=3.884e+02, percent-clipped=0.0 2023-10-13 03:58:23,566 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.05 vs. limit=6.0 2023-10-13 03:58:23,566 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1271881.3333333333, ans=6.0 2023-10-13 03:58:27,388 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.62 vs. limit=15.0 2023-10-13 03:58:28,302 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.66 vs. 
limit=15.0 2023-10-13 03:58:35,243 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1271928.0, ans=0.125 2023-10-13 03:58:46,253 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.02 vs. limit=15.0 2023-10-13 03:59:01,631 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.96 vs. limit=15.0 2023-10-13 03:59:10,775 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1272068.0, ans=0.0 2023-10-13 03:59:15,234 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1272114.6666666667, ans=0.04949747468305833 2023-10-13 03:59:22,344 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1272114.6666666667, ans=0.0 2023-10-13 03:59:31,016 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.71 vs. limit=22.5 2023-10-13 03:59:40,369 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1272208.0, ans=0.1 2023-10-13 03:59:41,110 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1272208.0, ans=0.1 2023-10-13 04:00:07,578 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1272301.3333333333, ans=0.125 2023-10-13 04:00:14,665 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1272348.0, ans=0.125 2023-10-13 04:00:23,773 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.803e+02 1.963e+02 2.310e+02 3.342e+02, threshold=3.927e+02, percent-clipped=0.0 2023-10-13 04:00:29,860 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.36 vs. limit=15.0 2023-10-13 04:00:36,729 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.20 vs. 
limit=15.0 2023-10-13 04:01:16,777 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1272581.3333333333, ans=0.125 2023-10-13 04:01:29,892 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1272628.0, ans=0.1 2023-10-13 04:01:48,351 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1272674.6666666667, ans=0.125 2023-10-13 04:01:50,936 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1272721.3333333333, ans=0.1 2023-10-13 04:01:51,907 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1272721.3333333333, ans=0.125 2023-10-13 04:01:52,016 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1272721.3333333333, ans=0.125 2023-10-13 04:02:16,929 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1272814.6666666667, ans=0.1 2023-10-13 04:02:16,958 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1272814.6666666667, ans=0.0 2023-10-13 04:02:24,042 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.451e+02 1.833e+02 2.095e+02 2.469e+02 3.464e+02, threshold=4.189e+02, percent-clipped=0.0 2023-10-13 04:03:03,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1273001.3333333333, ans=0.125 2023-10-13 04:03:29,635 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1273094.6666666667, ans=0.0 2023-10-13 04:03:42,436 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1273141.3333333333, ans=0.0 2023-10-13 04:03:43,840 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=1273141.3333333333, ans=0.025 2023-10-13 04:04:04,067 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.28 vs. limit=22.5 2023-10-13 04:04:21,588 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.808e+02 1.946e+02 2.168e+02 2.800e+02, threshold=3.893e+02, percent-clipped=0.0 2023-10-13 04:04:22,033 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1273328.0, ans=0.2 2023-10-13 04:04:48,611 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.39 vs. limit=15.0 2023-10-13 04:04:56,617 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 04:04:56,910 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.15 vs. 
limit=22.5 2023-10-13 04:05:21,496 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1273561.3333333333, ans=0.125 2023-10-13 04:05:36,012 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.45 vs. limit=15.0 2023-10-13 04:06:06,133 INFO [train.py:1031] (0/4) Epoch 20, batch 13500, loss[loss=0.22, simple_loss=0.3038, pruned_loss=0.06806, over 16022.00 frames. ], tot_loss[loss=0.1895, simple_loss=0.2812, pruned_loss=0.0489, over 32767853.38 frames. ], batch size: 296, lr: 1.71e-03, grad_scale: 16.0 2023-10-13 04:06:15,996 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.737e+02 1.875e+02 2.025e+02 2.814e+02, threshold=3.750e+02, percent-clipped=0.0 2023-10-13 04:06:21,489 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1273794.6666666667, ans=0.0 2023-10-13 04:06:33,321 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1273841.3333333333, ans=0.2 2023-10-13 04:06:43,668 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1273888.0, ans=0.1 2023-10-13 04:06:54,755 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.84 vs. limit=15.0 2023-10-13 04:07:09,037 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1273981.3333333333, ans=0.125 2023-10-13 04:07:15,275 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.09 vs. limit=12.0 2023-10-13 04:07:23,167 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1274028.0, ans=0.125 2023-10-13 04:07:23,271 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1274028.0, ans=0.2 2023-10-13 04:07:57,369 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1274168.0, ans=0.0 2023-10-13 04:07:57,682 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.79 vs. limit=12.0 2023-10-13 04:08:09,868 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.60 vs. limit=6.0 2023-10-13 04:08:16,494 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.871e+02 2.015e+02 2.247e+02 3.274e+02, threshold=4.030e+02, percent-clipped=0.0 2023-10-13 04:08:19,040 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1274261.3333333333, ans=0.1 2023-10-13 04:08:19,251 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.90 vs. 
limit=15.0 2023-10-13 04:08:23,730 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1274261.3333333333, ans=0.09899494936611666 2023-10-13 04:08:33,144 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1274308.0, ans=0.07 2023-10-13 04:08:34,243 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.96 vs. limit=15.0 2023-10-13 04:08:41,345 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1274354.6666666667, ans=0.125 2023-10-13 04:08:45,793 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1274354.6666666667, ans=0.1 2023-10-13 04:08:48,314 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.44 vs. limit=22.5 2023-10-13 04:09:05,118 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/epoch-20.pt 2023-10-13 04:09:36,828 INFO [train.py:1031] (0/4) Epoch 21, batch 0, loss[loss=0.1555, simple_loss=0.2508, pruned_loss=0.0301, over 16084.00 frames. ], tot_loss[loss=0.1555, simple_loss=0.2508, pruned_loss=0.0301, over 16084.00 frames. ], batch size: 43, lr: 1.67e-03, grad_scale: 32.0 2023-10-13 04:09:36,830 INFO [train.py:1054] (0/4) Computing validation loss 2023-10-13 04:09:46,465 INFO [train.py:1063] (0/4) Epoch 21, validation: loss=0.2147, simple_loss=0.3014, pruned_loss=0.06396, over 1020973.00 frames. 2023-10-13 04:09:46,467 INFO [train.py:1064] (0/4) Maximum memory allocated so far is 17165MB 2023-10-13 04:10:08,927 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 04:10:11,457 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1274564.6666666667, ans=0.125 2023-10-13 04:10:20,872 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.96 vs. 
limit=10.0 2023-10-13 04:10:21,999 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1274611.3333333333, ans=0.0 2023-10-13 04:10:29,274 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1274611.3333333333, ans=0.125 2023-10-13 04:10:40,049 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1274658.0, ans=0.0 2023-10-13 04:10:40,954 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1274658.0, ans=0.5 2023-10-13 04:10:50,813 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1274704.6666666667, ans=0.0 2023-10-13 04:10:52,995 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.345e+02 1.778e+02 1.911e+02 2.109e+02 4.365e+02, threshold=3.822e+02, percent-clipped=1.0 2023-10-13 04:10:53,376 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1274704.6666666667, ans=0.1 2023-10-13 04:10:54,934 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1274704.6666666667, ans=0.125 2023-10-13 04:11:05,418 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.82 vs. limit=6.0 2023-10-13 04:11:09,240 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1274751.3333333333, ans=0.125 2023-10-13 04:11:23,784 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 04:11:28,660 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1274844.6666666667, ans=0.125 2023-10-13 04:11:39,735 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1274891.3333333333, ans=0.0 2023-10-13 04:12:03,869 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.54 vs. 
limit=22.5 2023-10-13 04:12:30,938 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1275078.0, ans=0.1 2023-10-13 04:12:32,333 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1275078.0, ans=0.1 2023-10-13 04:12:36,030 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1275124.6666666667, ans=0.125 2023-10-13 04:12:37,893 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 04:12:45,921 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1275124.6666666667, ans=0.125 2023-10-13 04:12:49,293 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1275171.3333333333, ans=0.125 2023-10-13 04:12:51,947 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.709e+02 1.856e+02 1.998e+02 2.635e+02, threshold=3.712e+02, percent-clipped=0.0 2023-10-13 04:13:12,144 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1275264.6666666667, ans=0.125 2023-10-13 04:13:35,927 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1275358.0, ans=0.05 2023-10-13 04:13:46,523 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1275404.6666666667, ans=0.0 2023-10-13 04:14:18,432 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1275544.6666666667, ans=0.1 2023-10-13 04:14:47,462 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1275638.0, ans=0.025 2023-10-13 04:14:48,371 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1275638.0, ans=0.125 2023-10-13 04:14:53,311 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.757e+02 1.953e+02 2.230e+02 3.266e+02, threshold=3.906e+02, percent-clipped=0.0 2023-10-13 04:14:57,725 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1275684.6666666667, ans=0.125 2023-10-13 04:15:00,512 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.16 vs. limit=15.0 2023-10-13 04:15:09,110 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.59 vs. limit=6.0 2023-10-13 04:15:13,363 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1275731.3333333333, ans=0.1 2023-10-13 04:15:13,719 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.80 vs. 
limit=22.5 2023-10-13 04:15:19,160 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 04:16:12,817 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1275964.6666666667, ans=0.125 2023-10-13 04:16:14,004 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1275964.6666666667, ans=0.0 2023-10-13 04:16:16,758 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1275964.6666666667, ans=0.0 2023-10-13 04:16:22,916 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1275964.6666666667, ans=15.0 2023-10-13 04:16:33,506 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1276011.3333333333, ans=0.1 2023-10-13 04:16:33,638 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.04 vs. limit=15.0 2023-10-13 04:16:53,668 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.565e+02 1.746e+02 1.936e+02 2.165e+02 2.957e+02, threshold=3.871e+02, percent-clipped=0.0 2023-10-13 04:16:58,655 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1276151.3333333333, ans=0.125 2023-10-13 04:17:34,860 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1276291.3333333333, ans=0.125 2023-10-13 04:17:37,205 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1276291.3333333333, ans=0.2 2023-10-13 04:17:51,847 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 04:17:51,893 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1276338.0, ans=0.0 2023-10-13 04:18:02,582 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1276384.6666666667, ans=0.125 2023-10-13 04:18:16,216 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1276431.3333333333, ans=0.1 2023-10-13 04:18:24,079 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.36 vs. 
limit=6.0 2023-10-13 04:18:45,711 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1276571.3333333333, ans=0.1 2023-10-13 04:18:53,316 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 1.858e+02 2.031e+02 2.356e+02 2.993e+02, threshold=4.062e+02, percent-clipped=0.0 2023-10-13 04:18:53,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1276571.3333333333, ans=0.125 2023-10-13 04:19:07,935 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1276618.0, ans=0.125 2023-10-13 04:19:12,671 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 04:19:12,688 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1276664.6666666667, ans=0.125 2023-10-13 04:19:14,828 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1276664.6666666667, ans=0.2 2023-10-13 04:19:31,174 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1276711.3333333333, ans=0.07 2023-10-13 04:19:44,559 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1276758.0, ans=0.1 2023-10-13 04:19:44,564 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1276758.0, ans=0.1 2023-10-13 04:19:47,566 INFO [train.py:1031] (0/4) Epoch 21, batch 500, loss[loss=0.1853, simple_loss=0.2724, pruned_loss=0.04906, over 16935.00 frames. ], tot_loss[loss=0.1905, simple_loss=0.2815, pruned_loss=0.04975, over 7287397.64 frames. ], batch size: 138, lr: 1.66e-03, grad_scale: 16.0 2023-10-13 04:20:10,283 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1276898.0, ans=0.125 2023-10-13 04:20:14,200 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1276898.0, ans=0.025 2023-10-13 04:20:18,516 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1276898.0, ans=0.2 2023-10-13 04:20:31,294 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1276944.6666666667, ans=0.2 2023-10-13 04:20:38,630 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. 
limit=6.0 2023-10-13 04:20:53,746 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.390e+02 1.756e+02 1.988e+02 2.304e+02 3.336e+02, threshold=3.975e+02, percent-clipped=0.0 2023-10-13 04:20:56,071 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1277038.0, ans=0.04949747468305833 2023-10-13 04:21:09,818 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 04:21:10,871 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1277131.3333333333, ans=0.125 2023-10-13 04:21:19,985 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1277131.3333333333, ans=0.0 2023-10-13 04:21:52,844 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1277271.3333333333, ans=0.0 2023-10-13 04:22:13,484 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1277364.6666666667, ans=0.125 2023-10-13 04:22:14,777 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1277364.6666666667, ans=0.0 2023-10-13 04:22:24,581 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.19 vs. limit=15.0 2023-10-13 04:22:28,911 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1277411.3333333333, ans=15.0 2023-10-13 04:22:54,080 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.875e+02 2.120e+02 2.491e+02 4.231e+02, threshold=4.239e+02, percent-clipped=1.0 2023-10-13 04:23:13,979 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.23 vs. limit=15.0 2023-10-13 04:23:37,636 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1277691.3333333333, ans=0.125 2023-10-13 04:23:45,310 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1277738.0, ans=0.125 2023-10-13 04:23:45,516 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.76 vs. 
limit=22.5 2023-10-13 04:23:53,660 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1277738.0, ans=0.125 2023-10-13 04:23:55,986 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1277784.6666666667, ans=0.1 2023-10-13 04:23:57,582 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1277784.6666666667, ans=0.0 2023-10-13 04:24:06,420 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1277831.3333333333, ans=0.0 2023-10-13 04:24:10,325 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1277831.3333333333, ans=0.125 2023-10-13 04:24:26,985 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=1277878.0, ans=0.05 2023-10-13 04:24:30,666 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.87 vs. limit=6.0 2023-10-13 04:24:54,420 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.835e+02 1.986e+02 2.230e+02 3.170e+02, threshold=3.972e+02, percent-clipped=0.0 2023-10-13 04:25:04,179 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1278018.0, ans=0.1 2023-10-13 04:25:24,135 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.26 vs. limit=22.5 2023-10-13 04:25:30,221 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1278158.0, ans=0.125 2023-10-13 04:26:36,395 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.59 vs. limit=6.0 2023-10-13 04:26:36,998 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 04:26:39,199 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.56 vs. limit=10.0 2023-10-13 04:26:44,652 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.29 vs. limit=15.0 2023-10-13 04:26:45,253 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1278391.3333333333, ans=0.0 2023-10-13 04:26:51,722 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1278438.0, ans=0.1 2023-10-13 04:26:53,551 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.516e+02 1.732e+02 1.908e+02 2.131e+02 2.938e+02, threshold=3.816e+02, percent-clipped=0.0 2023-10-13 04:26:53,872 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1278438.0, ans=0.0 2023-10-13 04:27:03,862 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.43 vs. 
limit=15.0 2023-10-13 04:27:12,243 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1278531.3333333333, ans=0.2 2023-10-13 04:27:38,043 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1278624.6666666667, ans=0.125 2023-10-13 04:27:46,961 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1278671.3333333333, ans=0.0 2023-10-13 04:27:47,020 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1278671.3333333333, ans=0.2 2023-10-13 04:27:58,875 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1278718.0, ans=0.125 2023-10-13 04:28:02,901 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1278718.0, ans=0.125 2023-10-13 04:28:05,936 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1278718.0, ans=0.0 2023-10-13 04:28:10,817 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1278764.6666666667, ans=0.125 2023-10-13 04:28:31,425 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1278811.3333333333, ans=0.2 2023-10-13 04:28:31,742 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.09 vs. limit=15.0 2023-10-13 04:28:45,157 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1278858.0, ans=0.125 2023-10-13 04:28:46,290 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1278858.0, ans=0.5 2023-10-13 04:28:46,566 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.35 vs. limit=6.0 2023-10-13 04:28:50,388 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.24 vs. limit=15.0 2023-10-13 04:28:58,085 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.776e+02 1.919e+02 2.127e+02 2.738e+02, threshold=3.837e+02, percent-clipped=0.0 2023-10-13 04:29:15,244 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1278998.0, ans=0.125 2023-10-13 04:29:30,908 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.91 vs. limit=15.0 2023-10-13 04:29:47,168 INFO [train.py:1031] (0/4) Epoch 21, batch 1000, loss[loss=0.1908, simple_loss=0.2846, pruned_loss=0.04848, over 16140.00 frames. ], tot_loss[loss=0.1909, simple_loss=0.2823, pruned_loss=0.04977, over 12934735.43 frames. 
], batch size: 43, lr: 1.66e-03, grad_scale: 16.0 2023-10-13 04:30:21,865 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1279278.0, ans=0.125 2023-10-13 04:30:26,853 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1279278.0, ans=0.125 2023-10-13 04:30:52,515 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.791e+02 1.992e+02 2.270e+02 3.268e+02, threshold=3.983e+02, percent-clipped=0.0 2023-10-13 04:31:14,751 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.30 vs. limit=22.5 2023-10-13 04:31:34,356 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1279511.3333333333, ans=0.025 2023-10-13 04:31:37,980 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1279558.0, ans=0.125 2023-10-13 04:32:10,561 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1279651.3333333333, ans=0.1 2023-10-13 04:32:23,790 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.20 vs. limit=6.0 2023-10-13 04:33:01,113 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.441e+02 1.739e+02 1.878e+02 2.081e+02 3.347e+02, threshold=3.755e+02, percent-clipped=0.0 2023-10-13 04:33:07,212 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1279884.6666666667, ans=0.1 2023-10-13 04:33:10,430 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1279884.6666666667, ans=0.1 2023-10-13 04:33:13,363 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1279884.6666666667, ans=0.125 2023-10-13 04:33:13,505 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1279884.6666666667, ans=0.125 2023-10-13 04:33:16,576 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1279884.6666666667, ans=0.2 2023-10-13 04:33:26,393 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.31 vs. 
limit=15.0 2023-10-13 04:33:38,376 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1279978.0, ans=0.1 2023-10-13 04:33:59,988 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1280071.3333333333, ans=0.0 2023-10-13 04:34:19,156 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1280118.0, ans=0.05 2023-10-13 04:34:20,165 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1280118.0, ans=0.125 2023-10-13 04:34:25,205 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1280164.6666666667, ans=0.1 2023-10-13 04:34:37,818 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1280211.3333333333, ans=0.2 2023-10-13 04:34:51,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1280258.0, ans=0.09899494936611666 2023-10-13 04:35:01,529 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 04:35:07,917 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.671e+02 1.804e+02 1.970e+02 2.415e+02, threshold=3.609e+02, percent-clipped=0.0 2023-10-13 04:35:16,063 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1280351.3333333333, ans=0.2 2023-10-13 04:35:39,696 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1280444.6666666667, ans=6.0 2023-10-13 04:36:08,924 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1280584.6666666667, ans=0.1 2023-10-13 04:36:12,827 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1280584.6666666667, ans=0.1 2023-10-13 04:36:20,511 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1280631.3333333333, ans=0.125 2023-10-13 04:36:20,568 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1280631.3333333333, ans=0.125 2023-10-13 04:36:42,142 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.59 vs. limit=15.0 2023-10-13 04:37:04,304 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.349e+02 1.753e+02 1.910e+02 2.196e+02 3.250e+02, threshold=3.819e+02, percent-clipped=0.0 2023-10-13 04:37:06,682 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1280818.0, ans=0.04949747468305833 2023-10-13 04:37:26,889 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.52 vs. 
limit=22.5 2023-10-13 04:37:35,833 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1280911.3333333333, ans=0.125 2023-10-13 04:37:45,899 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2.whitening_limit, batch_count=1280958.0, ans=15.0 2023-10-13 04:38:02,928 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 04:38:08,518 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1281051.3333333333, ans=0.0 2023-10-13 04:38:56,363 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1281191.3333333333, ans=0.125 2023-10-13 04:39:00,550 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 04:39:08,558 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.727e+02 1.845e+02 2.055e+02 2.595e+02, threshold=3.690e+02, percent-clipped=0.0 2023-10-13 04:39:24,533 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.50 vs. limit=10.0 2023-10-13 04:39:30,899 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1281331.3333333333, ans=10.0 2023-10-13 04:39:31,901 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1281331.3333333333, ans=0.1 2023-10-13 04:40:04,098 INFO [train.py:1031] (0/4) Epoch 21, batch 1500, loss[loss=0.1767, simple_loss=0.2745, pruned_loss=0.0394, over 16874.00 frames. ], tot_loss[loss=0.1893, simple_loss=0.2807, pruned_loss=0.049, over 17339309.54 frames. ], batch size: 155, lr: 1.66e-03, grad_scale: 16.0 2023-10-13 04:40:19,386 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 04:40:21,301 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1281518.0, ans=0.2 2023-10-13 04:40:22,548 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=1281518.0, ans=15.0 2023-10-13 04:40:23,408 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1281518.0, ans=0.0 2023-10-13 04:40:43,714 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.67 vs. 
limit=15.0 2023-10-13 04:40:47,497 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1281611.3333333333, ans=0.0 2023-10-13 04:40:59,332 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1281658.0, ans=0.125 2023-10-13 04:41:05,701 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1281704.6666666667, ans=0.04949747468305833 2023-10-13 04:41:07,396 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1281704.6666666667, ans=0.125 2023-10-13 04:41:15,466 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.50 vs. limit=22.5 2023-10-13 04:41:18,725 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.483e+02 1.727e+02 1.896e+02 2.100e+02 2.801e+02, threshold=3.792e+02, percent-clipped=0.0 2023-10-13 04:41:23,359 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1281751.3333333333, ans=0.125 2023-10-13 04:41:24,541 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1281751.3333333333, ans=0.125 2023-10-13 04:41:35,519 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1281798.0, ans=0.0 2023-10-13 04:41:54,765 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.76 vs. limit=22.5 2023-10-13 04:42:10,884 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.48 vs. limit=10.0 2023-10-13 04:42:15,256 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1281938.0, ans=0.125 2023-10-13 04:42:15,266 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1281938.0, ans=0.1 2023-10-13 04:42:24,289 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.54 vs. limit=15.0 2023-10-13 04:42:41,448 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1282031.3333333333, ans=0.125 2023-10-13 04:42:59,970 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.74 vs. 
limit=15.0 2023-10-13 04:43:14,014 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1282124.6666666667, ans=0.2 2023-10-13 04:43:32,091 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.467e+02 1.785e+02 1.977e+02 2.237e+02 3.526e+02, threshold=3.953e+02, percent-clipped=0.0 2023-10-13 04:43:48,454 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1282264.6666666667, ans=0.2 2023-10-13 04:44:12,811 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.20 vs. limit=10.0 2023-10-13 04:44:44,062 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1282451.3333333333, ans=0.0 2023-10-13 04:44:47,863 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1282498.0, ans=0.2 2023-10-13 04:45:13,989 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1282591.3333333333, ans=0.125 2023-10-13 04:45:21,324 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.62 vs. limit=22.5 2023-10-13 04:45:34,827 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.306e+02 1.761e+02 1.901e+02 2.067e+02 2.761e+02, threshold=3.802e+02, percent-clipped=0.0 2023-10-13 04:45:41,902 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1282684.6666666667, ans=0.1 2023-10-13 04:45:47,470 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1282684.6666666667, ans=0.0 2023-10-13 04:46:08,091 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1282778.0, ans=0.0 2023-10-13 04:46:20,336 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1282778.0, ans=0.125 2023-10-13 04:46:28,748 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1282824.6666666667, ans=0.035 2023-10-13 04:46:32,651 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1282824.6666666667, ans=0.125 2023-10-13 04:46:58,608 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1282918.0, ans=0.0 2023-10-13 04:47:20,241 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1283011.3333333333, ans=0.125 2023-10-13 04:47:25,302 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1283011.3333333333, ans=0.2 2023-10-13 04:47:26,205 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1283011.3333333333, ans=0.125 2023-10-13 04:47:35,736 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1283058.0, ans=0.2 2023-10-13 04:47:50,673 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 
1.373e+02 1.738e+02 1.895e+02 2.074e+02 3.093e+02, threshold=3.789e+02, percent-clipped=0.0 2023-10-13 04:47:56,668 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1283151.3333333333, ans=0.125 2023-10-13 04:48:02,189 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1283151.3333333333, ans=0.125 2023-10-13 04:48:16,268 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1283198.0, ans=0.0 2023-10-13 04:48:16,291 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1283198.0, ans=0.125 2023-10-13 04:48:16,334 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1283198.0, ans=0.0 2023-10-13 04:48:24,007 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1283244.6666666667, ans=0.2 2023-10-13 04:49:08,958 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.27 vs. limit=22.5 2023-10-13 04:49:13,737 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-10-13 04:49:54,640 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1283524.6666666667, ans=0.0 2023-10-13 04:50:05,975 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.63 vs. limit=22.5 2023-10-13 04:50:10,403 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.421e+02 1.712e+02 1.899e+02 2.083e+02 2.784e+02, threshold=3.798e+02, percent-clipped=0.0 2023-10-13 04:50:14,810 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1283618.0, ans=0.1 2023-10-13 04:50:41,068 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1283711.3333333333, ans=0.125 2023-10-13 04:51:15,293 INFO [train.py:1031] (0/4) Epoch 21, batch 2000, loss[loss=0.1881, simple_loss=0.2895, pruned_loss=0.04328, over 16638.00 frames. ], tot_loss[loss=0.1899, simple_loss=0.2815, pruned_loss=0.0492, over 20738867.02 frames. ], batch size: 202, lr: 1.66e-03, grad_scale: 32.0 2023-10-13 04:51:28,994 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.56 vs. 
limit=15.0 2023-10-13 04:51:41,310 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1283851.3333333333, ans=0.125 2023-10-13 04:52:02,609 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 04:52:09,887 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1283944.6666666667, ans=0.125 2023-10-13 04:52:41,352 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.736e+02 1.861e+02 2.041e+02 3.044e+02, threshold=3.721e+02, percent-clipped=0.0 2023-10-13 04:52:52,956 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.73 vs. limit=22.5 2023-10-13 04:53:15,629 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1284178.0, ans=0.125 2023-10-13 04:53:20,774 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1284224.6666666667, ans=0.0 2023-10-13 04:54:25,126 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1284364.6666666667, ans=0.0 2023-10-13 04:54:26,953 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1284364.6666666667, ans=0.0 2023-10-13 04:54:27,059 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1284364.6666666667, ans=0.0 2023-10-13 04:55:19,869 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=1284504.6666666667, ans=0.05 2023-10-13 04:55:24,383 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.467e+02 1.764e+02 2.072e+02 2.386e+02 3.198e+02, threshold=4.144e+02, percent-clipped=0.0 2023-10-13 04:55:41,873 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1284551.3333333333, ans=0.125 2023-10-13 04:56:20,071 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1284691.3333333333, ans=0.0 2023-10-13 04:56:34,997 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1284738.0, ans=0.125 2023-10-13 04:56:46,238 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 04:56:47,391 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1284784.6666666667, ans=0.0 2023-10-13 04:56:47,688 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.34 vs. 
limit=12.0 2023-10-13 04:56:48,838 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1284784.6666666667, ans=0.1 2023-10-13 04:56:58,938 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1284831.3333333333, ans=0.125 2023-10-13 04:56:59,881 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1284831.3333333333, ans=0.0 2023-10-13 04:57:11,229 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.66 vs. limit=15.0 2023-10-13 04:57:39,005 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1284971.3333333333, ans=0.2 2023-10-13 04:57:40,770 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.454e+02 1.833e+02 1.992e+02 2.230e+02 2.711e+02, threshold=3.983e+02, percent-clipped=0.0 2023-10-13 04:58:23,404 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1285158.0, ans=0.0 2023-10-13 04:58:29,671 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1285158.0, ans=0.0 2023-10-13 04:58:48,037 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1285204.6666666667, ans=0.125 2023-10-13 04:58:53,662 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.30 vs. limit=15.0 2023-10-13 04:59:12,996 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1285298.0, ans=0.2 2023-10-13 04:59:28,603 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1285344.6666666667, ans=0.1 2023-10-13 04:59:46,232 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1285438.0, ans=0.125 2023-10-13 04:59:48,440 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.55 vs. 
limit=15.0 2023-10-13 04:59:52,237 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.535e+02 1.811e+02 1.973e+02 2.186e+02 2.951e+02, threshold=3.945e+02, percent-clipped=0.0 2023-10-13 04:59:55,508 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1285484.6666666667, ans=0.125 2023-10-13 05:00:43,300 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 05:00:48,856 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 05:00:49,719 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1285624.6666666667, ans=0.125 2023-10-13 05:00:56,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1285671.3333333333, ans=0.0 2023-10-13 05:01:07,235 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1285718.0, ans=0.125 2023-10-13 05:01:40,639 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.76 vs. limit=15.0 2023-10-13 05:01:53,018 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1285858.0, ans=0.125 2023-10-13 05:01:56,982 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1285858.0, ans=0.0 2023-10-13 05:02:11,696 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.778e+02 1.905e+02 2.097e+02 3.671e+02, threshold=3.810e+02, percent-clipped=0.0 2023-10-13 05:02:17,852 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1285951.3333333333, ans=0.125 2023-10-13 05:02:17,958 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1285951.3333333333, ans=0.0 2023-10-13 05:02:18,739 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1285951.3333333333, ans=0.2 2023-10-13 05:02:28,120 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1285998.0, ans=0.2 2023-10-13 05:02:42,353 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1286044.6666666667, ans=0.125 2023-10-13 05:02:46,540 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.40 vs. limit=22.5 2023-10-13 05:02:50,996 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1286091.3333333333, ans=0.0 2023-10-13 05:03:00,611 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1286091.3333333333, ans=0.0 2023-10-13 05:03:06,274 INFO [train.py:1031] (0/4) Epoch 21, batch 2500, loss[loss=0.1724, simple_loss=0.2443, pruned_loss=0.05021, over 12533.00 frames. ], tot_loss[loss=0.19, simple_loss=0.2815, pruned_loss=0.0492, over 23403396.92 frames. 
], batch size: 440, lr: 1.66e-03, grad_scale: 32.0 2023-10-13 05:03:09,664 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.02 vs. limit=15.0 2023-10-13 05:04:03,991 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1286324.6666666667, ans=0.125 2023-10-13 05:04:05,918 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1286324.6666666667, ans=0.2 2023-10-13 05:04:13,511 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1286371.3333333333, ans=0.1 2023-10-13 05:04:24,394 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.787e+02 1.900e+02 2.151e+02 3.018e+02, threshold=3.800e+02, percent-clipped=0.0 2023-10-13 05:04:24,673 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1286371.3333333333, ans=0.0 2023-10-13 05:04:35,933 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.28 vs. limit=15.0 2023-10-13 05:04:38,831 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1286464.6666666667, ans=0.125 2023-10-13 05:04:47,132 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1286464.6666666667, ans=0.125 2023-10-13 05:04:55,264 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.57 vs. limit=22.5 2023-10-13 05:04:55,970 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1286511.3333333333, ans=0.1 2023-10-13 05:05:06,988 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=1286558.0, ans=0.5 2023-10-13 05:05:24,800 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1286604.6666666667, ans=0.125 2023-10-13 05:05:45,308 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1286698.0, ans=0.125 2023-10-13 05:05:45,492 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.52 vs. 
limit=15.0 2023-10-13 05:05:50,090 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1286698.0, ans=0.0 2023-10-13 05:05:54,040 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1286698.0, ans=0.0 2023-10-13 05:06:10,031 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1286791.3333333333, ans=0.07 2023-10-13 05:06:32,965 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.365e+02 1.765e+02 1.934e+02 2.156e+02 2.781e+02, threshold=3.868e+02, percent-clipped=0.0 2023-10-13 05:06:35,209 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1286884.6666666667, ans=0.125 2023-10-13 05:06:38,905 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.87 vs. limit=15.0 2023-10-13 05:06:47,845 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1286931.3333333333, ans=0.0 2023-10-13 05:07:06,241 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1286978.0, ans=0.125 2023-10-13 05:07:06,299 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1286978.0, ans=0.125 2023-10-13 05:07:24,203 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1287071.3333333333, ans=0.125 2023-10-13 05:07:43,128 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1287118.0, ans=0.125 2023-10-13 05:07:48,242 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1287118.0, ans=0.0 2023-10-13 05:07:54,832 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1287118.0, ans=0.05 2023-10-13 05:08:10,021 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1287211.3333333333, ans=0.125 2023-10-13 05:08:11,028 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1287211.3333333333, ans=0.1 2023-10-13 05:08:17,951 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1287211.3333333333, ans=15.0 2023-10-13 05:08:51,567 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1287304.6666666667, ans=0.1 2023-10-13 05:08:52,264 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.521e+02 1.760e+02 1.935e+02 2.149e+02 3.036e+02, threshold=3.870e+02, percent-clipped=0.0 2023-10-13 05:08:57,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_ff3.min_abs, batch_count=1287351.3333333333, ans=0.2 2023-10-13 05:08:58,694 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1287351.3333333333, ans=0.0 2023-10-13 05:09:35,535 
INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.84 vs. limit=12.0 2023-10-13 05:10:28,283 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.84 vs. limit=15.0 2023-10-13 05:10:29,546 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1287631.3333333333, ans=0.0 2023-10-13 05:10:39,428 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1287678.0, ans=0.0 2023-10-13 05:10:39,431 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1287678.0, ans=0.125 2023-10-13 05:11:07,345 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1287724.6666666667, ans=0.0 2023-10-13 05:11:24,884 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.403e+02 1.807e+02 1.980e+02 2.185e+02 3.348e+02, threshold=3.961e+02, percent-clipped=0.0 2023-10-13 05:11:50,219 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=15.0 2023-10-13 05:12:24,385 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1287958.0, ans=0.0 2023-10-13 05:12:35,302 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1288004.6666666667, ans=0.1 2023-10-13 05:12:43,036 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1288004.6666666667, ans=0.0 2023-10-13 05:13:57,024 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.738e+02 1.887e+02 2.157e+02 2.867e+02, threshold=3.774e+02, percent-clipped=0.0 2023-10-13 05:13:57,498 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1288284.6666666667, ans=0.1 2023-10-13 05:14:14,308 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1288331.3333333333, ans=0.2 2023-10-13 05:14:37,230 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1288424.6666666667, ans=0.0 2023-10-13 05:14:49,897 INFO [train.py:1031] (0/4) Epoch 21, batch 3000, loss[loss=0.235, simple_loss=0.2987, pruned_loss=0.08569, over 15694.00 frames. ], tot_loss[loss=0.1896, simple_loss=0.2809, pruned_loss=0.04916, over 25495786.93 frames. 
], batch size: 350, lr: 1.66e-03, grad_scale: 16.0 2023-10-13 05:15:14,875 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1288564.6666666667, ans=0.125 2023-10-13 05:15:33,710 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1288611.3333333333, ans=0.1 2023-10-13 05:15:33,863 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1288611.3333333333, ans=0.05 2023-10-13 05:15:55,820 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1288704.6666666667, ans=0.125 2023-10-13 05:16:06,007 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1288704.6666666667, ans=0.125 2023-10-13 05:16:09,586 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.364e+02 1.818e+02 2.001e+02 2.173e+02 3.215e+02, threshold=4.001e+02, percent-clipped=0.0 2023-10-13 05:16:21,555 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.53 vs. limit=22.5 2023-10-13 05:16:35,048 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1288844.6666666667, ans=0.2 2023-10-13 05:17:28,628 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1288984.6666666667, ans=0.04949747468305833 2023-10-13 05:18:24,384 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.502e+02 1.783e+02 1.987e+02 2.190e+02 2.624e+02, threshold=3.975e+02, percent-clipped=0.0 2023-10-13 05:18:46,285 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1289264.6666666667, ans=0.0 2023-10-13 05:18:48,384 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.86 vs. limit=10.0 2023-10-13 05:19:26,213 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.71 vs. limit=6.0 2023-10-13 05:19:29,297 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1289404.6666666667, ans=0.125 2023-10-13 05:19:31,040 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.06 vs. limit=22.5 2023-10-13 05:19:36,439 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1289451.3333333333, ans=0.0 2023-10-13 05:19:49,751 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.38 vs. 
limit=12.0 2023-10-13 05:20:11,926 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1289544.6666666667, ans=0.0 2023-10-13 05:20:14,624 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1289544.6666666667, ans=0.015 2023-10-13 05:20:14,791 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1289544.6666666667, ans=0.125 2023-10-13 05:20:16,399 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1289544.6666666667, ans=0.025 2023-10-13 05:20:28,427 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1289591.3333333333, ans=0.2 2023-10-13 05:20:31,952 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.45 vs. limit=15.0 2023-10-13 05:21:00,366 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.786e+02 1.918e+02 2.092e+02 3.242e+02, threshold=3.836e+02, percent-clipped=0.0 2023-10-13 05:21:17,730 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 05:22:23,579 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.74 vs. limit=15.0 2023-10-13 05:22:33,168 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1289964.6666666667, ans=0.125 2023-10-13 05:22:37,718 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1289964.6666666667, ans=0.125 2023-10-13 05:22:39,609 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1289964.6666666667, ans=0.0 2023-10-13 05:22:41,456 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1289964.6666666667, ans=0.125 2023-10-13 05:22:51,537 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=9.93 vs. limit=22.5 2023-10-13 05:23:01,632 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1290011.3333333333, ans=0.125 2023-10-13 05:23:22,342 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.57 vs. limit=15.0 2023-10-13 05:23:32,653 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.490e+02 1.783e+02 1.952e+02 2.161e+02 3.717e+02, threshold=3.903e+02, percent-clipped=0.0 2023-10-13 05:23:39,833 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.66 vs. 
limit=10.0 2023-10-13 05:23:53,595 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1290198.0, ans=0.07 2023-10-13 05:24:36,460 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1290338.0, ans=0.125 2023-10-13 05:24:39,250 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1290338.0, ans=0.125 2023-10-13 05:25:01,592 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1290431.3333333333, ans=0.125 2023-10-13 05:25:14,649 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.36 vs. limit=22.5 2023-10-13 05:25:44,656 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=9.65 vs. limit=22.5 2023-10-13 05:25:49,846 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.488e+02 1.773e+02 1.995e+02 2.272e+02 3.151e+02, threshold=3.990e+02, percent-clipped=0.0 2023-10-13 05:25:51,289 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1290618.0, ans=0.125 2023-10-13 05:26:09,337 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1290664.6666666667, ans=0.125 2023-10-13 05:26:44,816 INFO [train.py:1031] (0/4) Epoch 21, batch 3500, loss[loss=0.1867, simple_loss=0.2804, pruned_loss=0.04647, over 16922.00 frames. ], tot_loss[loss=0.1896, simple_loss=0.2808, pruned_loss=0.04925, over 27116304.53 frames. ], batch size: 165, lr: 1.66e-03, grad_scale: 32.0 2023-10-13 05:26:48,567 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.67 vs. limit=22.5 2023-10-13 05:27:07,905 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1290851.3333333333, ans=0.125 2023-10-13 05:27:18,674 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 05:27:24,934 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1290944.6666666667, ans=0.0 2023-10-13 05:27:48,697 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1291038.0, ans=0.125 2023-10-13 05:27:55,794 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.80 vs. 
limit=6.0 2023-10-13 05:28:00,743 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.526e+02 1.754e+02 1.916e+02 2.165e+02 2.867e+02, threshold=3.831e+02, percent-clipped=0.0 2023-10-13 05:28:54,113 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1291224.6666666667, ans=0.125 2023-10-13 05:29:09,655 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1291271.3333333333, ans=0.04949747468305833 2023-10-13 05:29:55,976 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1291411.3333333333, ans=0.2 2023-10-13 05:30:12,935 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1291458.0, ans=0.125 2023-10-13 05:30:43,266 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.428e+02 1.740e+02 1.880e+02 2.019e+02 3.193e+02, threshold=3.761e+02, percent-clipped=0.0 2023-10-13 05:30:51,820 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.39 vs. limit=15.0 2023-10-13 05:31:04,180 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.78 vs. limit=15.0 2023-10-13 05:31:10,402 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1291644.6666666667, ans=0.125 2023-10-13 05:31:19,771 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1291644.6666666667, ans=0.125 2023-10-13 05:32:04,548 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1291784.6666666667, ans=0.1 2023-10-13 05:32:05,773 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.83 vs. limit=15.0 2023-10-13 05:32:10,824 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1291784.6666666667, ans=0.0 2023-10-13 05:32:43,586 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1291878.0, ans=0.125 2023-10-13 05:32:56,475 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1291924.6666666667, ans=0.125 2023-10-13 05:33:02,070 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.57 vs. 
limit=12.0 2023-10-13 05:33:22,707 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.710e+02 1.951e+02 2.172e+02 3.159e+02, threshold=3.902e+02, percent-clipped=0.0 2023-10-13 05:33:33,559 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1292064.6666666667, ans=0.125 2023-10-13 05:33:48,398 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1292111.3333333333, ans=0.125 2023-10-13 05:33:51,214 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1292111.3333333333, ans=0.125 2023-10-13 05:34:03,667 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1292158.0, ans=0.125 2023-10-13 05:34:10,940 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1292158.0, ans=0.1 2023-10-13 05:34:39,697 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.61 vs. limit=15.0 2023-10-13 05:34:51,824 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1292298.0, ans=0.0 2023-10-13 05:34:56,864 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1292344.6666666667, ans=0.1 2023-10-13 05:34:58,372 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.46 vs. limit=12.0 2023-10-13 05:35:03,932 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1292344.6666666667, ans=0.2 2023-10-13 05:35:10,462 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1292391.3333333333, ans=0.0 2023-10-13 05:35:12,771 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=1292391.3333333333, ans=0.025 2023-10-13 05:35:33,460 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.94 vs. limit=15.0 2023-10-13 05:35:38,560 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.389e+02 1.773e+02 1.929e+02 2.163e+02 3.187e+02, threshold=3.857e+02, percent-clipped=0.0 2023-10-13 05:35:54,934 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.30 vs. limit=12.0 2023-10-13 05:35:58,934 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1292531.3333333333, ans=0.1 2023-10-13 05:36:20,418 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.04 vs. 
limit=15.0 2023-10-13 05:36:44,714 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1292718.0, ans=0.0 2023-10-13 05:37:24,659 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1292858.0, ans=0.125 2023-10-13 05:37:42,698 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1292904.6666666667, ans=6.0 2023-10-13 05:37:47,091 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.755e+02 1.939e+02 2.226e+02 3.420e+02, threshold=3.879e+02, percent-clipped=0.0 2023-10-13 05:37:51,293 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.34 vs. limit=22.5 2023-10-13 05:38:10,844 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1292998.0, ans=0.125 2023-10-13 05:38:20,976 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1293044.6666666667, ans=0.125 2023-10-13 05:38:24,159 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1293044.6666666667, ans=0.0 2023-10-13 05:38:26,796 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1293091.3333333333, ans=0.04949747468305833 2023-10-13 05:38:43,490 INFO [train.py:1031] (0/4) Epoch 21, batch 4000, loss[loss=0.1773, simple_loss=0.2733, pruned_loss=0.04062, over 16882.00 frames. ], tot_loss[loss=0.1895, simple_loss=0.2804, pruned_loss=0.04924, over 28387339.93 frames. ], batch size: 130, lr: 1.65e-03, grad_scale: 32.0 2023-10-13 05:38:54,339 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1293138.0, ans=0.2 2023-10-13 05:39:00,380 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.21 vs. 
limit=15.0 2023-10-13 05:39:06,330 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1293184.6666666667, ans=0.1 2023-10-13 05:39:18,679 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1293231.3333333333, ans=0.125 2023-10-13 05:39:21,807 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1293231.3333333333, ans=0.125 2023-10-13 05:39:47,365 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1293324.6666666667, ans=0.125 2023-10-13 05:39:51,033 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1293371.3333333333, ans=0.125 2023-10-13 05:39:56,863 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1293371.3333333333, ans=0.125 2023-10-13 05:40:02,985 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1293371.3333333333, ans=0.1 2023-10-13 05:40:10,960 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.587e+02 1.804e+02 1.922e+02 2.196e+02 3.017e+02, threshold=3.844e+02, percent-clipped=0.0 2023-10-13 05:40:15,846 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.07 vs. limit=22.5 2023-10-13 05:40:53,963 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.22 vs. limit=15.0 2023-10-13 05:41:09,456 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1293604.6666666667, ans=0.125 2023-10-13 05:41:14,224 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1293604.6666666667, ans=0.125 2023-10-13 05:41:16,154 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1293604.6666666667, ans=0.0 2023-10-13 05:41:38,449 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1293698.0, ans=0.0 2023-10-13 05:41:44,520 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.86 vs. limit=15.0 2023-10-13 05:42:03,461 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.67 vs. 
limit=15.0 2023-10-13 05:42:36,781 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 1.826e+02 2.030e+02 2.272e+02 3.180e+02, threshold=4.060e+02, percent-clipped=0.0 2023-10-13 05:43:13,121 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1293978.0, ans=0.95 2023-10-13 05:43:25,389 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1294024.6666666667, ans=0.0 2023-10-13 05:43:43,639 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1294071.3333333333, ans=0.125 2023-10-13 05:43:48,962 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=1294071.3333333333, ans=0.05 2023-10-13 05:43:53,798 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1294118.0, ans=0.0 2023-10-13 05:44:01,532 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1294118.0, ans=0.0 2023-10-13 05:44:34,836 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1294211.3333333333, ans=0.1 2023-10-13 05:44:53,189 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1294304.6666666667, ans=0.015 2023-10-13 05:44:56,848 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.85 vs. limit=15.0 2023-10-13 05:44:58,705 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1294304.6666666667, ans=0.2 2023-10-13 05:45:07,995 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.88 vs. limit=6.0 2023-10-13 05:45:08,904 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.744e+02 1.978e+02 2.232e+02 3.307e+02, threshold=3.957e+02, percent-clipped=0.0 2023-10-13 05:45:13,397 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.24 vs. 
limit=15.0 2023-10-13 05:45:20,394 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1294398.0, ans=0.125 2023-10-13 05:45:32,648 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1294444.6666666667, ans=0.125 2023-10-13 05:45:46,243 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 05:45:48,190 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=1294491.3333333333, ans=0.025 2023-10-13 05:45:48,239 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1294491.3333333333, ans=0.125 2023-10-13 05:46:02,340 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1294538.0, ans=0.125 2023-10-13 05:46:08,052 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 05:46:19,690 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1294584.6666666667, ans=0.125 2023-10-13 05:46:20,745 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1294584.6666666667, ans=0.125 2023-10-13 05:46:21,946 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.40 vs. limit=12.0 2023-10-13 05:46:22,779 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 05:46:24,090 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.11 vs. limit=15.0 2023-10-13 05:46:36,725 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1294678.0, ans=0.1 2023-10-13 05:47:06,496 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1294771.3333333333, ans=0.2 2023-10-13 05:47:10,829 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.89 vs. 
limit=15.0 2023-10-13 05:47:20,586 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.499e+02 1.881e+02 2.068e+02 2.321e+02 3.179e+02, threshold=4.137e+02, percent-clipped=0.0 2023-10-13 05:47:21,368 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1294818.0, ans=0.1 2023-10-13 05:47:25,492 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1294818.0, ans=0.125 2023-10-13 05:47:29,241 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1294818.0, ans=0.0 2023-10-13 05:47:33,804 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1294864.6666666667, ans=0.125 2023-10-13 05:48:11,948 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.49 vs. limit=10.0 2023-10-13 05:48:17,802 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1295004.6666666667, ans=0.125 2023-10-13 05:48:24,539 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1295051.3333333333, ans=10.0 2023-10-13 05:48:41,005 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1295098.0, ans=0.125 2023-10-13 05:48:46,446 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1295098.0, ans=0.2 2023-10-13 05:49:12,576 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1295191.3333333333, ans=0.1 2023-10-13 05:49:17,507 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1295191.3333333333, ans=0.05 2023-10-13 05:49:41,138 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=15.16 vs. limit=22.5 2023-10-13 05:49:46,509 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.601e+02 1.867e+02 2.052e+02 2.160e+02 3.021e+02, threshold=4.103e+02, percent-clipped=0.0 2023-10-13 05:49:57,783 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1295331.3333333333, ans=0.1 2023-10-13 05:50:27,790 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1295424.6666666667, ans=0.2 2023-10-13 05:50:34,237 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1295424.6666666667, ans=0.125 2023-10-13 05:50:44,219 INFO [train.py:1031] (0/4) Epoch 21, batch 4500, loss[loss=0.1926, simple_loss=0.2838, pruned_loss=0.05075, over 16567.00 frames. ], tot_loss[loss=0.1896, simple_loss=0.2809, pruned_loss=0.04913, over 29371264.19 frames. ], batch size: 266, lr: 1.65e-03, grad_scale: 32.0 2023-10-13 05:50:51,428 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.46 vs. 
limit=12.0 2023-10-13 05:50:56,138 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.03 vs. limit=15.0 2023-10-13 05:51:17,933 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1295564.6666666667, ans=0.0 2023-10-13 05:51:21,680 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1295564.6666666667, ans=0.125 2023-10-13 05:51:25,748 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1295564.6666666667, ans=0.125 2023-10-13 05:51:25,899 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.36 vs. limit=10.0 2023-10-13 05:51:51,125 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1295658.0, ans=0.0 2023-10-13 05:52:18,400 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.520e+02 1.737e+02 1.943e+02 2.144e+02 2.579e+02, threshold=3.886e+02, percent-clipped=0.0 2023-10-13 05:52:19,747 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1295751.3333333333, ans=10.0 2023-10-13 05:52:22,643 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1295751.3333333333, ans=0.07 2023-10-13 05:52:24,309 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1295751.3333333333, ans=0.1 2023-10-13 05:52:48,490 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1295844.6666666667, ans=0.2 2023-10-13 05:52:48,612 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1295844.6666666667, ans=0.125 2023-10-13 05:52:57,346 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1295891.3333333333, ans=0.125 2023-10-13 05:52:57,387 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1295891.3333333333, ans=0.2 2023-10-13 05:53:04,641 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=1295938.0, ans=0.05 2023-10-13 05:53:06,256 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1295938.0, ans=0.0 2023-10-13 05:53:23,721 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1295984.6666666667, ans=0.125 2023-10-13 05:53:32,068 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1296031.3333333333, ans=0.2 2023-10-13 05:53:36,133 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1296031.3333333333, ans=0.125 2023-10-13 05:53:40,703 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1296031.3333333333, ans=0.0 2023-10-13 05:53:47,378 INFO [scaling.py:199] (0/4) 
ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1296078.0, ans=0.125 2023-10-13 05:53:57,318 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1296078.0, ans=0.2 2023-10-13 05:53:58,258 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1296124.6666666667, ans=0.1 2023-10-13 05:53:58,264 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1296124.6666666667, ans=0.125 2023-10-13 05:53:59,880 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=1296124.6666666667, ans=0.5 2023-10-13 05:54:22,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff2.min_abs, batch_count=1296218.0, ans=0.1 2023-10-13 05:54:27,695 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.467e+02 1.795e+02 1.940e+02 2.155e+02 2.996e+02, threshold=3.881e+02, percent-clipped=0.0 2023-10-13 05:55:15,792 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1296358.0, ans=0.125 2023-10-13 05:55:18,025 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1296404.6666666667, ans=0.125 2023-10-13 05:55:30,595 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.50 vs. limit=10.0 2023-10-13 05:55:34,650 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.88 vs. limit=10.0 2023-10-13 05:55:39,357 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1296451.3333333333, ans=0.1 2023-10-13 05:55:41,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1296451.3333333333, ans=0.125 2023-10-13 05:56:02,390 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1296498.0, ans=0.1 2023-10-13 05:56:18,572 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-10-13 05:56:23,511 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.17 vs. limit=22.5 2023-10-13 05:56:40,453 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.23 vs. 
limit=15.0 2023-10-13 05:56:50,544 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.467e+02 1.808e+02 1.932e+02 2.125e+02 2.753e+02, threshold=3.864e+02, percent-clipped=0.0 2023-10-13 05:56:50,775 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1296684.6666666667, ans=0.0 2023-10-13 05:56:57,932 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1296731.3333333333, ans=0.1 2023-10-13 05:57:14,740 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1296778.0, ans=0.1 2023-10-13 05:57:20,655 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1296778.0, ans=15.0 2023-10-13 05:57:30,763 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1296824.6666666667, ans=0.0 2023-10-13 05:57:38,377 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1296871.3333333333, ans=0.0 2023-10-13 05:57:55,140 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1296918.0, ans=0.1 2023-10-13 05:58:33,236 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.49 vs. limit=22.5 2023-10-13 05:58:51,463 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 05:59:00,060 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1297104.6666666667, ans=0.2 2023-10-13 05:59:04,353 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1297151.3333333333, ans=0.0 2023-10-13 05:59:04,659 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1297151.3333333333, ans=22.5 2023-10-13 05:59:08,078 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.516e+02 1.739e+02 1.898e+02 2.103e+02 2.651e+02, threshold=3.796e+02, percent-clipped=0.0 2023-10-13 05:59:11,765 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1297151.3333333333, ans=0.125 2023-10-13 05:59:26,642 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1297198.0, ans=0.2 2023-10-13 05:59:40,189 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1297244.6666666667, ans=0.2 2023-10-13 05:59:54,225 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1297291.3333333333, ans=0.0 2023-10-13 06:00:34,173 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1297431.3333333333, ans=0.125 2023-10-13 06:00:35,202 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1297431.3333333333, ans=0.1 2023-10-13 06:01:02,929 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1297478.0, ans=0.1 2023-10-13 06:01:07,719 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=9.57 vs. limit=22.5 2023-10-13 06:01:42,298 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.452e+02 1.754e+02 1.946e+02 2.192e+02 2.891e+02, threshold=3.892e+02, percent-clipped=0.0 2023-10-13 06:01:46,715 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1297618.0, ans=0.125 2023-10-13 06:02:00,713 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1297664.6666666667, ans=0.125 2023-10-13 06:02:04,766 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1297711.3333333333, ans=0.0 2023-10-13 06:02:22,186 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1297758.0, ans=0.1 2023-10-13 06:02:36,576 INFO [train.py:1031] (0/4) Epoch 21, batch 5000, loss[loss=0.1973, simple_loss=0.28, pruned_loss=0.05728, over 15949.00 frames. ], tot_loss[loss=0.1897, simple_loss=0.2809, pruned_loss=0.04926, over 30152478.07 frames. ], batch size: 43, lr: 1.65e-03, grad_scale: 16.0 2023-10-13 06:03:00,731 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1297851.3333333333, ans=0.0 2023-10-13 06:03:24,633 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1297898.0, ans=0.0 2023-10-13 06:03:29,582 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1297944.6666666667, ans=0.0 2023-10-13 06:03:58,102 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.28 vs. limit=6.0 2023-10-13 06:04:21,194 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1298084.6666666667, ans=0.125 2023-10-13 06:04:25,194 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.823e+02 2.046e+02 2.302e+02 3.322e+02, threshold=4.093e+02, percent-clipped=0.0 2023-10-13 06:04:38,737 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1298131.3333333333, ans=0.125 2023-10-13 06:04:55,063 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1298131.3333333333, ans=0.5 2023-10-13 06:04:58,341 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1298178.0, ans=0.2 2023-10-13 06:05:04,172 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1298178.0, ans=0.5 2023-10-13 06:05:12,120 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=4.72 vs. 
limit=15.0 2023-10-13 06:05:14,546 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1298224.6666666667, ans=0.125 2023-10-13 06:05:25,027 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1298224.6666666667, ans=0.07 2023-10-13 06:05:57,830 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1298318.0, ans=0.0 2023-10-13 06:06:17,851 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.71 vs. limit=6.0 2023-10-13 06:06:43,497 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1298458.0, ans=0.125 2023-10-13 06:06:50,668 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1298504.6666666667, ans=0.125 2023-10-13 06:07:07,045 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=6.772e-02 2023-10-13 06:07:11,535 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1298551.3333333333, ans=0.0 2023-10-13 06:07:12,590 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.370e+02 1.795e+02 2.030e+02 2.250e+02 3.958e+02, threshold=4.061e+02, percent-clipped=0.0 2023-10-13 06:07:14,162 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.56 vs. limit=15.0 2023-10-13 06:07:18,074 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn2.whiten.whitening_limit, batch_count=1298551.3333333333, ans=22.5 2023-10-13 06:07:19,157 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.86 vs. limit=12.0 2023-10-13 06:07:37,038 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=5.67 vs. limit=15.0 2023-10-13 06:08:22,601 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1298738.0, ans=0.0 2023-10-13 06:08:30,474 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1298784.6666666667, ans=0.125 2023-10-13 06:08:37,339 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_na.min_abs, batch_count=1298831.3333333333, ans=0.02 2023-10-13 06:09:05,714 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1298878.0, ans=0.0 2023-10-13 06:09:08,569 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.19 vs. 
limit=10.0 2023-10-13 06:09:18,015 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1298924.6666666667, ans=0.125 2023-10-13 06:09:46,295 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.777e+02 1.996e+02 2.257e+02 2.950e+02, threshold=3.992e+02, percent-clipped=0.0 2023-10-13 06:09:53,941 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.07 vs. limit=15.0 2023-10-13 06:09:59,651 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1299064.6666666667, ans=0.2 2023-10-13 06:10:06,926 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.60 vs. limit=10.0 2023-10-13 06:10:45,452 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1299204.6666666667, ans=0.125 2023-10-13 06:11:05,449 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1299251.3333333333, ans=0.125 2023-10-13 06:11:24,173 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1299344.6666666667, ans=0.0 2023-10-13 06:11:42,503 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1299391.3333333333, ans=0.2 2023-10-13 06:12:10,659 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1299438.0, ans=0.0 2023-10-13 06:12:22,208 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1299484.6666666667, ans=0.0 2023-10-13 06:12:24,549 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.663e+02 1.793e+02 1.987e+02 2.969e+02, threshold=3.586e+02, percent-clipped=0.0 2023-10-13 06:12:37,418 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1299531.3333333333, ans=0.1 2023-10-13 06:12:39,654 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.73 vs. limit=15.0 2023-10-13 06:12:43,281 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=1299531.3333333333, ans=0.025 2023-10-13 06:13:14,331 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.54 vs. limit=15.0 2023-10-13 06:13:46,413 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1299718.0, ans=0.125 2023-10-13 06:13:48,218 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1299718.0, ans=0.07 2023-10-13 06:14:37,250 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.25 vs. 
limit=15.0 2023-10-13 06:14:39,469 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1299858.0, ans=0.1 2023-10-13 06:14:48,730 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1299904.6666666667, ans=0.125 2023-10-13 06:14:52,696 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1299904.6666666667, ans=0.1 2023-10-13 06:14:52,805 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1299904.6666666667, ans=0.1 2023-10-13 06:15:07,454 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.480e+02 1.775e+02 1.923e+02 2.083e+02 2.868e+02, threshold=3.847e+02, percent-clipped=0.0 2023-10-13 06:15:28,806 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.95 vs. limit=15.0 2023-10-13 06:15:29,755 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1300044.6666666667, ans=0.125 2023-10-13 06:15:55,175 INFO [train.py:1031] (0/4) Epoch 21, batch 5500, loss[loss=0.1789, simple_loss=0.2768, pruned_loss=0.04051, over 16812.00 frames. ], tot_loss[loss=0.1896, simple_loss=0.2807, pruned_loss=0.04926, over 30709822.77 frames. ], batch size: 188, lr: 1.65e-03, grad_scale: 8.0 2023-10-13 06:16:05,239 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.74 vs. limit=15.0 2023-10-13 06:16:05,953 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1300138.0, ans=0.125 2023-10-13 06:16:09,420 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1300184.6666666667, ans=0.0 2023-10-13 06:16:32,812 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1300231.3333333333, ans=0.0 2023-10-13 06:16:41,136 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1300278.0, ans=0.125 2023-10-13 06:16:47,274 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1300278.0, ans=0.0 2023-10-13 06:17:02,689 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.09 vs. limit=15.0 2023-10-13 06:17:26,119 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.746e+02 1.896e+02 2.183e+02 3.374e+02, threshold=3.792e+02, percent-clipped=0.0 2023-10-13 06:17:30,384 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.90 vs. limit=22.5 2023-10-13 06:17:41,092 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1300464.6666666667, ans=0.025 2023-10-13 06:17:50,216 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.69 vs. 
limit=22.5 2023-10-13 06:18:43,817 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.24 vs. limit=15.0 2023-10-13 06:19:20,386 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1300791.3333333333, ans=0.125 2023-10-13 06:19:32,225 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=9.34 vs. limit=12.0 2023-10-13 06:19:33,607 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1300791.3333333333, ans=0.0 2023-10-13 06:20:03,344 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.452e+02 1.760e+02 1.913e+02 2.199e+02 3.182e+02, threshold=3.827e+02, percent-clipped=0.0 2023-10-13 06:20:10,558 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.80 vs. limit=15.0 2023-10-13 06:20:48,383 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1301024.6666666667, ans=0.125 2023-10-13 06:21:43,600 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1301164.6666666667, ans=0.0 2023-10-13 06:21:58,345 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 06:22:07,059 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.62 vs. limit=15.0 2023-10-13 06:22:16,066 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1301258.0, ans=0.125 2023-10-13 06:22:25,185 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1301304.6666666667, ans=0.125 2023-10-13 06:22:27,445 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1301304.6666666667, ans=0.0 2023-10-13 06:22:33,494 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1301351.3333333333, ans=0.125 2023-10-13 06:22:40,920 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.545e+02 1.817e+02 1.978e+02 2.194e+02 2.878e+02, threshold=3.955e+02, percent-clipped=0.0 2023-10-13 06:22:48,266 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1301398.0, ans=0.04949747468305833 2023-10-13 06:22:54,447 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1301398.0, ans=0.2 2023-10-13 06:23:01,367 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1301444.6666666667, ans=0.1 2023-10-13 06:23:19,423 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1301491.3333333333, ans=0.0 2023-10-13 06:23:25,484 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.11 vs. 
limit=12.0 2023-10-13 06:23:46,969 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1301584.6666666667, ans=0.125 2023-10-13 06:23:49,518 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1301584.6666666667, ans=0.125 2023-10-13 06:23:52,965 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1301584.6666666667, ans=0.2 2023-10-13 06:24:39,207 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1301724.6666666667, ans=0.125 2023-10-13 06:24:46,882 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1301724.6666666667, ans=0.04949747468305833 2023-10-13 06:24:53,277 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1301724.6666666667, ans=0.125 2023-10-13 06:24:54,258 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1301771.3333333333, ans=0.2 2023-10-13 06:25:04,452 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1301771.3333333333, ans=0.125 2023-10-13 06:25:16,906 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 06:25:21,299 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.464e+02 1.713e+02 1.903e+02 2.182e+02 3.667e+02, threshold=3.807e+02, percent-clipped=0.0 2023-10-13 06:25:51,547 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1301911.3333333333, ans=0.2 2023-10-13 06:25:58,481 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1301911.3333333333, ans=0.1 2023-10-13 06:26:16,895 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1301958.0, ans=0.0 2023-10-13 06:26:21,397 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0 2023-10-13 06:26:28,488 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1302004.6666666667, ans=0.125 2023-10-13 06:26:47,051 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.53 vs. limit=12.0 2023-10-13 06:27:13,247 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1302144.6666666667, ans=0.0 2023-10-13 06:27:50,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1302238.0, ans=0.125 2023-10-13 06:28:02,771 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 1.810e+02 2.076e+02 2.283e+02 3.214e+02, threshold=4.152e+02, percent-clipped=0.0 2023-10-13 06:28:07,448 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.01 vs. 
limit=15.0 2023-10-13 06:28:22,059 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.86 vs. limit=15.0 2023-10-13 06:28:29,806 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1302378.0, ans=0.2 2023-10-13 06:28:39,708 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.51 vs. limit=22.5 2023-10-13 06:28:51,405 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1302424.6666666667, ans=0.1 2023-10-13 06:28:59,748 INFO [train.py:1031] (0/4) Epoch 21, batch 6000, loss[loss=0.1875, simple_loss=0.2823, pruned_loss=0.04633, over 16591.00 frames. ], tot_loss[loss=0.1903, simple_loss=0.2813, pruned_loss=0.04962, over 31184035.86 frames. ], batch size: 56, lr: 1.65e-03, grad_scale: 32.0 2023-10-13 06:29:29,902 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=4.53 vs. limit=12.0 2023-10-13 06:30:05,356 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1302704.6666666667, ans=0.125 2023-10-13 06:30:46,787 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.529e+02 1.819e+02 1.986e+02 2.196e+02 3.283e+02, threshold=3.972e+02, percent-clipped=0.0 2023-10-13 06:30:53,579 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1302798.0, ans=0.2 2023-10-13 06:30:53,806 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.27 vs. limit=15.0 2023-10-13 06:31:48,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1302938.0, ans=0.125 2023-10-13 06:32:11,067 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.76 vs. limit=15.0 2023-10-13 06:33:04,958 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1303124.6666666667, ans=0.125 2023-10-13 06:33:25,062 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.62 vs. limit=15.0 2023-10-13 06:33:51,795 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.432e+02 1.815e+02 1.942e+02 2.124e+02 2.826e+02, threshold=3.884e+02, percent-clipped=0.0 2023-10-13 06:34:22,054 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1303311.3333333333, ans=0.125 2023-10-13 06:34:27,709 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1303311.3333333333, ans=0.0 2023-10-13 06:35:21,208 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1303451.3333333333, ans=0.5 2023-10-13 06:35:44,939 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=5.15 vs. 
limit=15.0 2023-10-13 06:36:04,392 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.85 vs. limit=15.0 2023-10-13 06:36:10,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1303591.3333333333, ans=0.125 2023-10-13 06:36:17,741 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1303591.3333333333, ans=0.125 2023-10-13 06:36:18,543 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1303591.3333333333, ans=0.0 2023-10-13 06:36:29,040 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.71 vs. limit=6.0 2023-10-13 06:37:00,954 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.536e+02 1.815e+02 2.072e+02 2.385e+02 3.282e+02, threshold=4.143e+02, percent-clipped=0.0 2023-10-13 06:37:50,570 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1303824.6666666667, ans=0.0 2023-10-13 06:38:15,327 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1303871.3333333333, ans=0.125 2023-10-13 06:38:21,269 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1303871.3333333333, ans=0.125 2023-10-13 06:38:22,627 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.16 vs. limit=22.5 2023-10-13 06:38:33,967 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-13 06:39:16,953 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1304011.3333333333, ans=0.125 2023-10-13 06:39:23,900 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1304011.3333333333, ans=0.125 2023-10-13 06:39:27,213 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1304011.3333333333, ans=0.1 2023-10-13 06:39:28,358 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1304011.3333333333, ans=0.1 2023-10-13 06:39:32,223 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.35 vs. 
limit=22.5 2023-10-13 06:39:38,410 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 06:39:39,603 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1304058.0, ans=0.0 2023-10-13 06:40:16,297 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1304151.3333333333, ans=0.125 2023-10-13 06:40:23,959 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.614e+02 1.844e+02 1.968e+02 2.249e+02 3.506e+02, threshold=3.936e+02, percent-clipped=0.0 2023-10-13 06:40:32,865 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.52 vs. limit=6.0 2023-10-13 06:41:01,182 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=1304244.6666666667, ans=0.025 2023-10-13 06:41:04,282 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1304244.6666666667, ans=0.2 2023-10-13 06:41:11,177 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1304291.3333333333, ans=0.125 2023-10-13 06:42:06,845 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1304431.3333333333, ans=0.125 2023-10-13 06:42:32,593 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1304478.0, ans=0.125 2023-10-13 06:42:54,017 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1304571.3333333333, ans=0.125 2023-10-13 06:43:09,953 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.83 vs. limit=15.0 2023-10-13 06:43:32,119 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.515e+02 1.734e+02 1.914e+02 2.144e+02 2.780e+02, threshold=3.829e+02, percent-clipped=0.0 2023-10-13 06:44:05,762 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.23 vs. limit=15.0 2023-10-13 06:44:47,847 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1304804.6666666667, ans=0.0 2023-10-13 06:44:48,427 INFO [train.py:1031] (0/4) Epoch 21, batch 6500, loss[loss=0.1801, simple_loss=0.2684, pruned_loss=0.04587, over 15655.00 frames. ], tot_loss[loss=0.1904, simple_loss=0.2816, pruned_loss=0.04959, over 31546404.88 frames. ], batch size: 35, lr: 1.65e-03, grad_scale: 16.0 2023-10-13 06:45:35,769 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.94 vs. 
limit=15.0 2023-10-13 06:45:56,573 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=1304944.6666666667, ans=0.05 2023-10-13 06:46:32,817 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 06:47:27,019 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1305084.6666666667, ans=0.04949747468305833 2023-10-13 06:47:29,654 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.476e+02 1.811e+02 2.008e+02 2.196e+02 3.105e+02, threshold=4.016e+02, percent-clipped=0.0 2023-10-13 06:48:29,484 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1305224.6666666667, ans=0.0 2023-10-13 06:48:31,957 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1305224.6666666667, ans=0.125 2023-10-13 06:48:50,831 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.55 vs. limit=22.5 2023-10-13 06:49:25,137 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.07 vs. limit=10.0 2023-10-13 06:49:40,625 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.84 vs. limit=12.0 2023-10-13 06:51:11,002 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.580e+02 1.776e+02 1.935e+02 2.243e+02 2.884e+02, threshold=3.869e+02, percent-clipped=0.0 2023-10-13 06:51:32,431 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=9.75 vs. limit=12.0 2023-10-13 06:51:48,828 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1305644.6666666667, ans=0.125 2023-10-13 06:51:56,796 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1305644.6666666667, ans=0.125 2023-10-13 06:52:18,355 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1305691.3333333333, ans=0.0 2023-10-13 06:52:27,679 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1305738.0, ans=0.1 2023-10-13 06:52:56,619 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1305784.6666666667, ans=0.125 2023-10-13 06:53:34,874 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1305878.0, ans=0.1 2023-10-13 06:53:38,382 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1305878.0, ans=0.125 2023-10-13 06:54:11,848 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.92 vs. 
limit=12.0 2023-10-13 06:54:33,111 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.771e+02 1.941e+02 2.216e+02 2.930e+02, threshold=3.882e+02, percent-clipped=0.0 2023-10-13 06:54:54,937 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1306064.6666666667, ans=0.125 2023-10-13 06:55:18,728 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1306158.0, ans=0.125 2023-10-13 06:55:55,984 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1306251.3333333333, ans=0.1 2023-10-13 06:55:56,158 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.16 vs. limit=12.0 2023-10-13 06:56:40,735 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1306344.6666666667, ans=0.2 2023-10-13 06:56:46,805 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1306344.6666666667, ans=0.0 2023-10-13 06:57:11,999 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.03 vs. limit=6.0 2023-10-13 06:57:27,901 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1306438.0, ans=0.0 2023-10-13 06:57:57,786 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.363e+02 1.703e+02 1.862e+02 2.052e+02 3.179e+02, threshold=3.723e+02, percent-clipped=0.0 2023-10-13 06:59:15,538 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.27 vs. 
limit=12.0 2023-10-13 06:59:27,557 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1306624.6666666667, ans=0.1 2023-10-13 06:59:28,682 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1306624.6666666667, ans=0.125 2023-10-13 06:59:39,370 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-280000.pt 2023-10-13 07:00:16,005 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1306718.0, ans=0.1 2023-10-13 07:01:00,974 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1306811.3333333333, ans=0.0 2023-10-13 07:01:18,379 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1306811.3333333333, ans=0.0 2023-10-13 07:01:24,509 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1306858.0, ans=0.0 2023-10-13 07:01:25,991 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten.whitening_limit, batch_count=1306858.0, ans=22.5 2023-10-13 07:02:02,508 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1306904.6666666667, ans=0.2 2023-10-13 07:02:31,327 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.414e+02 1.802e+02 1.962e+02 2.271e+02 2.953e+02, threshold=3.924e+02, percent-clipped=0.0 2023-10-13 07:03:22,219 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1307091.3333333333, ans=0.0 2023-10-13 07:03:40,102 INFO [train.py:1031] (0/4) Epoch 21, batch 7000, loss[loss=0.1994, simple_loss=0.2942, pruned_loss=0.05227, over 16413.00 frames. ], tot_loss[loss=0.1906, simple_loss=0.282, pruned_loss=0.04961, over 31811926.86 frames. ], batch size: 50, lr: 1.65e-03, grad_scale: 32.0 2023-10-13 07:03:48,158 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.21 vs. limit=10.0 2023-10-13 07:04:39,018 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1307231.3333333333, ans=0.125 2023-10-13 07:05:02,157 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1307278.0, ans=0.125 2023-10-13 07:05:03,147 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1307278.0, ans=0.125 2023-10-13 07:05:08,274 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.22 vs. 
limit=10.0 2023-10-13 07:05:50,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1307371.3333333333, ans=0.125 2023-10-13 07:05:50,546 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1307371.3333333333, ans=0.125 2023-10-13 07:06:11,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1307418.0, ans=0.09899494936611666 2023-10-13 07:06:12,516 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.577e-03 2023-10-13 07:06:22,767 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.513e+02 1.764e+02 1.976e+02 2.167e+02 3.433e+02, threshold=3.953e+02, percent-clipped=0.0 2023-10-13 07:06:33,489 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1307464.6666666667, ans=0.125 2023-10-13 07:07:03,426 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1307511.3333333333, ans=0.125 2023-10-13 07:08:57,863 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1307791.3333333333, ans=0.0 2023-10-13 07:09:01,436 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1307791.3333333333, ans=0.1 2023-10-13 07:09:10,967 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1307791.3333333333, ans=0.5 2023-10-13 07:09:15,620 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1307838.0, ans=0.125 2023-10-13 07:09:33,645 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.25 vs. limit=15.0 2023-10-13 07:09:44,349 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.572e+02 1.860e+02 2.024e+02 2.276e+02 3.097e+02, threshold=4.048e+02, percent-clipped=0.0 2023-10-13 07:09:58,418 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1307978.0, ans=0.0 2023-10-13 07:10:07,924 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1307978.0, ans=0.125 2023-10-13 07:10:12,398 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1308024.6666666667, ans=0.2 2023-10-13 07:10:30,233 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1308071.3333333333, ans=0.125 2023-10-13 07:10:32,698 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.86 vs. 
limit=22.5 2023-10-13 07:10:39,968 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1308118.0, ans=0.125 2023-10-13 07:10:52,245 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1308118.0, ans=0.1 2023-10-13 07:11:02,481 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1308164.6666666667, ans=0.09899494936611666 2023-10-13 07:11:36,821 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1308304.6666666667, ans=0.125 2023-10-13 07:11:40,460 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1308304.6666666667, ans=0.04949747468305833 2023-10-13 07:11:52,353 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1308351.3333333333, ans=0.0 2023-10-13 07:12:01,438 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.735e+02 1.954e+02 2.252e+02 3.130e+02, threshold=3.908e+02, percent-clipped=0.0 2023-10-13 07:12:16,164 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1308444.6666666667, ans=0.125 2023-10-13 07:12:27,795 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1308491.3333333333, ans=0.2 2023-10-13 07:12:28,763 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1308491.3333333333, ans=0.125 2023-10-13 07:13:17,595 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.66 vs. limit=15.0 2023-10-13 07:13:21,121 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.19 vs. limit=15.0 2023-10-13 07:13:22,395 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1308678.0, ans=0.0 2023-10-13 07:13:24,824 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1308678.0, ans=0.125 2023-10-13 07:13:52,730 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.24 vs. limit=15.0 2023-10-13 07:14:05,692 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1308864.6666666667, ans=0.1 2023-10-13 07:14:06,861 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.434e+02 1.757e+02 1.928e+02 2.175e+02 2.976e+02, threshold=3.856e+02, percent-clipped=0.0 2023-10-13 07:14:20,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1308911.3333333333, ans=0.125 2023-10-13 07:14:21,562 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.26 vs. 
limit=15.0 2023-10-13 07:14:27,801 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1308911.3333333333, ans=0.0 2023-10-13 07:14:41,774 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1309004.6666666667, ans=0.0 2023-10-13 07:14:55,918 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1309051.3333333333, ans=0.125 2023-10-13 07:15:00,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1309051.3333333333, ans=0.1 2023-10-13 07:15:02,828 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1309098.0, ans=0.125 2023-10-13 07:15:15,654 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 07:15:22,664 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1309144.6666666667, ans=0.125 2023-10-13 07:15:47,402 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.39 vs. limit=15.0 2023-10-13 07:15:56,651 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1309331.3333333333, ans=0.0 2023-10-13 07:15:57,095 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.510e+02 1.802e+02 1.981e+02 2.180e+02 3.151e+02, threshold=3.962e+02, percent-clipped=0.0 2023-10-13 07:16:13,275 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 07:16:32,249 INFO [train.py:1031] (0/4) Epoch 21, batch 7500, loss[loss=0.194, simple_loss=0.2781, pruned_loss=0.05499, over 15449.00 frames. ], tot_loss[loss=0.1906, simple_loss=0.2819, pruned_loss=0.04966, over 32003227.52 frames. ], batch size: 35, lr: 1.64e-03, grad_scale: 16.0 2023-10-13 07:16:42,523 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1309518.0, ans=0.125 2023-10-13 07:16:52,830 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.60 vs. limit=15.0 2023-10-13 07:17:07,799 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1309611.3333333333, ans=0.1 2023-10-13 07:17:08,719 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1309611.3333333333, ans=0.1 2023-10-13 07:17:09,135 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.71 vs. limit=15.0 2023-10-13 07:17:17,841 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1309658.0, ans=0.0 2023-10-13 07:17:19,669 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1309658.0, ans=0.125 2023-10-13 07:17:21,664 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.01 vs. 
limit=22.5 2023-10-13 07:17:23,154 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1309658.0, ans=0.1 2023-10-13 07:17:45,895 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1309751.3333333333, ans=0.0 2023-10-13 07:17:50,391 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.530e+02 1.766e+02 1.921e+02 2.088e+02 2.962e+02, threshold=3.842e+02, percent-clipped=0.0 2023-10-13 07:18:00,753 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1309844.6666666667, ans=0.125 2023-10-13 07:18:06,975 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.20 vs. limit=15.0 2023-10-13 07:18:08,490 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.54 vs. limit=15.0 2023-10-13 07:18:22,005 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1309938.0, ans=0.125 2023-10-13 07:18:28,884 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1309938.0, ans=0.2 2023-10-13 07:18:29,777 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1309938.0, ans=0.125 2023-10-13 07:18:44,835 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1310031.3333333333, ans=0.0 2023-10-13 07:18:44,959 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1310031.3333333333, ans=0.0 2023-10-13 07:18:59,530 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1310078.0, ans=0.125 2023-10-13 07:19:35,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1310171.3333333333, ans=0.1 2023-10-13 07:19:37,951 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.95 vs. limit=22.5 2023-10-13 07:19:57,886 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.727e+02 1.909e+02 2.192e+02 2.863e+02, threshold=3.817e+02, percent-clipped=0.0 2023-10-13 07:20:06,557 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1310264.6666666667, ans=0.125 2023-10-13 07:20:25,112 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1310358.0, ans=0.1 2023-10-13 07:20:47,812 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1310451.3333333333, ans=0.125 2023-10-13 07:20:54,952 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.77 vs. limit=12.0 2023-10-13 07:21:11,450 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.90 vs. 
limit=22.5 2023-10-13 07:21:15,945 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.98 vs. limit=6.0 2023-10-13 07:21:23,767 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1310591.3333333333, ans=0.125 2023-10-13 07:21:28,409 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.63 vs. limit=15.0 2023-10-13 07:21:38,998 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1310638.0, ans=0.0 2023-10-13 07:21:46,972 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1310684.6666666667, ans=0.1 2023-10-13 07:21:49,572 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1310684.6666666667, ans=0.125 2023-10-13 07:21:49,980 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.65 vs. limit=15.0 2023-10-13 07:21:55,264 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.518e+02 1.732e+02 1.923e+02 2.032e+02 2.652e+02, threshold=3.847e+02, percent-clipped=0.0 2023-10-13 07:22:32,503 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.79 vs. limit=15.0 2023-10-13 07:22:35,782 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.74 vs. limit=15.0 2023-10-13 07:22:40,952 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1310918.0, ans=0.125 2023-10-13 07:22:43,427 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1310918.0, ans=0.2 2023-10-13 07:22:59,921 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1310964.6666666667, ans=0.125 2023-10-13 07:23:10,835 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.62 vs. limit=22.5 2023-10-13 07:23:15,802 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1311058.0, ans=0.0 2023-10-13 07:23:23,391 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1311058.0, ans=0.1 2023-10-13 07:23:30,867 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.60 vs. 
limit=15.0 2023-10-13 07:23:39,542 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1311151.3333333333, ans=0.2 2023-10-13 07:23:42,001 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1311151.3333333333, ans=0.1 2023-10-13 07:23:47,094 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1311151.3333333333, ans=0.125 2023-10-13 07:23:50,260 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1311151.3333333333, ans=0.125 2023-10-13 07:23:51,483 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.68 vs. limit=10.0 2023-10-13 07:23:53,674 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 1.773e+02 1.982e+02 2.214e+02 2.993e+02, threshold=3.964e+02, percent-clipped=0.0 2023-10-13 07:23:55,951 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1311198.0, ans=0.125 2023-10-13 07:24:24,855 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.74 vs. limit=15.0 2023-10-13 07:24:29,808 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1311338.0, ans=0.2 2023-10-13 07:24:48,331 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1311431.3333333333, ans=0.125 2023-10-13 07:24:52,286 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.17 vs. limit=22.5 2023-10-13 07:25:35,742 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.38 vs. limit=15.0 2023-10-13 07:25:38,963 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1311618.0, ans=0.025 2023-10-13 07:25:51,094 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.675e+02 1.765e+02 1.925e+02 2.697e+02, threshold=3.531e+02, percent-clipped=0.0 2023-10-13 07:25:51,864 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-10-13 07:26:10,812 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1311711.3333333333, ans=0.125 2023-10-13 07:26:13,130 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1311758.0, ans=0.1 2023-10-13 07:26:15,740 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1311758.0, ans=0.125 2023-10-13 07:26:18,421 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1311758.0, ans=0.125 2023-10-13 07:26:22,921 INFO [train.py:1031] (0/4) Epoch 21, batch 8000, loss[loss=0.1857, simple_loss=0.2796, pruned_loss=0.04588, over 16452.00 frames. 
], tot_loss[loss=0.1898, simple_loss=0.2813, pruned_loss=0.04912, over 32199554.43 frames. ], batch size: 50, lr: 1.64e-03, grad_scale: 16.0 2023-10-13 07:26:27,803 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.69 vs. limit=15.0 2023-10-13 07:26:28,820 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.01 vs. limit=22.5 2023-10-13 07:26:51,014 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.66 vs. limit=15.0 2023-10-13 07:27:21,899 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1312038.0, ans=0.125 2023-10-13 07:27:32,191 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 07:27:39,410 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1312131.3333333333, ans=0.0 2023-10-13 07:27:41,159 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.472e+02 1.690e+02 1.891e+02 2.152e+02 3.196e+02, threshold=3.782e+02, percent-clipped=0.0 2023-10-13 07:27:44,796 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1312131.3333333333, ans=0.125 2023-10-13 07:27:46,065 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.86 vs. limit=6.0 2023-10-13 07:27:52,263 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1312178.0, ans=0.0 2023-10-13 07:27:55,811 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten.whitening_limit, batch_count=1312178.0, ans=15.0 2023-10-13 07:28:25,941 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1312318.0, ans=0.1 2023-10-13 07:28:31,609 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1312364.6666666667, ans=0.1 2023-10-13 07:28:46,273 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1312411.3333333333, ans=0.0 2023-10-13 07:28:55,311 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1312458.0, ans=0.035 2023-10-13 07:29:04,926 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1312458.0, ans=0.1 2023-10-13 07:29:11,363 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.45 vs. 
limit=6.0 2023-10-13 07:29:46,869 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.432e+02 1.781e+02 1.990e+02 2.138e+02 2.951e+02, threshold=3.979e+02, percent-clipped=0.0 2023-10-13 07:29:50,418 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1312598.0, ans=0.0 2023-10-13 07:30:10,822 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1312691.3333333333, ans=0.125 2023-10-13 07:30:16,442 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1312691.3333333333, ans=0.2 2023-10-13 07:30:23,801 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.78 vs. limit=10.0 2023-10-13 07:30:25,474 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1312738.0, ans=0.125 2023-10-13 07:30:27,032 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1312738.0, ans=0.0 2023-10-13 07:30:43,876 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1312831.3333333333, ans=0.125 2023-10-13 07:30:50,288 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1312831.3333333333, ans=0.0 2023-10-13 07:30:50,423 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1312831.3333333333, ans=0.125 2023-10-13 07:31:00,155 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1312878.0, ans=0.125 2023-10-13 07:31:06,434 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1312924.6666666667, ans=0.125 2023-10-13 07:31:08,782 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.60 vs. limit=22.5 2023-10-13 07:31:22,967 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.81 vs. 
limit=15.0 2023-10-13 07:31:23,486 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1312971.3333333333, ans=0.125 2023-10-13 07:31:27,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1313018.0, ans=0.125 2023-10-13 07:31:40,637 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.749e+02 1.975e+02 2.410e+02 3.621e+02, threshold=3.950e+02, percent-clipped=0.0 2023-10-13 07:32:01,060 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1313158.0, ans=0.0 2023-10-13 07:32:56,103 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1313344.6666666667, ans=0.2 2023-10-13 07:32:57,693 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1313344.6666666667, ans=0.125 2023-10-13 07:32:57,705 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1313344.6666666667, ans=0.2 2023-10-13 07:32:57,719 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1313344.6666666667, ans=0.05 2023-10-13 07:33:23,461 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1313484.6666666667, ans=0.0 2023-10-13 07:33:36,178 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.550e+02 1.898e+02 2.062e+02 2.388e+02 2.998e+02, threshold=4.124e+02, percent-clipped=0.0 2023-10-13 07:33:43,564 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1313578.0, ans=0.125 2023-10-13 07:34:14,574 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1313671.3333333333, ans=0.125 2023-10-13 07:34:26,172 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1313718.0, ans=0.125 2023-10-13 07:34:54,014 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1313811.3333333333, ans=0.125 2023-10-13 07:34:54,380 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.57 vs. limit=15.0 2023-10-13 07:34:59,017 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.11 vs. limit=10.0 2023-10-13 07:35:00,117 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.18 vs. limit=12.0 2023-10-13 07:35:20,052 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1313951.3333333333, ans=0.2 2023-10-13 07:35:29,125 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.55 vs. 
limit=10.0 2023-10-13 07:35:31,669 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.476e+02 1.823e+02 1.977e+02 2.190e+02 3.159e+02, threshold=3.954e+02, percent-clipped=0.0 2023-10-13 07:36:06,786 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1314091.3333333333, ans=0.125 2023-10-13 07:36:08,463 INFO [train.py:1031] (0/4) Epoch 21, batch 8500, loss[loss=0.2082, simple_loss=0.3013, pruned_loss=0.05752, over 16622.00 frames. ], tot_loss[loss=0.1897, simple_loss=0.2814, pruned_loss=0.04897, over 32321259.85 frames. ], batch size: 219, lr: 1.64e-03, grad_scale: 32.0 2023-10-13 07:36:08,729 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1314138.0, ans=0.125 2023-10-13 07:36:13,757 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1314138.0, ans=0.5 2023-10-13 07:36:27,089 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1314184.6666666667, ans=0.1 2023-10-13 07:36:35,508 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1314231.3333333333, ans=0.0 2023-10-13 07:36:36,575 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1314231.3333333333, ans=0.125 2023-10-13 07:36:46,676 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1314278.0, ans=0.0 2023-10-13 07:36:50,621 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1314278.0, ans=0.1 2023-10-13 07:37:31,166 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.542e+02 1.788e+02 1.945e+02 2.120e+02 2.906e+02, threshold=3.890e+02, percent-clipped=0.0 2023-10-13 07:38:03,706 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1314558.0, ans=0.2 2023-10-13 07:38:14,019 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1314604.6666666667, ans=10.0 2023-10-13 07:38:19,867 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1314604.6666666667, ans=0.2 2023-10-13 07:38:26,398 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.36 vs. 
limit=6.0 2023-10-13 07:38:43,099 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1314698.0, ans=0.125 2023-10-13 07:38:44,667 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1314698.0, ans=0.0 2023-10-13 07:38:48,315 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1314744.6666666667, ans=0.125 2023-10-13 07:39:03,207 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1314791.3333333333, ans=0.1 2023-10-13 07:39:09,426 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.60 vs. limit=10.0 2023-10-13 07:39:37,222 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.94 vs. limit=6.0 2023-10-13 07:39:37,419 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.468e+02 1.784e+02 1.977e+02 2.317e+02 3.173e+02, threshold=3.955e+02, percent-clipped=0.0 2023-10-13 07:39:54,883 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 07:40:03,421 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1315024.6666666667, ans=0.125 2023-10-13 07:40:03,497 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1315024.6666666667, ans=0.1 2023-10-13 07:40:10,921 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1315024.6666666667, ans=0.125 2023-10-13 07:40:22,237 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.51 vs. 
limit=22.5 2023-10-13 07:40:22,906 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1315118.0, ans=0.1 2023-10-13 07:40:35,374 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1315164.6666666667, ans=0.125 2023-10-13 07:40:39,331 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1315164.6666666667, ans=0.125 2023-10-13 07:40:40,402 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1315164.6666666667, ans=0.125 2023-10-13 07:40:59,231 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1315258.0, ans=0.5 2023-10-13 07:41:04,342 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1315258.0, ans=0.1 2023-10-13 07:41:14,028 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1315304.6666666667, ans=0.125 2023-10-13 07:41:30,167 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1315351.3333333333, ans=0.1 2023-10-13 07:41:39,968 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.660e+02 1.792e+02 1.997e+02 2.662e+02, threshold=3.584e+02, percent-clipped=0.0 2023-10-13 07:41:41,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1315398.0, ans=0.0 2023-10-13 07:41:42,534 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.04 vs. 
limit=12.0 2023-10-13 07:42:16,269 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1315538.0, ans=0.125 2023-10-13 07:42:16,309 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1315538.0, ans=0.025 2023-10-13 07:42:35,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1315631.3333333333, ans=0.125 2023-10-13 07:42:41,972 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1315631.3333333333, ans=0.0 2023-10-13 07:42:55,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1315678.0, ans=0.1 2023-10-13 07:43:20,520 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1315818.0, ans=0.125 2023-10-13 07:43:29,141 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.389e+02 1.691e+02 1.902e+02 2.166e+02 3.080e+02, threshold=3.803e+02, percent-clipped=0.0 2023-10-13 07:43:39,078 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1315911.3333333333, ans=0.1 2023-10-13 07:43:39,219 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1315911.3333333333, ans=0.125 2023-10-13 07:43:39,533 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.34 vs. limit=22.5 2023-10-13 07:44:02,213 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1316004.6666666667, ans=0.125 2023-10-13 07:44:10,509 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.21 vs. limit=22.5 2023-10-13 07:44:14,871 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=6.28 vs. limit=15.0 2023-10-13 07:44:18,441 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1316051.3333333333, ans=0.125 2023-10-13 07:44:41,541 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1316144.6666666667, ans=0.1 2023-10-13 07:44:42,633 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.50 vs. limit=15.0 2023-10-13 07:44:46,768 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1316191.3333333333, ans=0.0 2023-10-13 07:44:46,920 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1316191.3333333333, ans=0.07 2023-10-13 07:44:47,907 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.23 vs. 
limit=15.0 2023-10-13 07:44:47,914 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.30 vs. limit=10.0 2023-10-13 07:44:59,525 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1316238.0, ans=0.1 2023-10-13 07:45:02,431 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1316238.0, ans=0.0 2023-10-13 07:45:02,499 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1316238.0, ans=0.2 2023-10-13 07:45:16,772 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.46 vs. limit=15.0 2023-10-13 07:45:19,495 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.447e+02 1.777e+02 1.942e+02 2.146e+02 3.297e+02, threshold=3.884e+02, percent-clipped=0.0 2023-10-13 07:45:35,290 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1316378.0, ans=0.125 2023-10-13 07:45:41,586 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1316424.6666666667, ans=0.0 2023-10-13 07:45:49,470 INFO [train.py:1031] (0/4) Epoch 21, batch 9000, loss[loss=0.2024, simple_loss=0.2935, pruned_loss=0.05563, over 16553.00 frames. ], tot_loss[loss=0.1893, simple_loss=0.2809, pruned_loss=0.04886, over 32419616.80 frames. ], batch size: 219, lr: 1.64e-03, grad_scale: 32.0 2023-10-13 07:46:06,723 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1316518.0, ans=0.125 2023-10-13 07:46:08,897 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1316518.0, ans=0.125 2023-10-13 07:46:21,190 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1316564.6666666667, ans=0.2 2023-10-13 07:46:25,084 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1316611.3333333333, ans=0.1 2023-10-13 07:46:28,499 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1316611.3333333333, ans=0.1 2023-10-13 07:46:38,630 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1316658.0, ans=0.0 2023-10-13 07:46:41,149 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1316658.0, ans=0.125 2023-10-13 07:46:43,436 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1316704.6666666667, ans=0.125 2023-10-13 07:46:59,956 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1316751.3333333333, ans=0.125 2023-10-13 07:47:07,158 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.488e+02 1.761e+02 1.924e+02 2.106e+02 2.641e+02, threshold=3.847e+02, percent-clipped=0.0 2023-10-13 07:47:24,269 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, 
num_channels=192, metric=4.24 vs. limit=15.0 2023-10-13 07:47:26,312 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.53 vs. limit=15.0 2023-10-13 07:48:00,231 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1317031.3333333333, ans=0.125 2023-10-13 07:48:03,030 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1317031.3333333333, ans=0.125 2023-10-13 07:48:03,072 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1317031.3333333333, ans=0.0 2023-10-13 07:48:03,404 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.66 vs. limit=15.0 2023-10-13 07:48:53,057 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.815e+02 1.967e+02 2.149e+02 2.760e+02, threshold=3.934e+02, percent-clipped=0.0 2023-10-13 07:49:10,682 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.88 vs. limit=22.5 2023-10-13 07:49:15,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1317358.0, ans=0.0 2023-10-13 07:49:28,052 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1317404.6666666667, ans=0.1 2023-10-13 07:49:32,965 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1317404.6666666667, ans=0.1 2023-10-13 07:49:37,021 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.64 vs. limit=15.0 2023-10-13 07:49:56,497 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1317544.6666666667, ans=10.0 2023-10-13 07:50:05,098 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 07:50:24,580 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.31 vs. 
limit=10.0 2023-10-13 07:50:26,234 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1317684.6666666667, ans=0.125 2023-10-13 07:50:32,872 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1317684.6666666667, ans=0.125 2023-10-13 07:50:37,761 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.523e+02 1.808e+02 1.948e+02 2.163e+02 3.903e+02, threshold=3.897e+02, percent-clipped=0.0 2023-10-13 07:50:39,034 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1317731.3333333333, ans=0.125 2023-10-13 07:51:06,190 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1317871.3333333333, ans=0.125 2023-10-13 07:51:07,784 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1317871.3333333333, ans=0.125 2023-10-13 07:51:16,074 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1317918.0, ans=0.0 2023-10-13 07:51:17,053 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1317918.0, ans=0.125 2023-10-13 07:51:34,917 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 07:52:03,710 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1318104.6666666667, ans=0.125 2023-10-13 07:52:03,924 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.24 vs. limit=15.0 2023-10-13 07:52:17,619 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1318151.3333333333, ans=0.0 2023-10-13 07:52:28,451 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1318151.3333333333, ans=0.2 2023-10-13 07:52:33,001 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1318198.0, ans=0.0 2023-10-13 07:52:35,994 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.557e+02 1.827e+02 1.968e+02 2.182e+02 3.020e+02, threshold=3.936e+02, percent-clipped=0.0 2023-10-13 07:52:50,541 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1318244.6666666667, ans=0.125 2023-10-13 07:53:07,925 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.21 vs. 
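limit=22.5

The `ScheduledFloat` entries that fill this log track hyper-parameters (dropout probabilities, skip rates, balancer probabilities) whose value is a function of `batch_count` rather than a constant. A minimal sketch of such a scheduled value, assuming a piecewise-linear interpolation between breakpoints (the breakpoints below are illustrative, not the recipe's actual schedules):

```python
# A minimal sketch of a ScheduledFloat-style value: piecewise-linear in
# batch_count and held constant outside the outermost breakpoints. The
# (batch_count, value) pairs below are illustrative, not the recipe's
# actual schedules.
import bisect

class PiecewiseLinear:
    def __init__(self, *points):
        self.xs = [p[0] for p in points]  # batch counts, ascending
        self.ys = [p[1] for p in points]  # scheduled values

    def __call__(self, batch_count: float) -> float:
        if batch_count <= self.xs[0]:
            return self.ys[0]
        if batch_count >= self.xs[-1]:
            return self.ys[-1]
        i = bisect.bisect_right(self.xs, batch_count) - 1
        x0, x1 = self.xs[i], self.xs[i + 1]
        y0, y1 = self.ys[i], self.ys[i + 1]
        return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

# A dropout probability that anneals from 0.3 to 0.1 over the first 20k
# batches, then stays flat:
dropout_p = PiecewiseLinear((0.0, 0.3), (20000.0, 0.1))
assert dropout_p(1318338.0) == 0.1  # deep into training, schedule is flat
```

Under this model, the long runs of `ans=0.1` for the `*.out_proj.dropout_p` entries would correspond to schedules that have already passed their last breakpoint.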
2023-10-13 07:53:09,749 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1318338.0, ans=0.125 2023-10-13 07:53:42,409 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1318478.0, ans=0.125 2023-10-13 07:54:31,518 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.841e+02 1.999e+02 2.301e+02 3.065e+02, threshold=3.999e+02, percent-clipped=0.0 2023-10-13 07:54:44,515 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1318711.3333333333, ans=0.125 2023-10-13 07:54:44,561 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1318711.3333333333, ans=0.1 2023-10-13 07:54:44,919 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.61 vs. limit=6.0 2023-10-13 07:54:50,186 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.05 vs. limit=15.0 2023-10-13 07:54:50,975 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1318711.3333333333, ans=0.125 2023-10-13 07:54:55,567 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1318758.0, ans=0.125 2023-10-13 07:55:01,320 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1318758.0, ans=0.2 2023-10-13 07:55:03,112 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1318758.0, ans=0.1 2023-10-13 07:55:05,951 INFO [train.py:1031] (0/4) Epoch 21, batch 9500, loss[loss=0.1647, simple_loss=0.256, pruned_loss=0.03675, over 16607.00 frames. ], tot_loss[loss=0.1898, simple_loss=0.2816, pruned_loss=0.04906, over 32514644.21 frames. ], batch size: 66, lr: 1.64e-03, grad_scale: 32.0 2023-10-13 07:55:11,407 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1318804.6666666667, ans=0.125 2023-10-13 07:55:26,057 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.75 vs.
limit=15.0 2023-10-13 07:55:44,195 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 07:55:50,612 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1318991.3333333333, ans=0.0 2023-10-13 07:55:54,307 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1318991.3333333333, ans=0.125 2023-10-13 07:56:00,166 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1318991.3333333333, ans=0.1 2023-10-13 07:56:26,226 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.750e+02 1.942e+02 2.191e+02 2.759e+02, threshold=3.884e+02, percent-clipped=0.0 2023-10-13 07:56:39,194 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1319178.0, ans=0.0 2023-10-13 07:56:45,622 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.53 vs. limit=15.0 2023-10-13 07:57:13,534 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1319318.0, ans=0.125 2023-10-13 07:57:15,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1319318.0, ans=0.09899494936611666 2023-10-13 07:57:34,560 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1319411.3333333333, ans=0.125 2023-10-13 07:57:34,604 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1319411.3333333333, ans=0.125 2023-10-13 07:57:35,536 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1319411.3333333333, ans=0.125 2023-10-13 07:57:37,679 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1319411.3333333333, ans=0.125 2023-10-13 07:57:54,303 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1319504.6666666667, ans=0.0 2023-10-13 07:58:21,669 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.415e+02 1.766e+02 1.916e+02 2.137e+02 2.778e+02, threshold=3.832e+02, percent-clipped=0.0 2023-10-13 07:58:22,323 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.68 vs. 
limit=22.5 2023-10-13 07:58:25,116 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1319598.0, ans=0.125 2023-10-13 07:58:43,934 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1319691.3333333333, ans=0.07 2023-10-13 07:58:45,438 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1319691.3333333333, ans=0.1 2023-10-13 07:58:49,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1319691.3333333333, ans=0.2 2023-10-13 07:58:51,326 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1319691.3333333333, ans=0.0 2023-10-13 07:58:54,572 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1319738.0, ans=0.0 2023-10-13 07:58:54,663 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1319738.0, ans=0.125 2023-10-13 07:59:07,931 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.58 vs. limit=15.0 2023-10-13 07:59:09,508 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1319784.6666666667, ans=0.125 2023-10-13 07:59:17,990 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1319831.3333333333, ans=0.2 2023-10-13 07:59:21,022 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1319831.3333333333, ans=0.0 2023-10-13 07:59:39,612 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=1319924.6666666667, ans=0.1 2023-10-13 07:59:47,805 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1319971.3333333333, ans=0.1 2023-10-13 08:00:04,393 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.50 vs. limit=15.0 2023-10-13 08:00:11,656 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.434e+02 1.767e+02 1.911e+02 2.117e+02 3.188e+02, threshold=3.821e+02, percent-clipped=0.0 2023-10-13 08:00:38,441 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1320158.0, ans=0.0 2023-10-13 08:01:14,272 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1320298.0, ans=0.0 2023-10-13 08:01:45,468 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.80 vs. limit=15.0 2023-10-13 08:01:46,485 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.30 vs. 
limit=22.5 2023-10-13 08:01:52,603 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1320484.6666666667, ans=0.125 2023-10-13 08:02:00,902 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.01 vs. limit=10.0 2023-10-13 08:02:05,812 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.740e+02 1.884e+02 2.038e+02 2.676e+02, threshold=3.768e+02, percent-clipped=0.0 2023-10-13 08:02:18,822 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1320578.0, ans=0.0 2023-10-13 08:03:15,531 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1320811.3333333333, ans=0.125 2023-10-13 08:03:33,367 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1320904.6666666667, ans=0.125 2023-10-13 08:03:36,497 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1320904.6666666667, ans=0.125 2023-10-13 08:03:37,298 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 08:03:42,629 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1320951.3333333333, ans=0.1 2023-10-13 08:03:55,287 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.448e+02 1.729e+02 1.864e+02 2.107e+02 2.605e+02, threshold=3.728e+02, percent-clipped=0.0 2023-10-13 08:03:59,736 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1320998.0, ans=0.0 2023-10-13 08:04:03,116 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1321044.6666666667, ans=0.04949747468305833 2023-10-13 08:04:09,626 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1321044.6666666667, ans=0.125 2023-10-13 08:04:20,806 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1321091.3333333333, ans=0.125 2023-10-13 08:04:24,713 INFO [train.py:1031] (0/4) Epoch 21, batch 10000, loss[loss=0.1852, simple_loss=0.2814, pruned_loss=0.04449, over 16911.00 frames. ], tot_loss[loss=0.1893, simple_loss=0.2809, pruned_loss=0.04884, over 32586227.94 frames. 
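], batch size: 116, lr: 1.64e-03, grad_scale: 32.0

The `loss[...]` and `tot_loss[...]` summaries from train.py report the current batch and a frames-weighted running total, respectively. In every summary in this section the printed loss equals 0.5 * simple_loss + pruned_loss, i.e. a pruned-transducer objective whose simple (trivial-joiner) term is halved; the 0.5 factor is inferred from the logged numbers, not quoted from the recipe:

```python
# Hedged reading of the train.py summaries: the printed loss matches
# 0.5 * simple_loss + pruned_loss. The 0.5 scale is inferred from the
# logged numbers themselves, not quoted from the recipe.
def combined_loss(simple_loss: float, pruned_loss: float,
                  simple_scale: float = 0.5) -> float:
    return simple_scale * simple_loss + pruned_loss

# Epoch 21, batch 10000 above: 0.5 * 0.2814 + 0.04449 = 0.18519 ~ 0.1852
assert abs(combined_loss(0.2814, 0.04449) - 0.1852) < 5e-4
# and for its tot_loss: 0.5 * 0.2809 + 0.04884 = 0.18929 ~ 0.1893
assert abs(combined_loss(0.2809, 0.04884) - 0.1893) < 5e-4
```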
2023-10-13 08:04:46,169 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1321231.3333333333, ans=0.125 2023-10-13 08:05:01,278 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1321278.0, ans=0.0 2023-10-13 08:05:14,604 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1321324.6666666667, ans=0.125 2023-10-13 08:05:21,746 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1321371.3333333333, ans=0.0 2023-10-13 08:05:28,348 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1321418.0, ans=0.2 2023-10-13 08:05:44,404 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.582e+02 1.900e+02 2.109e+02 2.308e+02 3.153e+02, threshold=4.218e+02, percent-clipped=0.0 2023-10-13 08:06:08,953 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 08:06:31,914 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1321651.3333333333, ans=0.125 2023-10-13 08:06:44,635 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1321698.0, ans=0.0 2023-10-13 08:06:49,505 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1321698.0, ans=0.0 2023-10-13 08:06:54,421 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1321744.6666666667, ans=0.125 2023-10-13 08:06:57,822 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 08:07:02,379 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1321791.3333333333, ans=0.0 2023-10-13 08:07:07,721 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1321791.3333333333, ans=0.0 2023-10-13 08:07:07,890 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.96 vs. limit=10.0 2023-10-13 08:07:09,841 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.24 vs. limit=22.5 2023-10-13 08:07:09,942 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.95 vs. limit=10.0 2023-10-13 08:07:37,438 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.458e+02 1.771e+02 1.900e+02 2.072e+02 2.870e+02, threshold=3.801e+02, percent-clipped=0.0 2023-10-13 08:07:44,211 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1321931.3333333333, ans=0.1 2023-10-13 08:07:48,230 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1321978.0, ans=0.2 2023-10-13 08:08:02,152 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.40 vs.
limit=12.0 2023-10-13 08:08:23,777 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1322118.0, ans=0.125 2023-10-13 08:09:09,471 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1322304.6666666667, ans=0.125 2023-10-13 08:09:12,402 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1322304.6666666667, ans=0.125 2023-10-13 08:09:17,124 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1322304.6666666667, ans=0.0 2023-10-13 08:09:17,167 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1322304.6666666667, ans=0.125 2023-10-13 08:09:35,474 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.818e+02 1.973e+02 2.168e+02 2.968e+02, threshold=3.946e+02, percent-clipped=0.0 2023-10-13 08:09:47,304 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1322444.6666666667, ans=0.025 2023-10-13 08:10:13,673 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1322538.0, ans=0.015 2023-10-13 08:10:23,842 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.89 vs. limit=12.0 2023-10-13 08:10:25,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1322584.6666666667, ans=0.0 2023-10-13 08:10:32,035 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1322631.3333333333, ans=0.125 2023-10-13 08:10:34,615 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1322631.3333333333, ans=0.0 2023-10-13 08:10:34,665 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1322631.3333333333, ans=0.125 2023-10-13 08:10:39,944 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1322631.3333333333, ans=0.1 2023-10-13 08:10:54,130 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.29 vs. 
limit=15.0 2023-10-13 08:11:26,526 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1322864.6666666667, ans=0.125 2023-10-13 08:11:29,005 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1322864.6666666667, ans=0.1 2023-10-13 08:11:30,832 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.458e+02 1.735e+02 1.843e+02 2.021e+02 2.917e+02, threshold=3.686e+02, percent-clipped=0.0 2023-10-13 08:11:42,759 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1322911.3333333333, ans=0.0 2023-10-13 08:11:46,843 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1322911.3333333333, ans=0.125 2023-10-13 08:11:50,915 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1322958.0, ans=0.125 2023-10-13 08:11:56,140 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.37 vs. limit=15.0 2023-10-13 08:11:59,390 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.30 vs. limit=10.0 2023-10-13 08:12:01,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1322958.0, ans=10.0 2023-10-13 08:12:02,007 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1322958.0, ans=0.125 2023-10-13 08:12:08,954 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.49 vs. limit=15.0 2023-10-13 08:12:29,122 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 08:12:29,277 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1323098.0, ans=0.125 2023-10-13 08:12:34,586 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1323098.0, ans=0.125 2023-10-13 08:12:34,722 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1323098.0, ans=0.125 2023-10-13 08:12:54,781 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1323191.3333333333, ans=0.0 2023-10-13 08:13:06,655 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.33 vs. 
limit=15.0 2023-10-13 08:13:12,438 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1323238.0, ans=0.0 2023-10-13 08:13:30,349 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.714e+02 1.860e+02 2.070e+02 2.864e+02, threshold=3.719e+02, percent-clipped=0.0 2023-10-13 08:13:35,764 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1323331.3333333333, ans=0.125 2023-10-13 08:13:36,115 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.23 vs. limit=22.5 2023-10-13 08:13:46,270 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-13 08:13:58,690 INFO [train.py:1031] (0/4) Epoch 21, batch 10500, loss[loss=0.1852, simple_loss=0.2733, pruned_loss=0.04856, over 16630.00 frames. ], tot_loss[loss=0.1898, simple_loss=0.2815, pruned_loss=0.04907, over 32632493.79 frames. ], batch size: 56, lr: 1.64e-03, grad_scale: 16.0 2023-10-13 08:14:10,214 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1323518.0, ans=0.0 2023-10-13 08:14:17,856 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1323564.6666666667, ans=0.1 2023-10-13 08:14:25,359 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1323564.6666666667, ans=0.0 2023-10-13 08:14:25,708 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.99 vs. limit=10.0 2023-10-13 08:14:26,782 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1323564.6666666667, ans=0.125 2023-10-13 08:14:26,829 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1323564.6666666667, ans=0.125 2023-10-13 08:14:40,086 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1323658.0, ans=0.0 2023-10-13 08:14:52,681 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1323704.6666666667, ans=0.125 2023-10-13 08:15:11,822 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1323751.3333333333, ans=0.125 2023-10-13 08:15:21,976 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.524e+02 1.744e+02 1.870e+02 2.068e+02 3.196e+02, threshold=3.740e+02, percent-clipped=0.0 2023-10-13 08:15:26,735 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=6.96 vs. 
limit=15.0 2023-10-13 08:15:42,497 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1323891.3333333333, ans=0.95 2023-10-13 08:16:01,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1323938.0, ans=0.0 2023-10-13 08:16:15,723 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1323984.6666666667, ans=0.1 2023-10-13 08:16:26,199 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.01 vs. limit=22.5 2023-10-13 08:16:39,025 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1324078.0, ans=0.0 2023-10-13 08:16:48,733 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=1324124.6666666667, ans=0.025 2023-10-13 08:16:55,169 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1324171.3333333333, ans=0.125 2023-10-13 08:16:55,904 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1324171.3333333333, ans=0.0 2023-10-13 08:16:56,037 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1324171.3333333333, ans=0.025 2023-10-13 08:17:06,468 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1324218.0, ans=0.125 2023-10-13 08:17:10,436 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1324218.0, ans=0.125 2023-10-13 08:17:18,807 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.464e+02 1.811e+02 1.996e+02 2.293e+02 3.645e+02, threshold=3.993e+02, percent-clipped=0.0 2023-10-13 08:17:28,232 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1324311.3333333333, ans=0.125 2023-10-13 08:17:41,219 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1324358.0, ans=0.2 2023-10-13 08:18:11,015 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1324451.3333333333, ans=0.1 2023-10-13 08:18:12,364 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1324451.3333333333, ans=0.125 2023-10-13 08:18:29,468 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1324544.6666666667, ans=0.0 2023-10-13 08:18:29,601 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1324544.6666666667, ans=0.1 2023-10-13 08:18:34,026 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1324544.6666666667, ans=0.125 2023-10-13 08:19:16,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1324731.3333333333, ans=0.2 2023-10-13 
08:19:20,061 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.774e+02 1.966e+02 2.176e+02 2.762e+02, threshold=3.932e+02, percent-clipped=0.0 2023-10-13 08:19:20,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1324731.3333333333, ans=0.2 2023-10-13 08:19:21,377 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1324731.3333333333, ans=0.0 2023-10-13 08:19:23,062 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 08:19:30,472 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1324778.0, ans=22.5 2023-10-13 08:19:49,650 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1324871.3333333333, ans=0.1 2023-10-13 08:20:08,530 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1324964.6666666667, ans=0.125 2023-10-13 08:20:10,564 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1324964.6666666667, ans=0.025 2023-10-13 08:20:15,913 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1324964.6666666667, ans=0.125 2023-10-13 08:20:36,268 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1325058.0, ans=0.125 2023-10-13 08:20:51,965 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1325151.3333333333, ans=0.2 2023-10-13 08:20:56,169 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.66 vs. limit=15.0 2023-10-13 08:21:03,099 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.52 vs. limit=15.0 2023-10-13 08:21:13,400 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.752e+02 1.939e+02 2.140e+02 2.799e+02, threshold=3.877e+02, percent-clipped=0.0 2023-10-13 08:21:15,192 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.89 vs. limit=15.0 2023-10-13 08:21:21,574 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1325244.6666666667, ans=0.1 2023-10-13 08:21:24,902 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1325244.6666666667, ans=0.1 2023-10-13 08:21:30,920 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1325291.3333333333, ans=0.1 2023-10-13 08:21:41,426 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.87 vs. limit=6.0 2023-10-13 08:21:53,375 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.92 vs. 
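limit=15.0

The Whitening entries (such as metric=11.92 vs. limit=15.0 just above) compare a per-module whiteness statistic of the activation covariance against a module-specific limit; the constraint only intervenes when the metric exceeds the limit, which is why most entries here sit below it. One plausible form of such a metric, sketched here rather than copied from scaling.py, is 1.0 when the covariance is proportional to the identity and grows as a few directions dominate:

```python
# Sketch of a whiteness statistic in the spirit of scaling.py's diagnostic:
# mean squared covariance entry divided by the squared mean variance. It is
# 1.0 for covariance proportional to the identity and grows with anisotropy.
# This is an assumed form, not a verbatim copy of the module.
import torch

def whitening_metric(x: torch.Tensor) -> float:
    # x: (num_frames, num_channels), treated as a single group
    x = x - x.mean(dim=0)
    cov = (x.t() @ x) / x.shape[0]
    diag_mean = cov.diagonal().mean()
    sq_mean = (cov ** 2).sum() / cov.shape[0]
    return (sq_mean / (diag_mean ** 2 + 1e-20)).item()

x = torch.randn(2000, 384)                    # near-white activations
print(whitening_metric(x))                    # close to 1
print(whitening_metric(x * torch.linspace(0.1, 3.0, 384)))  # clearly larger
```

Note that the limits vary by module (6.0 for the attention whiten_keys entries, 22.5 for the self_attn whitening, 15.0 for most out_whiten modules above), so the same statistic is policed more or less tightly depending on where it is measured.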
2023-10-13 08:22:00,214 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1325384.6666666667, ans=0.125 2023-10-13 08:22:03,652 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1325431.3333333333, ans=0.0 2023-10-13 08:22:16,675 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.12 vs. limit=15.0 2023-10-13 08:22:30,416 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1325524.6666666667, ans=0.0 2023-10-13 08:22:30,454 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1325524.6666666667, ans=0.125 2023-10-13 08:22:33,601 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.93 vs. limit=15.0 2023-10-13 08:22:52,115 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1325618.0, ans=0.0 2023-10-13 08:22:57,450 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.66 vs. limit=10.0 2023-10-13 08:23:00,882 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.738e+02 1.871e+02 2.120e+02 2.926e+02, threshold=3.742e+02, percent-clipped=0.0 2023-10-13 08:23:16,124 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1325711.3333333333, ans=0.1 2023-10-13 08:23:21,781 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1325758.0, ans=0.5 2023-10-13 08:23:21,941 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1325758.0, ans=0.04949747468305833 2023-10-13 08:23:28,983 INFO [train.py:1031] (0/4) Epoch 21, batch 11000, loss[loss=0.1968, simple_loss=0.2862, pruned_loss=0.05372, over 15927.00 frames. ], tot_loss[loss=0.1898, simple_loss=0.2814, pruned_loss=0.04914, over 32625972.30 frames. ], batch size: 43, lr: 1.63e-03, grad_scale: 16.0 2023-10-13 08:23:30,156 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1325804.6666666667, ans=0.125 2023-10-13 08:23:48,024 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1325851.3333333333, ans=0.09899494936611666 2023-10-13 08:23:58,695 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.10 vs. limit=12.0 2023-10-13 08:24:20,471 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.04 vs.
limit=15.0 2023-10-13 08:24:28,314 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1326038.0, ans=0.125 2023-10-13 08:24:54,881 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.573e+02 1.861e+02 1.969e+02 2.170e+02 2.905e+02, threshold=3.937e+02, percent-clipped=0.0 2023-10-13 08:25:03,729 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1326178.0, ans=0.125 2023-10-13 08:25:50,399 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1326364.6666666667, ans=0.2 2023-10-13 08:25:51,526 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.81 vs. limit=6.0 2023-10-13 08:25:56,650 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1326364.6666666667, ans=0.125 2023-10-13 08:25:57,577 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1326364.6666666667, ans=0.125 2023-10-13 08:26:15,329 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1326458.0, ans=0.025 2023-10-13 08:26:26,960 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1326504.6666666667, ans=0.1 2023-10-13 08:26:27,853 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1326504.6666666667, ans=0.025 2023-10-13 08:26:43,579 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1326551.3333333333, ans=0.0 2023-10-13 08:26:56,434 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.704e+02 1.879e+02 2.084e+02 3.045e+02, threshold=3.759e+02, percent-clipped=0.0 2023-10-13 08:27:15,010 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1326691.3333333333, ans=0.0 2023-10-13 08:27:16,481 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1326691.3333333333, ans=0.125 2023-10-13 08:27:19,103 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1326691.3333333333, ans=0.0 2023-10-13 08:27:34,338 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1326784.6666666667, ans=0.2 2023-10-13 08:27:36,134 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1326784.6666666667, ans=0.1 2023-10-13 08:27:39,232 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1326784.6666666667, ans=0.125 2023-10-13 08:27:52,247 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1326831.3333333333, ans=0.0 2023-10-13 08:28:18,004 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1326971.3333333333, ans=0.125 2023-10-13 08:28:37,332 INFO [scaling.py:1069] (0/4) WithLoss: 
name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 08:28:51,086 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.532e+02 1.766e+02 1.894e+02 2.103e+02 2.601e+02, threshold=3.788e+02, percent-clipped=0.0 2023-10-13 08:28:58,127 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1327111.3333333333, ans=0.125 2023-10-13 08:29:04,460 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.36 vs. limit=22.5 2023-10-13 08:29:16,615 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1327158.0, ans=0.125 2023-10-13 08:29:17,723 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.12 vs. limit=12.0 2023-10-13 08:29:21,496 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1327204.6666666667, ans=0.125 2023-10-13 08:29:26,078 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1327204.6666666667, ans=0.125 2023-10-13 08:29:34,108 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1327251.3333333333, ans=0.0 2023-10-13 08:29:40,141 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1327251.3333333333, ans=0.0 2023-10-13 08:29:45,879 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1327298.0, ans=0.035 2023-10-13 08:29:49,838 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1327298.0, ans=0.125 2023-10-13 08:30:20,334 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1327438.0, ans=0.0 2023-10-13 08:30:40,255 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1327484.6666666667, ans=0.125 2023-10-13 08:30:48,791 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.759e+02 1.910e+02 2.121e+02 3.702e+02, threshold=3.821e+02, percent-clipped=0.0 2023-10-13 08:31:00,675 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1327578.0, ans=0.0 2023-10-13 08:31:16,918 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1327671.3333333333, ans=0.125 2023-10-13 08:31:18,747 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1327671.3333333333, ans=0.2 2023-10-13 08:31:56,565 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.67 vs. limit=15.0 2023-10-13 08:32:13,859 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.62 vs. 
limit=15.0 2023-10-13 08:32:25,330 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1327951.3333333333, ans=0.0 2023-10-13 08:32:40,489 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1327998.0, ans=0.125 2023-10-13 08:32:42,719 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.500e+02 1.889e+02 2.027e+02 2.303e+02 3.111e+02, threshold=4.054e+02, percent-clipped=0.0 2023-10-13 08:32:46,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1328044.6666666667, ans=0.0 2023-10-13 08:32:54,543 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.14 vs. limit=15.0 2023-10-13 08:32:56,312 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1328044.6666666667, ans=0.2 2023-10-13 08:33:03,915 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1328091.3333333333, ans=0.0 2023-10-13 08:33:09,363 INFO [train.py:1031] (0/4) Epoch 21, batch 11500, loss[loss=0.1709, simple_loss=0.2666, pruned_loss=0.03761, over 16360.00 frames. ], tot_loss[loss=0.1896, simple_loss=0.2812, pruned_loss=0.04907, over 32644935.85 frames. ], batch size: 50, lr: 1.63e-03, grad_scale: 16.0 2023-10-13 08:33:12,400 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.71 vs. limit=15.0 2023-10-13 08:33:18,815 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.29 vs. limit=12.0 2023-10-13 08:33:19,942 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.34 vs. limit=15.0 2023-10-13 08:33:22,672 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.22 vs. limit=15.0 2023-10-13 08:33:34,266 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1328231.3333333333, ans=0.125 2023-10-13 08:33:45,788 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1328278.0, ans=0.0 2023-10-13 08:34:29,557 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1328418.0, ans=0.125 2023-10-13 08:34:31,343 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1328464.6666666667, ans=0.125 2023-10-13 08:34:40,382 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.405e+02 1.799e+02 1.897e+02 2.104e+02 2.824e+02, threshold=3.795e+02, percent-clipped=0.0 2023-10-13 08:34:43,668 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1328464.6666666667, ans=0.125 2023-10-13 08:34:56,622 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=5.71 vs. 
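limit=15.0

The optim.py lines record an adaptive gradient-clipping scheme: the five numbers after "grad-norm quartiles" read as the min, 25%, median, 75% and max of recent gradient norms, and in every entry here the clipping threshold equals Clipping_scale times the logged median (for the entry above, 2.0 * 2.027e+02 = 4.054e+02). A sketch of that bookkeeping, with an illustrative window size:

```python
# Sketch of median-based adaptive clipping matching the arithmetic of the
# optim.py entries (threshold = Clipping_scale * median of recent gradient
# norms). The window size and logging format here are illustrative.
import torch

class AdaptiveGradClipper:
    def __init__(self, clipping_scale: float = 2.0, window: int = 128):
        self.scale, self.window, self.norms = clipping_scale, window, []

    def clip_(self, model: torch.nn.Module) -> None:
        grads = [p.grad for p in model.parameters() if p.grad is not None]
        norm = float(sum((g ** 2).sum() for g in grads) ** 0.5)
        self.norms = (self.norms + [norm])[-self.window:]
        q = torch.quantile(torch.tensor(self.norms),
                           torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = self.scale * float(q[2])   # scale times the median
        if norm > threshold:                   # rescale gradients in place
            for g in grads:
                g.mul_(threshold / norm)
        print(f"grad-norm quartiles {q.tolist()}, threshold={threshold:.3e}")
```

The consistently reported percent-clipped=0.0 then just says that no recent batch has exceeded twice the running median.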
2023-10-13 08:34:57,593 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.88 vs. limit=15.0 2023-10-13 08:35:01,462 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1328558.0, ans=0.025 2023-10-13 08:35:14,659 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1328604.6666666667, ans=0.0 2023-10-13 08:35:16,547 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1328604.6666666667, ans=0.2 2023-10-13 08:35:18,645 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.48 vs. limit=15.0 2023-10-13 08:35:46,157 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1328744.6666666667, ans=0.0 2023-10-13 08:36:09,269 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.91 vs. limit=15.0 2023-10-13 08:36:16,354 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1328884.6666666667, ans=0.07 2023-10-13 08:36:18,184 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1328884.6666666667, ans=0.1 2023-10-13 08:36:19,062 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1328884.6666666667, ans=0.0 2023-10-13 08:36:34,597 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 1.707e+02 1.849e+02 2.120e+02 2.681e+02, threshold=3.698e+02, percent-clipped=0.0 2023-10-13 08:37:27,890 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.64 vs.
limit=8.0 2023-10-13 08:37:29,133 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1329164.6666666667, ans=0.1 2023-10-13 08:37:49,851 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1329258.0, ans=10.0 2023-10-13 08:37:58,635 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1329304.6666666667, ans=0.1 2023-10-13 08:38:03,388 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1329304.6666666667, ans=0.2 2023-10-13 08:38:07,693 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1329304.6666666667, ans=0.125 2023-10-13 08:38:17,613 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1329351.3333333333, ans=0.0 2023-10-13 08:38:23,500 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1329351.3333333333, ans=0.125 2023-10-13 08:38:35,672 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.768e+02 1.933e+02 2.170e+02 3.062e+02, threshold=3.865e+02, percent-clipped=0.0 2023-10-13 08:38:43,409 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 08:38:57,069 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1329491.3333333333, ans=0.2 2023-10-13 08:39:12,686 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1329584.6666666667, ans=0.0 2023-10-13 08:39:23,119 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1329584.6666666667, ans=0.125 2023-10-13 08:39:25,843 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1329631.3333333333, ans=0.125 2023-10-13 08:39:38,141 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1329678.0, ans=0.1 2023-10-13 08:39:55,309 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1329724.6666666667, ans=0.0 2023-10-13 08:40:13,480 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=3.45 vs. limit=15.0 2023-10-13 08:40:29,230 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.282e+02 1.718e+02 1.898e+02 2.079e+02 2.900e+02, threshold=3.797e+02, percent-clipped=0.0 2023-10-13 08:40:31,426 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.50 vs. 
limit=22.5 2023-10-13 08:40:44,716 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1329911.3333333333, ans=0.0 2023-10-13 08:40:48,865 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1329958.0, ans=0.125 2023-10-13 08:41:06,446 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1330004.6666666667, ans=0.0 2023-10-13 08:41:27,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1330098.0, ans=0.1 2023-10-13 08:41:27,509 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1330098.0, ans=0.125 2023-10-13 08:41:34,328 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1330144.6666666667, ans=0.0 2023-10-13 08:41:45,973 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 08:41:46,028 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1330191.3333333333, ans=0.125 2023-10-13 08:42:01,038 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1330238.0, ans=0.05 2023-10-13 08:42:21,327 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1330331.3333333333, ans=0.2 2023-10-13 08:42:24,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1330331.3333333333, ans=0.2 2023-10-13 08:42:28,264 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.451e+02 1.776e+02 1.933e+02 2.227e+02 3.065e+02, threshold=3.867e+02, percent-clipped=0.0 2023-10-13 08:42:39,976 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1330378.0, ans=0.0 2023-10-13 08:42:44,190 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.80 vs. limit=10.0 2023-10-13 08:42:54,990 INFO [train.py:1031] (0/4) Epoch 21, batch 12000, loss[loss=0.1941, simple_loss=0.2926, pruned_loss=0.04783, over 16995.00 frames. ], tot_loss[loss=0.1895, simple_loss=0.2813, pruned_loss=0.04884, over 32689930.30 frames. 
], batch size: 117, lr: 1.63e-03, grad_scale: 32.0 2023-10-13 08:43:00,895 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 08:43:28,695 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1330564.6666666667, ans=0.125 2023-10-13 08:43:31,747 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1330611.3333333333, ans=0.0 2023-10-13 08:43:31,903 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1330611.3333333333, ans=0.125 2023-10-13 08:43:53,926 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1330704.6666666667, ans=0.0 2023-10-13 08:44:09,152 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1330751.3333333333, ans=0.0 2023-10-13 08:44:11,106 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.24 vs. limit=15.0 2023-10-13 08:44:23,855 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1330798.0, ans=0.125 2023-10-13 08:44:24,381 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.675e+02 1.778e+02 1.966e+02 2.595e+02, threshold=3.555e+02, percent-clipped=0.0 2023-10-13 08:44:26,526 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1330844.6666666667, ans=0.0 2023-10-13 08:44:31,506 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1330844.6666666667, ans=0.04949747468305833 2023-10-13 08:45:01,894 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1330984.6666666667, ans=0.125 2023-10-13 08:45:02,924 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1330984.6666666667, ans=0.125 2023-10-13 08:45:13,003 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.20 vs. limit=15.0 2023-10-13 08:45:32,822 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.32 vs. limit=22.5 2023-10-13 08:45:53,360 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1331218.0, ans=0.1 2023-10-13 08:46:08,728 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1331264.6666666667, ans=0.125 2023-10-13 08:46:10,830 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 1.771e+02 1.958e+02 2.123e+02 2.822e+02, threshold=3.917e+02, percent-clipped=0.0 2023-10-13 08:46:18,623 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1331311.3333333333, ans=0.125 2023-10-13 08:46:23,152 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.33 vs. 
limit=10.0 2023-10-13 08:46:26,874 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1331358.0, ans=0.0 2023-10-13 08:46:35,692 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1331404.6666666667, ans=0.1 2023-10-13 08:46:41,282 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1331404.6666666667, ans=0.125 2023-10-13 08:46:52,824 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1331451.3333333333, ans=0.125 2023-10-13 08:46:55,672 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1331498.0, ans=0.05 2023-10-13 08:47:10,744 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1331544.6666666667, ans=0.125 2023-10-13 08:47:36,554 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1331638.0, ans=0.04949747468305833 2023-10-13 08:47:40,734 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.15 vs. limit=22.5 2023-10-13 08:47:58,571 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1331731.3333333333, ans=0.125 2023-10-13 08:48:00,478 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.935e+02 2.218e+02 2.487e+02 3.290e+02, threshold=4.437e+02, percent-clipped=0.0 2023-10-13 08:48:06,865 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1331778.0, ans=0.125 2023-10-13 08:48:09,553 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1331778.0, ans=0.0 2023-10-13 08:48:11,266 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.71 vs. limit=15.0 2023-10-13 08:48:12,142 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1331778.0, ans=0.09899494936611666 2023-10-13 08:48:19,518 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1331824.6666666667, ans=0.2 2023-10-13 08:48:31,208 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.75 vs. limit=15.0 2023-10-13 08:48:37,901 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1331918.0, ans=0.2 2023-10-13 08:49:06,625 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.27 vs. 
limit=6.0 2023-10-13 08:49:09,564 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1332058.0, ans=0.1 2023-10-13 08:49:16,122 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1332058.0, ans=0.125 2023-10-13 08:49:20,373 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1332104.6666666667, ans=0.125 2023-10-13 08:49:26,128 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1332104.6666666667, ans=0.0 2023-10-13 08:49:27,011 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1332104.6666666667, ans=0.125 2023-10-13 08:49:27,218 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.15 vs. limit=22.5 2023-10-13 08:49:30,611 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1332104.6666666667, ans=0.2 2023-10-13 08:49:40,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1332151.3333333333, ans=0.125 2023-10-13 08:49:43,331 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1332151.3333333333, ans=0.0 2023-10-13 08:49:55,450 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.508e+02 1.775e+02 1.948e+02 2.123e+02 2.809e+02, threshold=3.897e+02, percent-clipped=0.0 2023-10-13 08:49:55,713 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1332198.0, ans=0.125 2023-10-13 08:50:05,309 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1332244.6666666667, ans=0.2 2023-10-13 08:50:08,563 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.80 vs. limit=15.0 2023-10-13 08:50:12,521 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1332291.3333333333, ans=0.125 2023-10-13 08:50:37,923 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.33 vs. limit=15.0 2023-10-13 08:50:52,601 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.63 vs. limit=12.0 2023-10-13 08:51:28,650 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=11.21 vs. limit=12.0 2023-10-13 08:51:30,210 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.79 vs. 
limit=15.0 2023-10-13 08:51:39,483 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1332664.6666666667, ans=0.2 2023-10-13 08:51:48,034 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.801e+02 1.924e+02 2.122e+02 2.835e+02, threshold=3.848e+02, percent-clipped=0.0 2023-10-13 08:51:54,755 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1332711.3333333333, ans=10.0 2023-10-13 08:52:11,244 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1332758.0, ans=0.0 2023-10-13 08:52:13,466 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1332758.0, ans=0.125 2023-10-13 08:52:15,102 INFO [train.py:1031] (0/4) Epoch 21, batch 12500, loss[loss=0.1704, simple_loss=0.2666, pruned_loss=0.03706, over 15489.00 frames. ], tot_loss[loss=0.1892, simple_loss=0.2808, pruned_loss=0.04875, over 32703367.43 frames. ], batch size: 35, lr: 1.63e-03, grad_scale: 32.0 2023-10-13 08:52:22,410 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1332804.6666666667, ans=0.125 2023-10-13 08:52:34,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1332851.3333333333, ans=0.0 2023-10-13 08:52:57,843 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_ff2.min_abs, batch_count=1332991.3333333333, ans=0.1 2023-10-13 08:53:13,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1333038.0, ans=0.0 2023-10-13 08:53:18,055 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1333084.6666666667, ans=0.1 2023-10-13 08:53:38,101 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.380e+02 1.728e+02 1.904e+02 2.146e+02 2.661e+02, threshold=3.808e+02, percent-clipped=0.0 2023-10-13 08:53:48,942 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1333178.0, ans=0.125 2023-10-13 08:53:54,641 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.44 vs. limit=10.0 2023-10-13 08:54:06,770 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=14.53 vs. limit=15.0 2023-10-13 08:54:12,825 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. 
limit=6.0 2023-10-13 08:54:14,403 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1333318.0, ans=0.125 2023-10-13 08:54:27,783 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1333364.6666666667, ans=0.05 2023-10-13 08:54:43,381 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1333411.3333333333, ans=0.125 2023-10-13 08:54:53,914 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1333458.0, ans=0.125 2023-10-13 08:55:09,016 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1333504.6666666667, ans=0.125 2023-10-13 08:55:26,926 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1333598.0, ans=0.0 2023-10-13 08:55:26,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1333598.0, ans=10.0 2023-10-13 08:55:33,927 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.744e+02 1.975e+02 2.227e+02 3.194e+02, threshold=3.951e+02, percent-clipped=0.0 2023-10-13 08:55:37,085 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1333644.6666666667, ans=0.0 2023-10-13 08:55:38,080 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1333644.6666666667, ans=0.125 2023-10-13 08:55:38,137 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1333644.6666666667, ans=0.1 2023-10-13 08:56:15,272 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 08:56:19,167 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1333831.3333333333, ans=0.125 2023-10-13 08:56:29,842 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1333878.0, ans=0.0 2023-10-13 08:56:42,696 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.52 vs. limit=15.0 2023-10-13 08:56:42,898 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.14 vs. limit=22.5 2023-10-13 08:56:58,440 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1333971.3333333333, ans=0.2 2023-10-13 08:57:00,642 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1333971.3333333333, ans=0.125 2023-10-13 08:57:00,807 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.32 vs. 
limit=22.5 2023-10-13 08:57:14,937 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1334064.6666666667, ans=0.09899494936611666 2023-10-13 08:57:22,655 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.792e+02 1.940e+02 2.189e+02 3.673e+02, threshold=3.881e+02, percent-clipped=0.0 2023-10-13 08:57:44,442 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.01 vs. limit=22.5 2023-10-13 08:58:00,278 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1334251.3333333333, ans=0.0 2023-10-13 08:58:04,065 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=1334251.3333333333, ans=0.5 2023-10-13 08:58:05,612 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1334298.0, ans=0.1 2023-10-13 08:58:05,714 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1334298.0, ans=0.1 2023-10-13 08:58:07,672 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1334298.0, ans=0.2 2023-10-13 08:58:17,083 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1334344.6666666667, ans=0.0 2023-10-13 08:58:22,758 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1334344.6666666667, ans=0.1 2023-10-13 08:58:39,242 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.99 vs. limit=15.0 2023-10-13 08:58:51,623 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1334438.0, ans=0.125 2023-10-13 08:58:54,788 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.69 vs. limit=15.0 2023-10-13 08:59:15,462 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.386e+02 1.772e+02 1.938e+02 2.134e+02 3.184e+02, threshold=3.875e+02, percent-clipped=0.0 2023-10-13 08:59:23,339 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1334578.0, ans=0.0 2023-10-13 08:59:33,146 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1334624.6666666667, ans=0.125 2023-10-13 08:59:33,251 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1334624.6666666667, ans=0.0 2023-10-13 08:59:42,026 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.74 vs. 
limit=15.0 2023-10-13 08:59:46,043 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1334671.3333333333, ans=0.0 2023-10-13 09:00:13,344 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1334811.3333333333, ans=0.0 2023-10-13 09:00:14,537 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.81 vs. limit=15.0 2023-10-13 09:00:46,283 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.67 vs. limit=15.0 2023-10-13 09:00:50,214 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1334951.3333333333, ans=0.2 2023-10-13 09:00:51,172 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1334951.3333333333, ans=0.1 2023-10-13 09:00:59,604 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1334998.0, ans=0.2 2023-10-13 09:01:03,057 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1334998.0, ans=0.125 2023-10-13 09:01:05,629 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.725e+02 1.911e+02 2.174e+02 2.756e+02, threshold=3.821e+02, percent-clipped=0.0 2023-10-13 09:01:29,496 INFO [train.py:1031] (0/4) Epoch 21, batch 13000, loss[loss=0.1912, simple_loss=0.2861, pruned_loss=0.04816, over 16621.00 frames. ], tot_loss[loss=0.1895, simple_loss=0.2815, pruned_loss=0.04876, over 32736712.31 frames. ], batch size: 66, lr: 1.63e-03, grad_scale: 16.0 2023-10-13 09:02:12,217 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1335278.0, ans=0.09899494936611666 2023-10-13 09:02:33,863 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1335371.3333333333, ans=0.1 2023-10-13 09:02:54,715 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.62 vs. limit=15.0 2023-10-13 09:03:10,672 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.21 vs. limit=15.0 2023-10-13 09:03:10,935 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 1.775e+02 1.946e+02 2.256e+02 3.466e+02, threshold=3.893e+02, percent-clipped=0.0 2023-10-13 09:03:12,277 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1335511.3333333333, ans=0.2 2023-10-13 09:03:22,848 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.66 vs. 
limit=10.0 2023-10-13 09:03:26,611 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1335558.0, ans=0.025 2023-10-13 09:03:28,590 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1335558.0, ans=0.2 2023-10-13 09:03:37,250 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1335604.6666666667, ans=0.0 2023-10-13 09:03:42,188 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1335604.6666666667, ans=0.125 2023-10-13 09:03:49,888 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1335651.3333333333, ans=0.0 2023-10-13 09:03:55,342 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1335698.0, ans=0.0 2023-10-13 09:03:56,266 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1335698.0, ans=0.0 2023-10-13 09:04:06,004 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1335744.6666666667, ans=0.125 2023-10-13 09:04:07,489 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=5.68 vs. limit=15.0 2023-10-13 09:04:11,843 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1335744.6666666667, ans=0.2 2023-10-13 09:04:14,621 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1335744.6666666667, ans=0.125 2023-10-13 09:04:16,672 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.37 vs. 
limit=10.0 2023-10-13 09:04:38,980 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1335838.0, ans=0.2 2023-10-13 09:04:51,099 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1335931.3333333333, ans=0.125 2023-10-13 09:05:04,083 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.559e+02 1.782e+02 1.898e+02 2.182e+02 3.086e+02, threshold=3.796e+02, percent-clipped=0.0 2023-10-13 09:05:31,568 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1336071.3333333333, ans=0.1 2023-10-13 09:05:59,113 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1336164.6666666667, ans=0.05 2023-10-13 09:06:07,006 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1336211.3333333333, ans=0.125 2023-10-13 09:06:12,911 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1336258.0, ans=15.0 2023-10-13 09:06:21,002 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1336258.0, ans=0.125 2023-10-13 09:06:43,937 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1336351.3333333333, ans=0.125 2023-10-13 09:06:45,241 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1336351.3333333333, ans=0.125 2023-10-13 09:06:57,571 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.436e+02 1.722e+02 1.970e+02 2.159e+02 2.816e+02, threshold=3.940e+02, percent-clipped=0.0 2023-10-13 09:07:23,086 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1336538.0, ans=0.125 2023-10-13 09:07:25,591 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1336538.0, ans=0.2 2023-10-13 09:07:45,754 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.60 vs. limit=10.0 2023-10-13 09:07:49,226 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1336631.3333333333, ans=0.1 2023-10-13 09:08:04,200 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1336724.6666666667, ans=0.0 2023-10-13 09:08:19,193 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.27 vs. limit=15.0 2023-10-13 09:08:26,514 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.01 vs. limit=22.5 2023-10-13 09:08:43,573 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.18 vs. 
limit=15.0 2023-10-13 09:08:44,446 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1336864.6666666667, ans=0.09899494936611666 2023-10-13 09:08:46,521 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.502e+02 1.782e+02 1.955e+02 2.143e+02 3.027e+02, threshold=3.910e+02, percent-clipped=0.0 2023-10-13 09:09:04,025 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1336958.0, ans=0.1 2023-10-13 09:09:20,133 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1337004.6666666667, ans=0.0 2023-10-13 09:09:25,432 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.19 vs. limit=15.0 2023-10-13 09:09:26,334 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1337051.3333333333, ans=0.1 2023-10-13 09:09:31,600 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.88 vs. limit=15.0 2023-10-13 09:10:00,968 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1337191.3333333333, ans=0.0 2023-10-13 09:10:27,631 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 09:10:41,055 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.507e+02 1.786e+02 1.935e+02 2.197e+02 6.537e+02, threshold=3.870e+02, percent-clipped=1.0 2023-10-13 09:10:44,050 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1337378.0, ans=0.125 2023-10-13 09:11:02,425 INFO [train.py:1031] (0/4) Epoch 21, batch 13500, loss[loss=0.1825, simple_loss=0.2733, pruned_loss=0.04591, over 17030.00 frames. ], tot_loss[loss=0.1887, simple_loss=0.2806, pruned_loss=0.04837, over 32759853.56 frames. ], batch size: 82, lr: 1.63e-03, grad_scale: 16.0 2023-10-13 09:11:11,181 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1337471.3333333333, ans=0.2 2023-10-13 09:11:33,814 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1337564.6666666667, ans=0.0 2023-10-13 09:12:02,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1337704.6666666667, ans=0.0 2023-10-13 09:12:14,590 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.83 vs. limit=6.0 2023-10-13 09:12:21,535 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.78 vs. 
limit=15.0 2023-10-13 09:12:34,414 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.483e+02 1.740e+02 1.957e+02 2.153e+02 3.011e+02, threshold=3.914e+02, percent-clipped=0.0 2023-10-13 09:12:38,365 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1337844.6666666667, ans=10.0 2023-10-13 09:12:39,754 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1337844.6666666667, ans=15.0 2023-10-13 09:13:17,342 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1338031.3333333333, ans=0.125 2023-10-13 09:13:19,898 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1338031.3333333333, ans=0.0 2023-10-13 09:13:24,244 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1338031.3333333333, ans=0.125 2023-10-13 09:13:39,473 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1338124.6666666667, ans=0.1 2023-10-13 09:13:45,346 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1338171.3333333333, ans=0.125 2023-10-13 09:13:48,736 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/epoch-21.pt 2023-10-13 09:14:20,696 INFO [train.py:1031] (0/4) Epoch 22, batch 0, loss[loss=0.1688, simple_loss=0.2632, pruned_loss=0.03718, over 16683.00 frames. ], tot_loss[loss=0.1688, simple_loss=0.2632, pruned_loss=0.03718, over 16683.00 frames. ], batch size: 220, lr: 1.59e-03, grad_scale: 32.0 2023-10-13 09:14:20,697 INFO [train.py:1054] (0/4) Computing validation loss 2023-10-13 09:14:28,990 INFO [train.py:1063] (0/4) Epoch 22, validation: loss=0.2133, simple_loss=0.3005, pruned_loss=0.06308, over 1020973.00 frames. 2023-10-13 09:14:28,990 INFO [train.py:1064] (0/4) Maximum memory allocated so far is 17165MB 2023-10-13 09:14:33,122 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1338194.6666666667, ans=0.125 2023-10-13 09:14:54,673 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.59 vs. 
limit=15.0 2023-10-13 09:14:56,096 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1338288.0, ans=0.1 2023-10-13 09:14:58,815 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.740e+02 1.941e+02 2.149e+02 4.129e+02, threshold=3.883e+02, percent-clipped=1.0 2023-10-13 09:15:08,656 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1338334.6666666667, ans=0.125 2023-10-13 09:15:08,751 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1338334.6666666667, ans=0.1 2023-10-13 09:15:18,371 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1338381.3333333333, ans=0.125 2023-10-13 09:15:26,621 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1338381.3333333333, ans=0.0 2023-10-13 09:16:01,988 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1338521.3333333333, ans=0.125 2023-10-13 09:16:33,317 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1338661.3333333333, ans=0.125 2023-10-13 09:16:56,168 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=1338754.6666666667, ans=0.02 2023-10-13 09:16:59,276 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.391e+02 1.741e+02 1.848e+02 2.031e+02 2.604e+02, threshold=3.696e+02, percent-clipped=0.0 2023-10-13 09:17:11,381 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.39 vs. limit=10.0 2023-10-13 09:17:26,875 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1338894.6666666667, ans=0.2 2023-10-13 09:17:26,967 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1338894.6666666667, ans=0.125 2023-10-13 09:17:38,288 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1338894.6666666667, ans=0.125 2023-10-13 09:18:03,519 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1339034.6666666667, ans=0.125 2023-10-13 09:18:16,211 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1339081.3333333333, ans=0.125 2023-10-13 09:18:24,235 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 09:18:31,298 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1339128.0, ans=0.0 2023-10-13 09:18:45,686 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.20 vs. 
limit=15.0 2023-10-13 09:18:46,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1339174.6666666667, ans=0.125 2023-10-13 09:18:55,119 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1339221.3333333333, ans=0.04949747468305833 2023-10-13 09:18:55,163 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1339221.3333333333, ans=0.125 2023-10-13 09:18:56,707 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.751e+02 1.884e+02 2.078e+02 2.746e+02, threshold=3.769e+02, percent-clipped=0.0 2023-10-13 09:18:58,832 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1339221.3333333333, ans=0.125 2023-10-13 09:19:02,819 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1339268.0, ans=0.0 2023-10-13 09:19:05,043 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.27 vs. limit=22.5 2023-10-13 09:19:19,688 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1339314.6666666667, ans=0.125 2023-10-13 09:19:31,418 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1339361.3333333333, ans=0.04949747468305833 2023-10-13 09:19:39,190 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.88 vs. limit=6.0 2023-10-13 09:19:52,154 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1339454.6666666667, ans=0.0 2023-10-13 09:20:44,438 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1339641.3333333333, ans=0.125 2023-10-13 09:20:45,911 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1339688.0, ans=0.2 2023-10-13 09:20:53,516 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.792e+02 1.962e+02 2.143e+02 3.064e+02, threshold=3.924e+02, percent-clipped=0.0 2023-10-13 09:20:54,333 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1339688.0, ans=0.125 2023-10-13 09:21:18,928 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1339828.0, ans=0.2 2023-10-13 09:21:19,838 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1339828.0, ans=0.1 2023-10-13 09:21:21,919 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 09:21:31,586 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1339874.6666666667, ans=0.1 2023-10-13 09:21:36,728 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.31 vs. 
limit=22.5 2023-10-13 09:21:46,222 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1339921.3333333333, ans=0.0 2023-10-13 09:21:48,469 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.07 vs. limit=15.0 2023-10-13 09:21:51,866 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1339968.0, ans=0.125 2023-10-13 09:21:57,266 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1339968.0, ans=0.0 2023-10-13 09:22:06,529 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.03 vs. limit=15.0 2023-10-13 09:22:08,407 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.04 vs. limit=15.0 2023-10-13 09:22:15,680 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1340061.3333333333, ans=0.125 2023-10-13 09:22:21,098 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.03 vs. limit=22.5 2023-10-13 09:22:23,595 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 09:22:37,799 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.20 vs. limit=15.0 2023-10-13 09:22:38,607 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1340154.6666666667, ans=0.125 2023-10-13 09:22:41,982 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1340154.6666666667, ans=0.07 2023-10-13 09:22:45,741 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.86 vs. limit=22.5 2023-10-13 09:22:46,197 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.438e+02 1.780e+02 1.950e+02 2.263e+02 2.888e+02, threshold=3.900e+02, percent-clipped=0.0 2023-10-13 09:22:59,720 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1340201.3333333333, ans=0.2 2023-10-13 09:23:09,358 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1340248.0, ans=0.07 2023-10-13 09:23:15,415 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1340294.6666666667, ans=0.2 2023-10-13 09:23:38,857 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.34 vs. limit=22.5 2023-10-13 09:23:40,878 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.83 vs. 
limit=6.0 2023-10-13 09:23:57,796 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1340434.6666666667, ans=0.2 2023-10-13 09:24:17,541 INFO [train.py:1031] (0/4) Epoch 22, batch 500, loss[loss=0.1936, simple_loss=0.2829, pruned_loss=0.05218, over 16609.00 frames. ], tot_loss[loss=0.1899, simple_loss=0.2812, pruned_loss=0.04929, over 7275669.26 frames. ], batch size: 66, lr: 1.59e-03, grad_scale: 32.0 2023-10-13 09:24:19,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1340528.0, ans=0.125 2023-10-13 09:24:19,555 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.19 vs. limit=15.0 2023-10-13 09:24:22,542 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1340528.0, ans=0.125 2023-10-13 09:24:27,788 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1340574.6666666667, ans=0.125 2023-10-13 09:24:28,699 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1340574.6666666667, ans=0.015 2023-10-13 09:24:30,708 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 09:24:48,368 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.454e+02 1.736e+02 1.964e+02 2.240e+02 2.918e+02, threshold=3.927e+02, percent-clipped=0.0 2023-10-13 09:24:53,506 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1340668.0, ans=0.125 2023-10-13 09:25:09,119 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1340714.6666666667, ans=0.0 2023-10-13 09:25:31,341 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.05 vs. limit=22.5 2023-10-13 09:25:37,843 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1340854.6666666667, ans=0.125 2023-10-13 09:26:13,672 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1340994.6666666667, ans=0.125 2023-10-13 09:26:13,850 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1340994.6666666667, ans=0.125 2023-10-13 09:26:27,087 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.11 vs. 
limit=15.0 2023-10-13 09:26:42,925 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.373e+02 1.790e+02 1.970e+02 2.216e+02 4.103e+02, threshold=3.940e+02, percent-clipped=1.0 2023-10-13 09:27:03,447 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1341181.3333333333, ans=0.5 2023-10-13 09:27:07,048 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1341181.3333333333, ans=0.125 2023-10-13 09:27:07,970 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1341181.3333333333, ans=0.1 2023-10-13 09:27:14,137 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1341228.0, ans=0.0 2023-10-13 09:27:16,240 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1341228.0, ans=0.2 2023-10-13 09:27:23,706 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.39 vs. limit=15.0 2023-10-13 09:28:14,624 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1341461.3333333333, ans=0.2 2023-10-13 09:28:25,749 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1341508.0, ans=0.125 2023-10-13 09:28:36,435 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.833e+02 1.988e+02 2.219e+02 3.076e+02, threshold=3.977e+02, percent-clipped=0.0 2023-10-13 09:28:37,798 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1341554.6666666667, ans=0.1 2023-10-13 09:28:52,435 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.45 vs. limit=22.5 2023-10-13 09:29:09,993 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1341694.6666666667, ans=0.125 2023-10-13 09:29:41,456 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.25 vs. limit=10.0 2023-10-13 09:29:47,226 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1341834.6666666667, ans=0.1 2023-10-13 09:29:48,073 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1341834.6666666667, ans=0.0 2023-10-13 09:29:54,973 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.33 vs. limit=15.0 2023-10-13 09:29:55,907 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.55 vs. 
limit=15.0 2023-10-13 09:30:11,753 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1341928.0, ans=0.0 2023-10-13 09:30:26,861 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1341974.6666666667, ans=0.2 2023-10-13 09:30:35,995 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1342021.3333333333, ans=0.0 2023-10-13 09:30:37,381 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1342021.3333333333, ans=0.125 2023-10-13 09:30:41,129 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.578e+02 1.806e+02 2.005e+02 2.271e+02 2.925e+02, threshold=4.010e+02, percent-clipped=0.0 2023-10-13 09:30:46,078 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1342068.0, ans=0.2 2023-10-13 09:30:48,011 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1342068.0, ans=0.1 2023-10-13 09:31:10,639 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1342114.6666666667, ans=0.07 2023-10-13 09:31:31,233 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.53 vs. limit=15.0 2023-10-13 09:31:37,976 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1342254.6666666667, ans=0.125 2023-10-13 09:31:40,646 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1342254.6666666667, ans=0.1 2023-10-13 09:31:45,654 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1342254.6666666667, ans=0.125 2023-10-13 09:32:06,063 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.88 vs. limit=15.0 2023-10-13 09:32:25,436 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1342441.3333333333, ans=0.1 2023-10-13 09:32:25,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1342441.3333333333, ans=0.125 2023-10-13 09:32:32,667 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1342441.3333333333, ans=0.1 2023-10-13 09:32:41,846 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1342488.0, ans=0.2 2023-10-13 09:32:45,614 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.46 vs. 
limit=15.0 2023-10-13 09:32:46,780 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.578e+02 1.836e+02 2.021e+02 2.191e+02 3.176e+02, threshold=4.042e+02, percent-clipped=0.0 2023-10-13 09:32:50,370 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 09:32:57,933 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=1342534.6666666667, ans=0.02 2023-10-13 09:33:08,226 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1342581.3333333333, ans=0.0 2023-10-13 09:33:17,404 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1342628.0, ans=0.2 2023-10-13 09:33:20,972 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1342628.0, ans=0.0 2023-10-13 09:33:33,067 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.05 vs. limit=15.0 2023-10-13 09:33:38,139 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1342674.6666666667, ans=0.125 2023-10-13 09:33:48,079 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1342721.3333333333, ans=0.125 2023-10-13 09:34:05,087 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1342814.6666666667, ans=0.125 2023-10-13 09:34:07,688 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1342814.6666666667, ans=0.0 2023-10-13 09:34:14,719 INFO [train.py:1031] (0/4) Epoch 22, batch 1000, loss[loss=0.19, simple_loss=0.2847, pruned_loss=0.04761, over 16917.00 frames. ], tot_loss[loss=0.1905, simple_loss=0.2818, pruned_loss=0.04962, over 12916335.97 frames. ], batch size: 138, lr: 1.58e-03, grad_scale: 16.0 2023-10-13 09:34:19,689 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1342861.3333333333, ans=0.0 2023-10-13 09:34:21,637 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1342861.3333333333, ans=0.0 2023-10-13 09:34:31,305 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1342908.0, ans=0.125 2023-10-13 09:34:37,128 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1342954.6666666667, ans=0.04949747468305833 2023-10-13 09:34:42,680 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.51 vs. 
limit=15.0 2023-10-13 09:34:43,104 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.681e+02 1.793e+02 1.972e+02 2.395e+02, threshold=3.586e+02, percent-clipped=0.0 2023-10-13 09:34:52,161 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1343001.3333333333, ans=0.2 2023-10-13 09:35:09,140 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1343094.6666666667, ans=0.125 2023-10-13 09:35:39,284 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1343234.6666666667, ans=0.125 2023-10-13 09:36:01,642 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.06 vs. limit=22.5 2023-10-13 09:36:02,478 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1343328.0, ans=0.05 2023-10-13 09:36:07,322 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1343328.0, ans=0.1 2023-10-13 09:36:11,911 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1343328.0, ans=0.05 2023-10-13 09:36:16,239 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1343374.6666666667, ans=0.2 2023-10-13 09:36:39,408 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 1.824e+02 2.025e+02 2.383e+02 3.530e+02, threshold=4.050e+02, percent-clipped=0.0 2023-10-13 09:36:40,116 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1343421.3333333333, ans=0.125 2023-10-13 09:36:46,728 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1343468.0, ans=0.1 2023-10-13 09:37:01,373 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.83 vs. limit=15.0 2023-10-13 09:37:24,627 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1343608.0, ans=0.125 2023-10-13 09:37:27,942 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1343608.0, ans=0.2 2023-10-13 09:37:48,765 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.49 vs. 
limit=22.5 2023-10-13 09:38:14,030 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1343794.6666666667, ans=0.0 2023-10-13 09:38:19,606 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1343794.6666666667, ans=0.0 2023-10-13 09:38:48,434 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.362e+02 1.670e+02 1.795e+02 1.958e+02 3.422e+02, threshold=3.591e+02, percent-clipped=0.0 2023-10-13 09:39:05,169 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1343981.3333333333, ans=0.125 2023-10-13 09:39:07,406 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-288000.pt 2023-10-13 09:39:13,951 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1343981.3333333333, ans=0.0 2023-10-13 09:39:15,980 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1343981.3333333333, ans=0.2 2023-10-13 09:39:18,073 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.26 vs. limit=15.0 2023-10-13 09:39:23,439 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1344028.0, ans=0.2 2023-10-13 09:39:26,838 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1344028.0, ans=0.125 2023-10-13 09:39:33,067 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1344074.6666666667, ans=0.07 2023-10-13 09:39:33,155 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1344074.6666666667, ans=0.2 2023-10-13 09:39:43,477 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1344121.3333333333, ans=0.0 2023-10-13 09:39:52,119 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1344168.0, ans=0.125 2023-10-13 09:39:53,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1344168.0, ans=0.1 2023-10-13 09:40:13,451 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1344214.6666666667, ans=0.0 2023-10-13 09:40:24,189 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.70 vs. limit=15.0 2023-10-13 09:40:37,878 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1344354.6666666667, ans=0.1 2023-10-13 09:40:39,146 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=3.54 vs. limit=15.0 2023-10-13 09:40:39,156 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. 
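
The checkpoint.py:75 record above shows the run writing batch-indexed checkpoints (checkpoint-288000.pt under zipformer/exp_XL_bpe) in addition to any per-epoch saves; the number in the filename is the global training-batch counter. A generic sketch of that pattern using plain torch.save (icefall's checkpoint.py also persists optimizer, scheduler, and sampler state, omitted here; save_every_n below is a placeholder, since the actual interval is not visible in this excerpt):

    import torch

    def maybe_save_checkpoint(model, batch_idx_train: int, save_every_n: int, exp_dir: str):
        # Save a batch-indexed checkpoint whenever the global batch counter
        # crosses a multiple of save_every_n.
        if batch_idx_train > 0 and batch_idx_train % save_every_n == 0:
            path = f"{exp_dir}/checkpoint-{batch_idx_train}.pt"
            torch.save({"model": model.state_dict(),
                        "batch_idx_train": batch_idx_train}, path)

    # e.g. a counter of 288000 yields zipformer/exp_XL_bpe/checkpoint-288000.pt,
    # matching the record above.
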
limit=6.0 2023-10-13 09:40:40,512 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1344354.6666666667, ans=0.2 2023-10-13 09:40:47,290 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.338e+02 1.854e+02 2.061e+02 2.387e+02 3.180e+02, threshold=4.123e+02, percent-clipped=0.0 2023-10-13 09:41:11,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1344494.6666666667, ans=0.125 2023-10-13 09:41:14,792 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1344494.6666666667, ans=0.1 2023-10-13 09:41:14,878 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1344494.6666666667, ans=0.2 2023-10-13 09:41:14,904 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1344494.6666666667, ans=0.0 2023-10-13 09:41:55,367 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.82 vs. limit=15.0 2023-10-13 09:41:56,198 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.13 vs. limit=15.0 2023-10-13 09:42:15,146 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1344728.0, ans=0.2 2023-10-13 09:42:20,672 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1344728.0, ans=0.125 2023-10-13 09:42:30,925 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.69 vs. limit=15.0 2023-10-13 09:42:45,888 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.788e+02 1.936e+02 2.112e+02 2.690e+02, threshold=3.872e+02, percent-clipped=0.0 2023-10-13 09:44:04,512 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1345148.0, ans=0.125 2023-10-13 09:44:11,868 INFO [train.py:1031] (0/4) Epoch 22, batch 1500, loss[loss=0.1745, simple_loss=0.2602, pruned_loss=0.04441, over 16908.00 frames. ], tot_loss[loss=0.1884, simple_loss=0.28, pruned_loss=0.04844, over 17309194.20 frames. 
], batch size: 72, lr: 1.58e-03, grad_scale: 32.0 2023-10-13 09:44:16,966 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1345194.6666666667, ans=0.0 2023-10-13 09:44:45,646 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.490e+02 1.786e+02 2.057e+02 2.352e+02 3.280e+02, threshold=4.113e+02, percent-clipped=0.0 2023-10-13 09:44:47,037 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1345334.6666666667, ans=0.0 2023-10-13 09:44:57,280 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1345334.6666666667, ans=0.0 2023-10-13 09:45:05,865 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1345381.3333333333, ans=0.125 2023-10-13 09:45:07,899 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1345381.3333333333, ans=0.07 2023-10-13 09:45:18,570 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1345428.0, ans=0.0 2023-10-13 09:45:19,890 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1345428.0, ans=0.1 2023-10-13 09:45:46,810 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1345521.3333333333, ans=0.07 2023-10-13 09:45:59,162 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.01 vs. limit=15.0 2023-10-13 09:46:17,526 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.25 vs. limit=15.0 2023-10-13 09:46:24,525 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1345708.0, ans=0.1 2023-10-13 09:46:30,239 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.39 vs. limit=10.0 2023-10-13 09:46:32,993 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1345708.0, ans=0.1 2023-10-13 09:46:36,915 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1345754.6666666667, ans=0.0 2023-10-13 09:46:46,122 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.483e+02 1.771e+02 1.893e+02 2.111e+02 2.847e+02, threshold=3.786e+02, percent-clipped=0.0 2023-10-13 09:46:51,246 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1345801.3333333333, ans=0.1 2023-10-13 09:46:51,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1345801.3333333333, ans=0.0 2023-10-13 09:47:18,093 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1345894.6666666667, ans=0.0 2023-10-13 09:47:52,488 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. 
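
In the train.py:1031 records, loss decomposes into the two pruned-transducer terms from k2: across every such record in this section, loss = 0.5 * simple_loss + pruned_loss to the printed precision (the 0.5 scale is inferred from the logged numbers themselves and matches the usual post-warm-up weighting of the simple joiner loss). The first bracket is the current batch; tot_loss is the running average over all frames seen so far in the epoch, which is why its frame count grows from record to record. A quick check against the numbers above:

    def combined_loss(simple_loss: float, pruned_loss: float,
                      simple_loss_scale: float = 0.5) -> float:
        # Post-warm-up combination of the two pruned RNN-T loss terms.
        return simple_loss_scale * simple_loss + pruned_loss

    assert abs(combined_loss(0.2847, 0.04761) - 0.19) < 5e-4    # batch 1000, current batch
    assert abs(combined_loss(0.2818, 0.04962) - 0.1905) < 5e-4  # batch 1000, tot_loss
    assert abs(combined_loss(0.2602, 0.04441) - 0.1745) < 5e-4  # batch 1500, current batch
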
limit=6.0 2023-10-13 09:48:06,413 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1346081.3333333333, ans=0.04949747468305833 2023-10-13 09:48:20,251 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1346128.0, ans=0.125 2023-10-13 09:48:21,149 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1346128.0, ans=0.0 2023-10-13 09:48:26,891 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1346174.6666666667, ans=0.125 2023-10-13 09:48:45,752 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.525e+02 1.817e+02 1.987e+02 2.211e+02 3.152e+02, threshold=3.975e+02, percent-clipped=0.0 2023-10-13 09:48:52,392 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1346268.0, ans=0.0 2023-10-13 09:48:55,242 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1346268.0, ans=0.125 2023-10-13 09:49:04,054 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.34 vs. limit=12.0 2023-10-13 09:49:08,939 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1346361.3333333333, ans=0.125 2023-10-13 09:49:09,761 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1346361.3333333333, ans=0.125 2023-10-13 09:49:21,398 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1346408.0, ans=0.125 2023-10-13 09:49:29,519 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1346408.0, ans=0.1 2023-10-13 09:49:31,049 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.43 vs. limit=15.0 2023-10-13 09:50:05,896 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1346548.0, ans=0.125 2023-10-13 09:50:09,927 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.68 vs. limit=10.0 2023-10-13 09:50:11,974 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1346548.0, ans=0.125 2023-10-13 09:50:39,640 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.49 vs. limit=15.0 2023-10-13 09:50:48,788 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.781e+02 1.957e+02 2.197e+02 3.261e+02, threshold=3.914e+02, percent-clipped=0.0 2023-10-13 09:50:49,134 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1346688.0, ans=0.125 2023-10-13 09:51:17,150 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.84 vs. 
limit=12.0 2023-10-13 09:51:22,101 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1346828.0, ans=0.125 2023-10-13 09:51:32,254 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1346874.6666666667, ans=0.0 2023-10-13 09:51:58,753 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1346968.0, ans=0.0 2023-10-13 09:51:58,878 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1346968.0, ans=0.125 2023-10-13 09:52:06,598 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1347014.6666666667, ans=0.125 2023-10-13 09:52:44,156 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1347154.6666666667, ans=0.125 2023-10-13 09:52:47,816 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.579e+02 1.855e+02 2.017e+02 2.260e+02 4.426e+02, threshold=4.034e+02, percent-clipped=1.0 2023-10-13 09:52:56,660 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.33 vs. limit=15.0 2023-10-13 09:53:22,795 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1347294.6666666667, ans=0.09899494936611666 2023-10-13 09:53:57,269 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1347388.0, ans=0.0 2023-10-13 09:54:00,607 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1347388.0, ans=0.125 2023-10-13 09:54:07,513 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1347388.0, ans=10.0 2023-10-13 09:54:12,634 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=9.53 vs. limit=22.5 2023-10-13 09:54:34,693 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1347528.0, ans=0.2 2023-10-13 09:54:36,020 INFO [train.py:1031] (0/4) Epoch 22, batch 2000, loss[loss=0.1876, simple_loss=0.289, pruned_loss=0.04307, over 16863.00 frames. ], tot_loss[loss=0.1894, simple_loss=0.281, pruned_loss=0.04894, over 20715911.08 frames. ], batch size: 165, lr: 1.58e-03, grad_scale: 32.0 2023-10-13 09:54:38,386 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.00 vs. 
limit=12.0 2023-10-13 09:55:19,314 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 1.717e+02 1.874e+02 2.067e+02 3.063e+02, threshold=3.749e+02, percent-clipped=0.0 2023-10-13 09:55:21,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1347668.0, ans=0.125 2023-10-13 09:55:32,990 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1347668.0, ans=0.125 2023-10-13 09:55:33,095 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1347668.0, ans=0.04949747468305833 2023-10-13 09:55:37,577 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=1347714.6666666667, ans=10.0 2023-10-13 09:56:11,894 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1347808.0, ans=0.1 2023-10-13 09:56:24,674 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1347854.6666666667, ans=0.1 2023-10-13 09:56:52,786 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1347948.0, ans=0.0 2023-10-13 09:57:04,563 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.16 vs. limit=15.0 2023-10-13 09:57:28,101 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1348041.3333333333, ans=0.125 2023-10-13 09:57:41,071 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1348088.0, ans=0.0 2023-10-13 09:57:51,762 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.440e+02 1.725e+02 1.936e+02 2.204e+02 2.901e+02, threshold=3.872e+02, percent-clipped=0.0 2023-10-13 09:57:52,143 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1348134.6666666667, ans=0.0 2023-10-13 09:58:31,683 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1348228.0, ans=0.125 2023-10-13 09:59:08,455 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.68 vs. 
limit=15.0 2023-10-13 09:59:17,923 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1348414.6666666667, ans=0.125 2023-10-13 09:59:32,915 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1348461.3333333333, ans=0.0 2023-10-13 09:59:58,561 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1348554.6666666667, ans=0.04949747468305833 2023-10-13 10:00:00,906 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.904e+02 2.114e+02 2.411e+02 3.158e+02, threshold=4.228e+02, percent-clipped=0.0 2023-10-13 10:00:05,930 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1348601.3333333333, ans=0.125 2023-10-13 10:00:28,924 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1348694.6666666667, ans=0.2 2023-10-13 10:00:35,863 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1348741.3333333333, ans=0.125 2023-10-13 10:01:01,044 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1348788.0, ans=0.1 2023-10-13 10:01:12,107 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1348834.6666666667, ans=0.125 2023-10-13 10:01:47,997 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 10:01:56,548 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1349021.3333333333, ans=0.125 2023-10-13 10:01:59,495 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 10:02:00,114 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.823e+02 1.945e+02 2.144e+02 2.654e+02, threshold=3.890e+02, percent-clipped=0.0 2023-10-13 10:02:20,734 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.26 vs. 
limit=15.0 2023-10-13 10:02:23,590 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=1349161.3333333333, ans=0.1 2023-10-13 10:02:36,809 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1349208.0, ans=0.125 2023-10-13 10:02:50,749 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1349254.6666666667, ans=0.1 2023-10-13 10:03:22,712 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-13 10:03:36,860 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1349441.3333333333, ans=0.125 2023-10-13 10:03:40,157 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1349441.3333333333, ans=0.125 2023-10-13 10:03:44,694 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1349488.0, ans=0.125 2023-10-13 10:03:48,671 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1349488.0, ans=0.125 2023-10-13 10:03:54,099 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1349488.0, ans=0.1 2023-10-13 10:03:56,641 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.516e+02 1.827e+02 1.997e+02 2.193e+02 3.325e+02, threshold=3.995e+02, percent-clipped=0.0 2023-10-13 10:03:58,821 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 10:04:08,918 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1349581.3333333333, ans=0.0 2023-10-13 10:04:23,578 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1349628.0, ans=0.1 2023-10-13 10:04:42,574 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1349721.3333333333, ans=0.125 2023-10-13 10:05:05,120 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1349814.6666666667, ans=0.09899494936611666 2023-10-13 10:05:15,104 INFO [train.py:1031] (0/4) Epoch 22, batch 2500, loss[loss=0.1867, simple_loss=0.2845, pruned_loss=0.04446, over 16939.00 frames. ], tot_loss[loss=0.1902, simple_loss=0.2815, pruned_loss=0.04946, over 23355998.67 frames. ], batch size: 110, lr: 1.58e-03, grad_scale: 32.0 2023-10-13 10:05:19,118 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1349861.3333333333, ans=0.0 2023-10-13 10:05:25,015 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.52 vs. 
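
The scaling.py:199 "ScheduledFloat" records print module hyperparameters (dropout p, skip rates, balancer probabilities, scale_min values) that are functions of the global batch count rather than constants, which is why each record carries a batch_count and a current value ans. A minimal sketch of the idea, assuming a piecewise-linear schedule (icefall's real ScheduledFloat supports more than this, and the breakpoints below are hypothetical; only the final value 0.1 is taken from the encoder_embed.dropout.p record above):

    class ScheduledFloat:
        """A float hyperparameter interpolated piecewise-linearly in batch_count."""
        def __init__(self, *points):
            # points: (batch_count, value) pairs defining the schedule.
            self.points = sorted(points)

        def value(self, batch_count: float) -> float:
            (x0, y0), *rest = self.points
            if batch_count <= x0:
                return y0
            for x1, y1 in rest:
                if batch_count <= x1:
                    return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)
                x0, y0 = x1, y1
            return y0  # past the last breakpoint, hold the final value

    dropout_p = ScheduledFloat((0.0, 0.3), (20000.0, 0.1))  # hypothetical breakpoints
    print(dropout_p.value(1349628.0))  # -> 0.1, as in 'encoder_embed.dropout.p ... ans=0.1'
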
limit=15.0 2023-10-13 10:05:35,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1349908.0, ans=0.125 2023-10-13 10:05:49,681 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 10:05:50,005 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.32 vs. limit=10.0 2023-10-13 10:05:50,257 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.314e+02 1.795e+02 1.993e+02 2.166e+02 3.553e+02, threshold=3.986e+02, percent-clipped=0.0 2023-10-13 10:05:59,083 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1350001.3333333333, ans=0.1 2023-10-13 10:06:03,626 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 10:07:04,092 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1350281.3333333333, ans=0.125 2023-10-13 10:07:07,841 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1350328.0, ans=0.0 2023-10-13 10:07:09,909 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1350328.0, ans=0.0 2023-10-13 10:07:40,825 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.87 vs. limit=10.0 2023-10-13 10:07:43,418 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1350468.0, ans=0.1 2023-10-13 10:07:45,229 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.743e+02 1.893e+02 2.099e+02 2.638e+02, threshold=3.786e+02, percent-clipped=0.0 2023-10-13 10:07:58,796 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1350514.6666666667, ans=0.2 2023-10-13 10:08:03,522 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1350514.6666666667, ans=0.0 2023-10-13 10:08:08,679 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1350514.6666666667, ans=0.2 2023-10-13 10:08:16,697 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1350561.3333333333, ans=0.1 2023-10-13 10:08:18,366 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1350561.3333333333, ans=0.125 2023-10-13 10:08:18,664 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=3.54 vs. 
limit=15.0 2023-10-13 10:08:22,836 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1350608.0, ans=0.1 2023-10-13 10:08:33,328 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1350654.6666666667, ans=0.125 2023-10-13 10:08:37,329 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1350654.6666666667, ans=0.125 2023-10-13 10:08:39,206 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1350654.6666666667, ans=0.1 2023-10-13 10:09:05,838 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.09 vs. limit=15.0 2023-10-13 10:09:05,893 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.34 vs. limit=12.0 2023-10-13 10:09:22,424 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 10:09:22,435 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1350794.6666666667, ans=0.125 2023-10-13 10:09:38,143 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1350888.0, ans=0.125 2023-10-13 10:09:39,093 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1350888.0, ans=0.125 2023-10-13 10:09:42,053 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1350888.0, ans=0.1 2023-10-13 10:09:42,077 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1350888.0, ans=0.1 2023-10-13 10:09:50,882 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.561e+02 1.796e+02 1.918e+02 2.152e+02 2.768e+02, threshold=3.836e+02, percent-clipped=0.0 2023-10-13 10:10:26,370 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1351028.0, ans=0.2 2023-10-13 10:10:44,539 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1351074.6666666667, ans=0.125 2023-10-13 10:10:53,224 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.69 vs. limit=15.0 2023-10-13 10:11:03,505 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1351168.0, ans=0.125 2023-10-13 10:11:37,398 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.81 vs. 
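
The scaling.py:979 "Whitening" records compare a per-module whitening metric against a scheduled limit (the limit=... value each record ends with); when the metric exceeds the limit, a small penalty gradient nudges that module's activation covariance back toward a multiple of the identity. Roughly, and as a paraphrase of the idea rather than icefall's exact code, the metric is mean(eig^2) / mean(eig)^2 over the covariance eigenvalues, which equals 1.0 for perfectly white features and grows with eigenvalue spread, so a value like metric=3.54 vs. limit=15.0 means the module is comfortably inside its budget. The scaling.py:1069 "WithLoss" records report a related auxiliary penalty on attention weights; loss-sum=0.000e+00 throughout this section indicates that penalty is currently inactive.

    import torch

    def whitening_metric(x: torch.Tensor) -> torch.Tensor:
        """x: (num_frames, num_channels) activations for one whitening group."""
        x = x - x.mean(dim=0)
        cov = (x.t() @ x) / x.shape[0]
        n = cov.shape[0]
        # trace(cov^2)/n over (trace(cov)/n)^2 == mean(eig^2)/mean(eig)^2 >= 1.
        return ((cov @ cov).diagonal().sum() / n) / (cov.diagonal().sum() / n) ** 2

    x = torch.randn(2000, 256)                  # near-white activations
    print(whitening_metric(x))                  # close to 1.0 (sampling noise aside)
    print(whitening_metric(x * torch.linspace(0.1, 3.0, 256)))  # spread spectrum -> larger
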
limit=15.0 2023-10-13 10:11:41,646 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1351308.0, ans=0.125 2023-10-13 10:12:05,459 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.770e+02 1.961e+02 2.104e+02 2.795e+02, threshold=3.922e+02, percent-clipped=0.0 2023-10-13 10:12:13,304 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.38 vs. limit=15.0 2023-10-13 10:12:24,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1351448.0, ans=0.125 2023-10-13 10:12:30,398 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1351448.0, ans=0.0 2023-10-13 10:12:34,357 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1351494.6666666667, ans=0.125 2023-10-13 10:12:39,616 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1351494.6666666667, ans=0.1 2023-10-13 10:12:40,889 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.88 vs. limit=10.0 2023-10-13 10:12:40,889 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1351494.6666666667, ans=10.0 2023-10-13 10:12:43,874 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1351494.6666666667, ans=0.125 2023-10-13 10:12:53,153 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1351541.3333333333, ans=0.125 2023-10-13 10:12:54,086 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1351541.3333333333, ans=0.05 2023-10-13 10:13:14,087 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1351588.0, ans=0.2 2023-10-13 10:13:14,414 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.23 vs. limit=15.0 2023-10-13 10:13:16,197 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1351634.6666666667, ans=0.125 2023-10-13 10:14:13,201 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.55 vs. limit=10.0 2023-10-13 10:14:18,433 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1351868.0, ans=0.0 2023-10-13 10:14:19,143 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.508e+02 1.747e+02 1.900e+02 2.100e+02 2.578e+02, threshold=3.801e+02, percent-clipped=0.0 2023-10-13 10:14:26,867 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1351868.0, ans=0.125 2023-10-13 10:14:39,155 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.43 vs. 
limit=12.0 2023-10-13 10:14:48,661 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1351961.3333333333, ans=0.125 2023-10-13 10:15:02,933 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1352008.0, ans=0.125 2023-10-13 10:15:06,665 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1352054.6666666667, ans=0.0 2023-10-13 10:15:17,289 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.97 vs. limit=15.0 2023-10-13 10:15:22,073 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.01 vs. limit=12.0 2023-10-13 10:15:25,698 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=1352101.3333333333, ans=0.025 2023-10-13 10:15:38,676 INFO [train.py:1031] (0/4) Epoch 22, batch 3000, loss[loss=0.1819, simple_loss=0.2783, pruned_loss=0.04273, over 16908.00 frames. ], tot_loss[loss=0.1895, simple_loss=0.2805, pruned_loss=0.04922, over 25429831.65 frames. ], batch size: 87, lr: 1.58e-03, grad_scale: 32.0 2023-10-13 10:15:46,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1352194.6666666667, ans=0.125 2023-10-13 10:16:12,348 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.366e+02 1.777e+02 1.997e+02 2.241e+02 2.858e+02, threshold=3.994e+02, percent-clipped=0.0 2023-10-13 10:16:20,756 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1352334.6666666667, ans=0.0 2023-10-13 10:16:46,873 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1352474.6666666667, ans=0.125 2023-10-13 10:16:48,647 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1352474.6666666667, ans=0.125 2023-10-13 10:17:17,834 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.56 vs. limit=15.0 2023-10-13 10:17:19,246 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 10:17:19,720 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.49 vs. limit=15.0 2023-10-13 10:17:30,936 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.87 vs. 
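
The grad_scale field in the train.py:1031 records (16.0 earlier in this section, 32.0 by batch 3000) is the dynamic loss scale of mixed-precision training: it doubles after a sustained run of overflow-free steps and is cut back whenever inf/nan gradients are detected. A generic PyTorch AMP sketch of that mechanism, not icefall's actual train loop (model, optimizer, batch, and loss_fn are placeholders):

    import torch

    scaler = torch.cuda.amp.GradScaler()  # default growth/backoff policy

    def train_step(model, optimizer, batch, loss_fn):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = loss_fn(model(batch))
        scaler.scale(loss).backward()  # backprop the scaled loss
        scaler.step(optimizer)         # unscales grads; skips the step on inf/nan
        scaler.update()                # grow or shrink the scale for the next step
        return loss.detach(), scaler.get_scale()  # get_scale() is the logged grad_scale
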
limit=22.5 2023-10-13 10:17:34,060 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 10:17:46,350 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=1352661.3333333333, ans=0.5 2023-10-13 10:18:00,301 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1352708.0, ans=0.0 2023-10-13 10:18:13,753 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1352801.3333333333, ans=0.125 2023-10-13 10:18:15,506 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.512e+02 1.771e+02 1.929e+02 2.164e+02 3.098e+02, threshold=3.858e+02, percent-clipped=0.0 2023-10-13 10:18:22,533 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1352801.3333333333, ans=0.0 2023-10-13 10:18:24,885 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 10:18:31,361 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.36 vs. limit=22.5 2023-10-13 10:18:52,166 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1352941.3333333333, ans=0.0 2023-10-13 10:19:02,754 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1352988.0, ans=0.125 2023-10-13 10:19:10,805 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1353034.6666666667, ans=0.0 2023-10-13 10:19:19,727 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1353034.6666666667, ans=0.0 2023-10-13 10:19:26,592 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 10:19:56,017 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_na.min_abs, batch_count=1353221.3333333333, ans=0.02 2023-10-13 10:19:59,391 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1353221.3333333333, ans=0.035 2023-10-13 10:20:14,109 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.410e+02 1.752e+02 1.912e+02 2.105e+02 3.038e+02, threshold=3.825e+02, percent-clipped=0.0 2023-10-13 10:20:24,846 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1353268.0, ans=0.2 2023-10-13 10:20:25,165 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.31 vs. 
limit=15.0 2023-10-13 10:20:32,276 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1353314.6666666667, ans=0.0 2023-10-13 10:20:48,552 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1353361.3333333333, ans=0.0 2023-10-13 10:21:02,550 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1353408.0, ans=0.1 2023-10-13 10:21:02,599 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1353408.0, ans=0.2 2023-10-13 10:21:40,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1353548.0, ans=0.95 2023-10-13 10:22:15,546 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1353688.0, ans=0.125 2023-10-13 10:22:16,615 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1353688.0, ans=0.05 2023-10-13 10:22:23,181 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1353734.6666666667, ans=0.0 2023-10-13 10:22:24,902 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.596e+02 1.799e+02 1.983e+02 2.286e+02 3.832e+02, threshold=3.967e+02, percent-clipped=1.0 2023-10-13 10:22:28,235 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1353734.6666666667, ans=0.125 2023-10-13 10:22:36,178 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1353781.3333333333, ans=0.2 2023-10-13 10:22:43,281 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1353781.3333333333, ans=0.125 2023-10-13 10:22:50,936 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.40 vs. 
limit=22.5 2023-10-13 10:22:53,306 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1353828.0, ans=0.2 2023-10-13 10:22:57,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1353874.6666666667, ans=0.2 2023-10-13 10:22:59,277 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1353874.6666666667, ans=0.125 2023-10-13 10:23:23,912 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1353921.3333333333, ans=0.125 2023-10-13 10:24:26,256 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1354154.6666666667, ans=0.125 2023-10-13 10:24:30,223 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1354154.6666666667, ans=0.125 2023-10-13 10:24:33,755 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1354201.3333333333, ans=0.0 2023-10-13 10:24:34,356 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.521e+02 1.789e+02 1.973e+02 2.192e+02 3.127e+02, threshold=3.947e+02, percent-clipped=0.0 2023-10-13 10:25:17,650 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=13.53 vs. limit=15.0 2023-10-13 10:25:37,303 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.17 vs. limit=15.0 2023-10-13 10:25:54,360 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 10:25:54,984 INFO [train.py:1031] (0/4) Epoch 22, batch 3500, loss[loss=0.1844, simple_loss=0.2798, pruned_loss=0.0445, over 16832.00 frames. ], tot_loss[loss=0.1894, simple_loss=0.2805, pruned_loss=0.04914, over 27082979.77 frames. ], batch size: 98, lr: 1.58e-03, grad_scale: 16.0 2023-10-13 10:25:56,248 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1354528.0, ans=0.125 2023-10-13 10:26:01,296 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1354528.0, ans=0.125 2023-10-13 10:26:10,975 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1354574.6666666667, ans=0.125 2023-10-13 10:26:24,517 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.50 vs. 
limit=15.0 2023-10-13 10:26:31,716 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.525e+02 1.775e+02 1.923e+02 2.142e+02 3.464e+02, threshold=3.847e+02, percent-clipped=0.0 2023-10-13 10:26:31,952 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1354668.0, ans=0.2 2023-10-13 10:26:38,834 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1354668.0, ans=0.04949747468305833 2023-10-13 10:26:49,221 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1354714.6666666667, ans=0.125 2023-10-13 10:26:53,613 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1354714.6666666667, ans=0.125 2023-10-13 10:27:44,204 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1354854.6666666667, ans=0.2 2023-10-13 10:27:51,678 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1354901.3333333333, ans=0.09899494936611666 2023-10-13 10:28:15,858 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1354994.6666666667, ans=0.125 2023-10-13 10:28:25,373 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1355041.3333333333, ans=0.0 2023-10-13 10:28:53,123 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.496e+02 1.838e+02 1.983e+02 2.163e+02 2.976e+02, threshold=3.967e+02, percent-clipped=0.0 2023-10-13 10:29:12,011 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1355181.3333333333, ans=0.0 2023-10-13 10:29:13,844 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1355181.3333333333, ans=0.015 2023-10-13 10:29:58,326 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1355368.0, ans=0.125 2023-10-13 10:30:05,255 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1355414.6666666667, ans=0.2 2023-10-13 10:30:05,357 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1355414.6666666667, ans=0.0 2023-10-13 10:30:06,200 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1355414.6666666667, ans=0.07 2023-10-13 10:30:12,618 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1355414.6666666667, ans=0.0 2023-10-13 10:30:19,105 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1355461.3333333333, ans=0.2 2023-10-13 10:30:22,926 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1355461.3333333333, ans=0.125 2023-10-13 10:30:23,267 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.24 vs. 
limit=15.0 2023-10-13 10:30:27,327 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1355508.0, ans=0.04949747468305833 2023-10-13 10:30:30,911 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1355508.0, ans=0.0 2023-10-13 10:30:48,879 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1355554.6666666667, ans=0.04949747468305833 2023-10-13 10:30:59,592 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.717e+02 1.888e+02 2.057e+02 3.044e+02, threshold=3.776e+02, percent-clipped=0.0 2023-10-13 10:31:34,305 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1355741.3333333333, ans=0.04949747468305833 2023-10-13 10:31:39,699 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1355741.3333333333, ans=0.0 2023-10-13 10:31:44,023 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1355788.0, ans=0.1 2023-10-13 10:31:46,358 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1355788.0, ans=0.0 2023-10-13 10:32:31,272 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1355928.0, ans=0.125 2023-10-13 10:32:50,722 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1356021.3333333333, ans=0.125 2023-10-13 10:32:59,635 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1356068.0, ans=0.125 2023-10-13 10:33:01,611 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.502e+02 1.893e+02 2.171e+02 2.320e+02 3.701e+02, threshold=4.342e+02, percent-clipped=0.0 2023-10-13 10:33:05,140 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1356068.0, ans=0.125 2023-10-13 10:33:05,526 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.55 vs. limit=22.5 2023-10-13 10:33:06,996 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1356068.0, ans=0.125 2023-10-13 10:34:00,238 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.99 vs. limit=6.0 2023-10-13 10:34:15,630 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1356348.0, ans=0.125 2023-10-13 10:34:17,230 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.04 vs. 
limit=22.5 2023-10-13 10:34:29,964 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1356394.6666666667, ans=0.125 2023-10-13 10:34:40,369 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1356441.3333333333, ans=0.0 2023-10-13 10:34:40,423 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1356441.3333333333, ans=0.125 2023-10-13 10:34:41,207 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1356441.3333333333, ans=0.125 2023-10-13 10:34:58,263 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.468e+02 1.721e+02 1.916e+02 2.047e+02 2.709e+02, threshold=3.833e+02, percent-clipped=0.0 2023-10-13 10:34:58,893 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.80 vs. limit=22.5 2023-10-13 10:35:13,934 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1356581.3333333333, ans=0.125 2023-10-13 10:35:23,507 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1356628.0, ans=0.0 2023-10-13 10:35:41,208 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1356721.3333333333, ans=0.125 2023-10-13 10:35:41,277 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1356721.3333333333, ans=0.125 2023-10-13 10:36:08,201 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1356814.6666666667, ans=0.0 2023-10-13 10:36:16,486 INFO [train.py:1031] (0/4) Epoch 22, batch 4000, loss[loss=0.2013, simple_loss=0.2934, pruned_loss=0.05463, over 16934.00 frames. ], tot_loss[loss=0.189, simple_loss=0.2801, pruned_loss=0.049, over 28356997.98 frames. ], batch size: 156, lr: 1.58e-03, grad_scale: 32.0 2023-10-13 10:36:30,255 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.88 vs. limit=22.5 2023-10-13 10:36:31,390 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=1356908.0, ans=0.1 2023-10-13 10:36:33,624 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1356908.0, ans=0.125 2023-10-13 10:36:43,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1356954.6666666667, ans=0.1 2023-10-13 10:36:50,575 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1356954.6666666667, ans=0.0 2023-10-13 10:36:51,629 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1356954.6666666667, ans=0.0 2023-10-13 10:36:58,557 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.524e+02 1.864e+02 2.093e+02 2.403e+02 3.130e+02, threshold=4.186e+02, percent-clipped=0.0 2023-10-13 10:37:02,343 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.77 vs. 
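
The learning rate stays pinned at lr: 1.58e-03 across all of these batch summaries because icefall's Eden scheduler decays smoothly in both the batch count and the (fractional) epoch, and this deep into training neither factor moves appreciably between log lines. From memory, and therefore to be treated as an assumption rather than a quotation of the code, Eden computes:

    def eden_lr(base_lr: float, batch: int, epoch: float,
                lr_batches: float, lr_epochs: float) -> float:
        # Two independent inverse-fourth-root decay factors, one in the batch
        # count and one in the epoch; both tend to 1 early in training.
        batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
        epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
        return base_lr * batch_factor * epoch_factor
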
limit=12.0 2023-10-13 10:37:11,283 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1357048.0, ans=0.0 2023-10-13 10:37:39,626 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1357188.0, ans=0.125 2023-10-13 10:37:45,457 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1357188.0, ans=0.0 2023-10-13 10:37:54,889 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.54 vs. limit=15.0 2023-10-13 10:38:07,089 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1357281.3333333333, ans=0.05 2023-10-13 10:38:16,346 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1357328.0, ans=0.0 2023-10-13 10:38:18,302 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1357328.0, ans=0.0 2023-10-13 10:38:30,400 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1357374.6666666667, ans=0.1 2023-10-13 10:38:35,757 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1357374.6666666667, ans=0.0 2023-10-13 10:38:41,556 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1357421.3333333333, ans=0.0 2023-10-13 10:38:57,332 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.526e+02 1.799e+02 1.969e+02 2.264e+02 3.202e+02, threshold=3.938e+02, percent-clipped=0.0 2023-10-13 10:39:02,018 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1357468.0, ans=0.0 2023-10-13 10:39:04,972 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1357514.6666666667, ans=0.1 2023-10-13 10:39:54,797 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1357654.6666666667, ans=0.125 2023-10-13 10:40:12,697 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1357701.3333333333, ans=0.125 2023-10-13 10:40:23,164 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1357748.0, ans=0.07 2023-10-13 10:40:35,215 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1357794.6666666667, ans=0.1 2023-10-13 10:40:45,059 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1357841.3333333333, ans=0.1 2023-10-13 10:40:52,197 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1357841.3333333333, ans=0.1 2023-10-13 10:40:58,072 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1357888.0, ans=0.125 2023-10-13 10:41:14,347 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 1.757e+02 1.965e+02 2.112e+02 
2.978e+02, threshold=3.930e+02, percent-clipped=0.0 2023-10-13 10:41:14,559 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1357934.6666666667, ans=0.125 2023-10-13 10:41:22,964 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1357981.3333333333, ans=0.125 2023-10-13 10:41:23,883 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1357981.3333333333, ans=0.125 2023-10-13 10:41:27,185 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1357981.3333333333, ans=0.125 2023-10-13 10:41:27,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1357981.3333333333, ans=15.0 2023-10-13 10:41:40,753 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1358028.0, ans=0.125 2023-10-13 10:41:40,790 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1358028.0, ans=0.2 2023-10-13 10:41:46,684 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1358074.6666666667, ans=0.125 2023-10-13 10:41:56,066 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1358121.3333333333, ans=0.125 2023-10-13 10:41:58,685 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1358121.3333333333, ans=0.125 2023-10-13 10:42:00,595 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1358121.3333333333, ans=0.0 2023-10-13 10:42:06,491 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.33 vs. limit=6.0 2023-10-13 10:42:28,222 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1358214.6666666667, ans=0.1 2023-10-13 10:42:37,369 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.83 vs. 
limit=15.0 2023-10-13 10:42:44,754 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1358308.0, ans=0.125 2023-10-13 10:43:02,259 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1358354.6666666667, ans=0.125 2023-10-13 10:43:08,523 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1358401.3333333333, ans=0.125 2023-10-13 10:43:13,987 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 1.783e+02 1.943e+02 2.136e+02 2.979e+02, threshold=3.886e+02, percent-clipped=0.0 2023-10-13 10:43:20,480 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1358448.0, ans=0.0 2023-10-13 10:43:56,196 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1358541.3333333333, ans=0.125 2023-10-13 10:44:16,074 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_na.min_abs, batch_count=1358634.6666666667, ans=0.02 2023-10-13 10:44:32,332 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 10:44:37,665 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1358728.0, ans=0.0 2023-10-13 10:44:41,554 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1358728.0, ans=0.0 2023-10-13 10:44:43,781 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=9.61 vs. limit=22.5 2023-10-13 10:45:15,343 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=1358868.0, ans=10.0 2023-10-13 10:45:19,671 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=1358868.0, ans=0.1 2023-10-13 10:45:23,080 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 1.817e+02 1.978e+02 2.218e+02 3.339e+02, threshold=3.956e+02, percent-clipped=0.0 2023-10-13 10:45:39,175 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1358914.6666666667, ans=0.0 2023-10-13 10:45:45,113 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1358961.3333333333, ans=0.125 2023-10-13 10:45:55,225 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.12 vs. limit=10.0 2023-10-13 10:46:02,272 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1359008.0, ans=0.1 2023-10-13 10:46:28,202 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.20 vs. 
limit=15.0 2023-10-13 10:46:29,159 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1359101.3333333333, ans=0.0 2023-10-13 10:46:40,030 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.55 vs. limit=15.0 2023-10-13 10:46:47,903 INFO [train.py:1031] (0/4) Epoch 22, batch 4500, loss[loss=0.1771, simple_loss=0.2694, pruned_loss=0.04238, over 16886.00 frames. ], tot_loss[loss=0.1891, simple_loss=0.2805, pruned_loss=0.04884, over 29362993.00 frames. ], batch size: 130, lr: 1.57e-03, grad_scale: 32.0 2023-10-13 10:46:50,310 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1359194.6666666667, ans=0.125 2023-10-13 10:46:52,446 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1359194.6666666667, ans=0.125 2023-10-13 10:47:10,228 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.19 vs. limit=15.0 2023-10-13 10:47:12,011 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1359288.0, ans=0.125 2023-10-13 10:47:15,266 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.06 vs. limit=15.0 2023-10-13 10:47:18,141 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1359288.0, ans=0.125 2023-10-13 10:47:25,477 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.561e+02 1.737e+02 1.884e+02 2.066e+02 2.936e+02, threshold=3.769e+02, percent-clipped=0.0 2023-10-13 10:47:54,727 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1359474.6666666667, ans=0.0 2023-10-13 10:48:12,060 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.61 vs. limit=15.0 2023-10-13 10:48:21,332 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.28 vs. 
limit=15.0 2023-10-13 10:48:41,882 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1359661.3333333333, ans=0.125 2023-10-13 10:49:09,893 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1359754.6666666667, ans=0.2 2023-10-13 10:49:09,954 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1359754.6666666667, ans=0.125 2023-10-13 10:49:13,237 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1359754.6666666667, ans=0.125 2023-10-13 10:49:19,993 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.735e+02 1.959e+02 2.214e+02 3.494e+02, threshold=3.918e+02, percent-clipped=0.0 2023-10-13 10:49:23,570 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1359801.3333333333, ans=0.125 2023-10-13 10:49:26,545 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1359848.0, ans=0.125 2023-10-13 10:49:27,487 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 10:49:31,335 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.90 vs. limit=15.0 2023-10-13 10:49:35,339 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1359848.0, ans=0.07 2023-10-13 10:50:23,969 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.52 vs. limit=15.0 2023-10-13 10:50:24,570 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1360034.6666666667, ans=0.0 2023-10-13 10:50:51,372 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1360128.0, ans=0.0 2023-10-13 10:50:59,176 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1360174.6666666667, ans=0.125 2023-10-13 10:51:02,742 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1360174.6666666667, ans=0.125 2023-10-13 10:51:05,865 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.83 vs. limit=10.0 2023-10-13 10:51:10,401 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.36 vs. 
limit=15.0 2023-10-13 10:51:26,937 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.510e+02 1.788e+02 1.942e+02 2.165e+02 3.503e+02, threshold=3.884e+02, percent-clipped=0.0 2023-10-13 10:51:27,323 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 10:51:33,391 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1360314.6666666667, ans=0.125 2023-10-13 10:51:43,468 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1360361.3333333333, ans=0.125 2023-10-13 10:51:45,109 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1360361.3333333333, ans=0.0 2023-10-13 10:51:51,056 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1360361.3333333333, ans=0.1 2023-10-13 10:51:53,647 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1360361.3333333333, ans=0.125 2023-10-13 10:51:58,155 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1360408.0, ans=0.2 2023-10-13 10:52:06,960 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1360454.6666666667, ans=0.125 2023-10-13 10:52:12,767 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.53 vs. limit=22.5 2023-10-13 10:53:06,621 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1360688.0, ans=0.1 2023-10-13 10:53:22,584 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.464e+02 1.704e+02 1.836e+02 2.009e+02 2.865e+02, threshold=3.671e+02, percent-clipped=0.0 2023-10-13 10:53:26,987 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1360734.6666666667, ans=0.125 2023-10-13 10:53:56,342 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1360874.6666666667, ans=0.95 2023-10-13 10:53:57,666 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.02 vs. limit=15.0 2023-10-13 10:54:05,410 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1360921.3333333333, ans=0.0 2023-10-13 10:54:12,953 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1360921.3333333333, ans=0.0 2023-10-13 10:54:37,719 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1361014.6666666667, ans=0.0 2023-10-13 10:54:47,111 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1361061.3333333333, ans=0.0 2023-10-13 10:54:50,455 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.59 vs. 
limit=22.5 2023-10-13 10:55:07,470 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1361154.6666666667, ans=0.125 2023-10-13 10:55:21,923 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1361201.3333333333, ans=0.0 2023-10-13 10:55:22,612 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.522e+02 1.747e+02 1.921e+02 2.111e+02 3.439e+02, threshold=3.843e+02, percent-clipped=0.0 2023-10-13 10:55:48,198 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.31 vs. limit=22.5 2023-10-13 10:55:50,718 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1361294.6666666667, ans=0.2 2023-10-13 10:55:52,771 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1361341.3333333333, ans=0.2 2023-10-13 10:56:04,852 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1361388.0, ans=0.0 2023-10-13 10:56:13,582 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.51 vs. limit=15.0 2023-10-13 10:56:24,919 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1361434.6666666667, ans=0.125 2023-10-13 10:56:31,815 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1361481.3333333333, ans=0.125 2023-10-13 10:56:32,195 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=11.45 vs. limit=12.0 2023-10-13 10:56:36,831 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.18 vs. limit=22.5 2023-10-13 10:56:37,278 INFO [train.py:1031] (0/4) Epoch 22, batch 5000, loss[loss=0.1899, simple_loss=0.28, pruned_loss=0.04993, over 16935.00 frames. ], tot_loss[loss=0.1892, simple_loss=0.2803, pruned_loss=0.04898, over 30131237.85 frames. ], batch size: 72, lr: 1.57e-03, grad_scale: 16.0 2023-10-13 10:56:56,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1361574.6666666667, ans=0.125 2023-10-13 10:57:01,371 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1361621.3333333333, ans=0.0 2023-10-13 10:57:10,782 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.98 vs. limit=15.0 2023-10-13 10:57:19,777 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.831e+02 1.966e+02 2.188e+02 2.975e+02, threshold=3.932e+02, percent-clipped=0.0 2023-10-13 10:57:22,316 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.73 vs. 
limit=22.5 2023-10-13 10:57:27,673 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1361714.6666666667, ans=0.0 2023-10-13 10:57:29,857 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1361714.6666666667, ans=0.125 2023-10-13 10:57:30,135 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.44 vs. limit=15.0 2023-10-13 10:57:37,347 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1361761.3333333333, ans=0.125 2023-10-13 10:57:41,395 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1361761.3333333333, ans=0.125 2023-10-13 10:57:51,986 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1361808.0, ans=0.125 2023-10-13 10:58:00,563 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1361854.6666666667, ans=0.125 2023-10-13 10:58:35,020 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1361948.0, ans=0.125 2023-10-13 10:58:37,761 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1361994.6666666667, ans=0.125 2023-10-13 10:58:50,777 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1362041.3333333333, ans=0.125 2023-10-13 10:59:02,982 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1362088.0, ans=0.05 2023-10-13 10:59:08,067 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1362088.0, ans=0.125 2023-10-13 10:59:17,016 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.784e+02 2.009e+02 2.283e+02 3.097e+02, threshold=4.018e+02, percent-clipped=0.0 2023-10-13 10:59:17,686 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.11 vs. 
limit=12.0 2023-10-13 10:59:19,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1362134.6666666667, ans=0.2 2023-10-13 10:59:23,072 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1362181.3333333333, ans=0.0 2023-10-13 11:00:26,350 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1362414.6666666667, ans=0.125 2023-10-13 11:00:30,450 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1362414.6666666667, ans=0.0 2023-10-13 11:00:59,883 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 11:01:12,023 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.774e+02 1.922e+02 2.113e+02 3.095e+02, threshold=3.844e+02, percent-clipped=0.0 2023-10-13 11:01:15,438 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1362601.3333333333, ans=0.1 2023-10-13 11:01:20,189 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1362648.0, ans=0.125 2023-10-13 11:01:23,742 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1362648.0, ans=0.125 2023-10-13 11:01:46,097 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1362741.3333333333, ans=0.1 2023-10-13 11:02:13,750 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1362834.6666666667, ans=0.1 2023-10-13 11:02:35,620 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1362928.0, ans=0.125 2023-10-13 11:02:35,779 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1362928.0, ans=0.2 2023-10-13 11:02:53,473 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1363021.3333333333, ans=0.125 2023-10-13 11:02:58,348 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1363021.3333333333, ans=0.0 2023-10-13 11:03:05,755 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.65 vs. limit=15.0 2023-10-13 11:03:15,798 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 1.775e+02 1.988e+02 2.213e+02 2.967e+02, threshold=3.976e+02, percent-clipped=0.0 2023-10-13 11:03:41,606 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.85 vs. 
limit=15.0 2023-10-13 11:04:20,102 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1363348.0, ans=0.0 2023-10-13 11:04:28,220 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1363348.0, ans=0.0 2023-10-13 11:04:28,336 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_na.min_abs, batch_count=1363348.0, ans=0.02 2023-10-13 11:04:44,330 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1363441.3333333333, ans=0.1 2023-10-13 11:04:44,427 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1363441.3333333333, ans=0.1 2023-10-13 11:04:50,138 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1363441.3333333333, ans=0.125 2023-10-13 11:04:54,906 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.93 vs. limit=15.0 2023-10-13 11:04:56,878 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1363488.0, ans=0.125 2023-10-13 11:05:05,738 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1363534.6666666667, ans=0.1 2023-10-13 11:05:10,904 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.525e+02 1.768e+02 2.037e+02 2.248e+02 4.172e+02, threshold=4.075e+02, percent-clipped=1.0 2023-10-13 11:05:22,997 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1363581.3333333333, ans=0.025 2023-10-13 11:05:27,141 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1363581.3333333333, ans=0.0 2023-10-13 11:05:33,938 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1363628.0, ans=0.09899494936611666 2023-10-13 11:05:33,981 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1363628.0, ans=0.0 2023-10-13 11:06:03,896 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.46 vs. limit=12.0 2023-10-13 11:06:20,631 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1363814.6666666667, ans=0.125 2023-10-13 11:06:27,846 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1363814.6666666667, ans=0.125 2023-10-13 11:06:31,368 INFO [train.py:1031] (0/4) Epoch 22, batch 5500, loss[loss=0.1984, simple_loss=0.2618, pruned_loss=0.06748, over 12313.00 frames. ], tot_loss[loss=0.1888, simple_loss=0.2801, pruned_loss=0.04875, over 30720592.30 frames. 
], batch size: 440, lr: 1.57e-03, grad_scale: 32.0 2023-10-13 11:06:35,763 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1363861.3333333333, ans=0.1 2023-10-13 11:07:08,831 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1364001.3333333333, ans=0.0 2023-10-13 11:07:09,586 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.700e+02 1.858e+02 1.994e+02 3.264e+02, threshold=3.715e+02, percent-clipped=0.0 2023-10-13 11:07:15,875 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1364048.0, ans=0.125 2023-10-13 11:07:18,167 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-10-13 11:07:18,899 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1364048.0, ans=0.125 2023-10-13 11:07:27,109 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1364048.0, ans=0.0 2023-10-13 11:07:49,393 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1364141.3333333333, ans=0.125 2023-10-13 11:08:01,437 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.92 vs. limit=12.0 2023-10-13 11:08:02,285 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.15 vs. limit=15.0 2023-10-13 11:08:20,976 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.93 vs. limit=6.0 2023-10-13 11:08:53,053 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1364421.3333333333, ans=0.125 2023-10-13 11:09:07,947 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.515e+02 1.733e+02 1.854e+02 2.013e+02 2.761e+02, threshold=3.709e+02, percent-clipped=0.0 2023-10-13 11:09:24,144 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.27 vs. limit=15.0 2023-10-13 11:09:35,012 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1364561.3333333333, ans=0.2 2023-10-13 11:09:36,858 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1364608.0, ans=0.0 2023-10-13 11:09:49,921 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1364654.6666666667, ans=0.0 2023-10-13 11:10:05,449 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1364701.3333333333, ans=0.125 2023-10-13 11:10:16,003 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1364748.0, ans=0.125 2023-10-13 11:10:21,947 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.85 vs. 
limit=12.0 2023-10-13 11:10:42,905 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1364841.3333333333, ans=0.95 2023-10-13 11:10:59,096 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1364888.0, ans=0.0 2023-10-13 11:11:02,062 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.87 vs. limit=15.0 2023-10-13 11:11:06,334 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.496e+02 1.799e+02 1.977e+02 2.222e+02 3.111e+02, threshold=3.955e+02, percent-clipped=0.0 2023-10-13 11:11:15,793 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1364981.3333333333, ans=0.0 2023-10-13 11:11:15,820 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1364981.3333333333, ans=0.125 2023-10-13 11:11:18,078 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1364981.3333333333, ans=0.2 2023-10-13 11:11:36,321 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1365074.6666666667, ans=0.0 2023-10-13 11:11:46,892 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1365121.3333333333, ans=0.1 2023-10-13 11:11:47,903 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1365121.3333333333, ans=0.0 2023-10-13 11:11:50,528 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1365121.3333333333, ans=0.125 2023-10-13 11:12:11,510 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1365214.6666666667, ans=0.125 2023-10-13 11:12:25,091 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1365261.3333333333, ans=0.125 2023-10-13 11:12:44,078 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1365308.0, ans=0.125 2023-10-13 11:13:06,948 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.525e+02 1.783e+02 2.040e+02 2.345e+02 3.210e+02, threshold=4.081e+02, percent-clipped=0.0 2023-10-13 11:13:13,174 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1365448.0, ans=0.2 2023-10-13 11:13:13,220 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1365448.0, ans=0.1 2023-10-13 11:13:31,211 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1365494.6666666667, ans=0.0 2023-10-13 11:13:39,970 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=1365541.3333333333, ans=0.05 2023-10-13 11:13:42,875 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 11:13:58,538 INFO 
[scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1365588.0, ans=0.125 2023-10-13 11:14:09,792 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1365634.6666666667, ans=0.125 2023-10-13 11:14:44,473 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1365774.6666666667, ans=0.0 2023-10-13 11:14:51,924 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1365821.3333333333, ans=15.0 2023-10-13 11:15:03,037 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1365868.0, ans=0.1 2023-10-13 11:15:04,363 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1365868.0, ans=0.125 2023-10-13 11:15:10,662 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 1.789e+02 1.891e+02 2.108e+02 2.744e+02, threshold=3.783e+02, percent-clipped=0.0 2023-10-13 11:15:23,268 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1365914.6666666667, ans=0.04949747468305833 2023-10-13 11:15:27,137 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=1365961.3333333333, ans=0.025 2023-10-13 11:15:31,846 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1365961.3333333333, ans=0.125 2023-10-13 11:15:43,062 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=1366008.0, ans=0.05 2023-10-13 11:16:08,357 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1366101.3333333333, ans=0.09899494936611666 2023-10-13 11:16:21,913 INFO [train.py:1031] (0/4) Epoch 22, batch 6000, loss[loss=0.1954, simple_loss=0.286, pruned_loss=0.05242, over 15803.00 frames. ], tot_loss[loss=0.189, simple_loss=0.2804, pruned_loss=0.04879, over 31202128.79 frames. ], batch size: 43, lr: 1.57e-03, grad_scale: 16.0 2023-10-13 11:16:25,117 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1366194.6666666667, ans=0.2 2023-10-13 11:16:28,119 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.60 vs. 
limit=15.0 2023-10-13 11:16:35,035 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1366241.3333333333, ans=0.125 2023-10-13 11:16:43,760 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1366241.3333333333, ans=0.025 2023-10-13 11:16:52,951 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1366288.0, ans=0.1 2023-10-13 11:17:06,601 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.467e+02 1.799e+02 1.953e+02 2.162e+02 2.821e+02, threshold=3.905e+02, percent-clipped=0.0 2023-10-13 11:17:25,577 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1366428.0, ans=0.1 2023-10-13 11:17:36,280 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1366474.6666666667, ans=0.0 2023-10-13 11:17:46,980 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1366521.3333333333, ans=0.1 2023-10-13 11:17:53,462 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1366521.3333333333, ans=0.2 2023-10-13 11:17:56,528 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.31 vs. limit=22.5 2023-10-13 11:18:07,920 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1366568.0, ans=0.0 2023-10-13 11:18:07,959 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1366568.0, ans=0.1 2023-10-13 11:18:24,344 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1366661.3333333333, ans=0.125 2023-10-13 11:18:33,205 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.38 vs. limit=15.0 2023-10-13 11:18:59,542 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1366801.3333333333, ans=0.125 2023-10-13 11:19:05,407 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.821e+02 2.015e+02 2.202e+02 2.685e+02, threshold=4.029e+02, percent-clipped=0.0 2023-10-13 11:19:06,760 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 11:19:10,976 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1366848.0, ans=0.125 2023-10-13 11:19:16,556 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.54 vs. limit=5.0 2023-10-13 11:19:23,099 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=5.07 vs. limit=15.0 2023-10-13 11:19:26,038 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.52 vs. 
limit=15.0 2023-10-13 11:19:31,048 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.25 vs. limit=15.0 2023-10-13 11:19:34,664 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1366941.3333333333, ans=0.1 2023-10-13 11:19:42,029 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 11:19:42,140 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1366941.3333333333, ans=0.09899494936611666 2023-10-13 11:19:54,634 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1366988.0, ans=0.125 2023-10-13 11:20:08,607 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1367034.6666666667, ans=0.125 2023-10-13 11:20:08,666 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1367034.6666666667, ans=0.1 2023-10-13 11:20:10,333 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1367034.6666666667, ans=0.1 2023-10-13 11:20:13,478 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 11:20:43,876 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1367174.6666666667, ans=0.125 2023-10-13 11:21:03,150 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1367268.0, ans=0.0 2023-10-13 11:21:10,630 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.544e+02 1.780e+02 1.932e+02 2.145e+02 2.662e+02, threshold=3.864e+02, percent-clipped=0.0 2023-10-13 11:21:21,941 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.59 vs. limit=15.0 2023-10-13 11:21:26,332 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.48 vs. limit=22.5 2023-10-13 11:21:35,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1367361.3333333333, ans=0.125 2023-10-13 11:21:40,194 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.78 vs. limit=12.0 2023-10-13 11:21:52,572 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.03 vs. 
limit=15.0 2023-10-13 11:22:06,673 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1367501.3333333333, ans=0.0 2023-10-13 11:22:21,699 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1367548.0, ans=0.125 2023-10-13 11:22:42,540 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1367641.3333333333, ans=0.0 2023-10-13 11:22:52,798 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.29 vs. limit=22.5 2023-10-13 11:23:03,056 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1367688.0, ans=0.1 2023-10-13 11:23:06,776 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1367734.6666666667, ans=0.2 2023-10-13 11:23:13,589 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.51 vs. limit=15.0 2023-10-13 11:23:14,897 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.560e+02 1.793e+02 2.039e+02 2.250e+02 3.260e+02, threshold=4.079e+02, percent-clipped=0.0 2023-10-13 11:23:18,059 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1367781.3333333333, ans=0.2 2023-10-13 11:23:28,152 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1367781.3333333333, ans=0.125 2023-10-13 11:23:37,773 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.51 vs. limit=6.0 2023-10-13 11:23:42,583 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1367874.6666666667, ans=0.2 2023-10-13 11:24:09,754 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1367921.3333333333, ans=0.1 2023-10-13 11:24:24,049 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1368014.6666666667, ans=0.0 2023-10-13 11:24:51,559 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1368108.0, ans=0.125 2023-10-13 11:24:51,713 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.36 vs. limit=22.5 2023-10-13 11:25:02,381 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.96 vs. 
limit=15.0 2023-10-13 11:25:11,626 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1368201.3333333333, ans=0.125 2023-10-13 11:25:19,654 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.421e+02 1.699e+02 1.820e+02 1.961e+02 2.790e+02, threshold=3.639e+02, percent-clipped=0.0 2023-10-13 11:25:22,060 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1368248.0, ans=0.5 2023-10-13 11:25:28,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1368248.0, ans=0.0 2023-10-13 11:25:50,624 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1368341.3333333333, ans=0.2 2023-10-13 11:25:51,480 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1368341.3333333333, ans=0.0 2023-10-13 11:26:27,462 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1368481.3333333333, ans=0.125 2023-10-13 11:26:32,861 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1368481.3333333333, ans=0.125 2023-10-13 11:26:34,761 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.56 vs. limit=15.0 2023-10-13 11:26:35,077 INFO [train.py:1031] (0/4) Epoch 22, batch 6500, loss[loss=0.1964, simple_loss=0.2936, pruned_loss=0.04958, over 16815.00 frames. ], tot_loss[loss=0.1895, simple_loss=0.281, pruned_loss=0.04903, over 31556303.94 frames. ], batch size: 175, lr: 1.57e-03, grad_scale: 16.0 2023-10-13 11:26:35,354 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1368528.0, ans=0.0 2023-10-13 11:26:40,003 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.99 vs. limit=10.0 2023-10-13 11:26:57,543 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1368574.6666666667, ans=0.1 2023-10-13 11:27:07,164 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1368621.3333333333, ans=0.2 2023-10-13 11:27:20,392 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.91 vs. 
limit=22.5 2023-10-13 11:27:31,226 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.532e+02 1.795e+02 2.010e+02 2.225e+02 2.848e+02, threshold=4.019e+02, percent-clipped=0.0 2023-10-13 11:27:37,964 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1368714.6666666667, ans=0.125 2023-10-13 11:27:48,784 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1368761.3333333333, ans=0.0 2023-10-13 11:28:10,723 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1368808.0, ans=0.125 2023-10-13 11:28:16,224 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1368854.6666666667, ans=0.0 2023-10-13 11:28:17,688 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1368854.6666666667, ans=0.0 2023-10-13 11:28:29,474 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1368901.3333333333, ans=0.125 2023-10-13 11:28:50,338 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1368994.6666666667, ans=0.125 2023-10-13 11:28:55,773 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1368994.6666666667, ans=0.125 2023-10-13 11:29:07,056 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1369041.3333333333, ans=0.0 2023-10-13 11:29:09,111 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1369041.3333333333, ans=0.125 2023-10-13 11:29:25,474 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1369088.0, ans=0.0 2023-10-13 11:29:28,305 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1369134.6666666667, ans=0.125 2023-10-13 11:29:30,148 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1369134.6666666667, ans=0.1 2023-10-13 11:29:35,327 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1369134.6666666667, ans=0.125 2023-10-13 11:29:36,126 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.501e+02 1.876e+02 2.013e+02 2.178e+02 2.772e+02, threshold=4.026e+02, percent-clipped=0.0 2023-10-13 11:29:49,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1369228.0, ans=0.0 2023-10-13 11:29:51,005 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.38 vs. 
limit=15.0 2023-10-13 11:29:59,980 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1369274.6666666667, ans=0.125 2023-10-13 11:30:08,212 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1369274.6666666667, ans=0.125 2023-10-13 11:30:17,966 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1369321.3333333333, ans=0.125 2023-10-13 11:30:22,705 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.62 vs. limit=22.5 2023-10-13 11:31:12,295 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.60 vs. limit=6.0 2023-10-13 11:31:18,525 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 11:31:22,200 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1369601.3333333333, ans=0.1 2023-10-13 11:31:23,007 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1369601.3333333333, ans=0.125 2023-10-13 11:31:25,297 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 11:31:25,470 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.15 vs. limit=15.0 2023-10-13 11:31:29,491 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1369601.3333333333, ans=0.0 2023-10-13 11:31:30,251 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.820e+02 1.974e+02 2.279e+02 3.045e+02, threshold=3.948e+02, percent-clipped=0.0 2023-10-13 11:31:37,494 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1369648.0, ans=0.0 2023-10-13 11:31:45,391 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1369694.6666666667, ans=0.125 2023-10-13 11:31:48,548 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.50 vs. limit=22.5 2023-10-13 11:31:57,030 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.31 vs. limit=22.5 2023-10-13 11:32:16,027 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.76 vs. limit=22.5 2023-10-13 11:32:20,089 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.08 vs. limit=22.5 2023-10-13 11:32:42,443 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=12.36 vs. 
limit=15.0 2023-10-13 11:33:09,633 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1369974.6666666667, ans=0.1 2023-10-13 11:33:15,567 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1369974.6666666667, ans=0.125 2023-10-13 11:33:33,738 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1370021.3333333333, ans=0.07 2023-10-13 11:33:35,422 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1370021.3333333333, ans=0.125 2023-10-13 11:33:48,572 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.401e+02 1.701e+02 1.855e+02 2.110e+02 2.887e+02, threshold=3.710e+02, percent-clipped=0.0 2023-10-13 11:33:58,668 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1370114.6666666667, ans=0.0 2023-10-13 11:34:01,004 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.78 vs. limit=22.5 2023-10-13 11:34:03,652 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.78 vs. limit=10.0 2023-10-13 11:34:18,308 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1370208.0, ans=0.5 2023-10-13 11:34:20,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1370208.0, ans=0.125 2023-10-13 11:34:23,704 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1370208.0, ans=0.125 2023-10-13 11:34:36,857 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1370254.6666666667, ans=0.0 2023-10-13 11:34:38,698 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1370254.6666666667, ans=0.2 2023-10-13 11:34:53,723 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.86 vs. limit=15.0 2023-10-13 11:34:59,516 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1370348.0, ans=0.125 2023-10-13 11:35:07,003 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1370394.6666666667, ans=0.05 2023-10-13 11:35:28,379 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1370441.3333333333, ans=0.1 2023-10-13 11:35:51,522 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.382e+02 1.692e+02 1.895e+02 2.144e+02 2.983e+02, threshold=3.790e+02, percent-clipped=0.0 2023-10-13 11:36:00,307 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1370581.3333333333, ans=0.1 2023-10-13 11:36:05,391 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.76 vs. 
limit=15.0 2023-10-13 11:36:09,785 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1370628.0, ans=0.125 2023-10-13 11:36:24,526 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1370674.6666666667, ans=0.125 2023-10-13 11:36:40,654 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.31 vs. limit=15.0 2023-10-13 11:36:59,243 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.65 vs. limit=10.0 2023-10-13 11:37:03,928 INFO [train.py:1031] (0/4) Epoch 22, batch 7000, loss[loss=0.2054, simple_loss=0.2902, pruned_loss=0.0603, over 16889.00 frames. ], tot_loss[loss=0.1894, simple_loss=0.2812, pruned_loss=0.04882, over 31843489.39 frames. ], batch size: 110, lr: 1.57e-03, grad_scale: 32.0 2023-10-13 11:37:09,005 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.86 vs. limit=22.5 2023-10-13 11:37:25,127 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1370908.0, ans=0.0 2023-10-13 11:37:29,115 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1370908.0, ans=0.09899494936611666 2023-10-13 11:37:29,127 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1370908.0, ans=0.09899494936611666 2023-10-13 11:37:31,951 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1370954.6666666667, ans=0.0 2023-10-13 11:37:46,673 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1371001.3333333333, ans=0.05 2023-10-13 11:37:54,109 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.501e+02 1.949e+02 2.157e+02 2.372e+02 3.551e+02, threshold=4.314e+02, percent-clipped=0.0 2023-10-13 11:38:01,212 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.34 vs. limit=15.0 2023-10-13 11:38:17,063 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1371094.6666666667, ans=0.125 2023-10-13 11:38:17,245 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.17 vs. limit=15.0 2023-10-13 11:38:22,586 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1371141.3333333333, ans=0.0 2023-10-13 11:38:25,255 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1371141.3333333333, ans=0.125 2023-10-13 11:38:37,365 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.08 vs. 
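
The loss fields in the batch summary above are mutually consistent with the reported loss being a weighted sum of the simple and pruned transducer losses, with a 0.5 weight on the simple term: 0.5 * 0.2902 + 0.0603 = 0.2054 for the current batch, and 0.5 * 0.2812 + 0.04882 = 0.18942, which rounds to the logged tot_loss of 0.1894. A one-line sketch of that combination (the 0.5 weight is inferred from the logged numbers, not taken from the code):

    def combined_loss(simple_loss: float, pruned_loss: float,
                      simple_scale: float = 0.5) -> float:
        # Matches the log: 0.5 * 0.2902 + 0.0603 = 0.2054
        return simple_scale * simple_loss + pruned_loss

    assert abs(combined_loss(0.2902, 0.0603) - 0.2054) < 1e-4
    assert abs(combined_loss(0.2812, 0.04882) - 0.1894) < 1e-3
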
limit=12.0 2023-10-13 11:39:29,725 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1371421.3333333333, ans=0.2 2023-10-13 11:39:50,117 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1371468.0, ans=0.125 2023-10-13 11:39:51,137 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1371468.0, ans=0.125 2023-10-13 11:39:53,454 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.549e+02 1.914e+02 2.082e+02 2.363e+02 3.072e+02, threshold=4.164e+02, percent-clipped=0.0 2023-10-13 11:39:53,741 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1371514.6666666667, ans=0.1 2023-10-13 11:39:57,521 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1371514.6666666667, ans=0.0 2023-10-13 11:40:07,820 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1371561.3333333333, ans=0.125 2023-10-13 11:40:31,773 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 11:40:52,363 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1371748.0, ans=0.2 2023-10-13 11:41:01,751 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1371794.6666666667, ans=0.0 2023-10-13 11:41:11,676 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1371794.6666666667, ans=0.035 2023-10-13 11:41:38,065 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.84 vs. limit=15.0 2023-10-13 11:41:59,143 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.29 vs. 
limit=22.5 2023-10-13 11:42:05,061 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1371934.6666666667, ans=0.125 2023-10-13 11:42:07,514 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.382e+02 1.738e+02 1.893e+02 2.098e+02 2.690e+02, threshold=3.786e+02, percent-clipped=0.0 2023-10-13 11:42:16,423 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1371981.3333333333, ans=0.125 2023-10-13 11:42:18,463 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1371981.3333333333, ans=0.0 2023-10-13 11:42:34,969 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1372074.6666666667, ans=0.125 2023-10-13 11:42:46,291 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1372121.3333333333, ans=0.2 2023-10-13 11:43:00,291 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1372168.0, ans=0.1 2023-10-13 11:43:34,347 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1372261.3333333333, ans=0.0 2023-10-13 11:43:35,607 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.00 vs. limit=10.0 2023-10-13 11:43:40,009 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.27 vs. limit=15.0 2023-10-13 11:43:49,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1372308.0, ans=0.0 2023-10-13 11:43:50,722 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1372308.0, ans=0.1 2023-10-13 11:44:06,307 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1372354.6666666667, ans=0.125 2023-10-13 11:44:10,550 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1372401.3333333333, ans=0.1 2023-10-13 11:44:10,648 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1372401.3333333333, ans=0.125 2023-10-13 11:44:10,867 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. limit=6.0 2023-10-13 11:44:14,478 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1372401.3333333333, ans=0.025 2023-10-13 11:44:20,347 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.774e+02 1.976e+02 2.251e+02 3.269e+02, threshold=3.951e+02, percent-clipped=0.0 2023-10-13 11:44:29,503 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.42 vs. 
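
The ScheduledFloat entries that dominate this log record hyperparameters (dropout probabilities, skip rates, balancer probabilities, whitening limits) whose values are functions of batch_count rather than constants; each entry prints the schedule name, the current batch_count, and the resulting value as ans. A minimal sketch of one plausible form, a piecewise-linear schedule over batch count (illustrative; not the actual scaling.py class):

    class PiecewiseLinearSchedule:
        """A float hyperparameter defined by (batch_count, value) knots,
        linearly interpolated in between and clamped at the ends."""

        def __init__(self, *knots):
            self.knots = sorted(knots)  # e.g. (0, 0.2), (4000, 0.1), (8000, 0.0)

        def value(self, batch_count):
            ks = self.knots
            if batch_count <= ks[0][0]:
                return ks[0][1]
            for (x0, y0), (x1, y1) in zip(ks, ks[1:]):
                if batch_count <= x1:
                    t = (batch_count - x0) / (x1 - x0)
                    return y0 + t * (y1 - y0)
            return ks[-1][1]

    skip_rate = PiecewiseLinearSchedule((0, 0.2), (4000, 0.1), (8000, 0.0))
    print(skip_rate.value(6000))  # 0.05, halfway between the last two knots
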
limit=15.0 2023-10-13 11:44:36,180 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1372494.6666666667, ans=0.125 2023-10-13 11:44:37,351 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.86 vs. limit=15.0 2023-10-13 11:45:05,760 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1372588.0, ans=0.0 2023-10-13 11:45:23,102 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1372634.6666666667, ans=0.0 2023-10-13 11:45:26,942 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.02 vs. limit=15.0 2023-10-13 11:45:39,892 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1372728.0, ans=0.125 2023-10-13 11:45:41,628 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1372728.0, ans=0.125 2023-10-13 11:45:55,596 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.49 vs. limit=15.0 2023-10-13 11:46:09,676 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1372821.3333333333, ans=0.125 2023-10-13 11:46:11,995 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1372821.3333333333, ans=10.0 2023-10-13 11:46:23,136 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1372868.0, ans=0.125 2023-10-13 11:46:24,635 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.494e+02 1.847e+02 2.094e+02 2.480e+02 3.545e+02, threshold=4.188e+02, percent-clipped=0.0 2023-10-13 11:46:31,201 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.34 vs. 
limit=12.0 2023-10-13 11:46:34,039 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1372914.6666666667, ans=0.125 2023-10-13 11:46:41,722 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1372961.3333333333, ans=0.125 2023-10-13 11:46:45,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1372961.3333333333, ans=0.0 2023-10-13 11:46:54,613 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1373008.0, ans=0.125 2023-10-13 11:46:57,397 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1373008.0, ans=0.125 2023-10-13 11:46:59,176 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1373054.6666666667, ans=0.125 2023-10-13 11:47:08,298 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1373054.6666666667, ans=0.1 2023-10-13 11:47:27,561 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1373148.0, ans=0.1 2023-10-13 11:47:33,958 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1373148.0, ans=0.125 2023-10-13 11:47:37,281 INFO [train.py:1031] (0/4) Epoch 22, batch 7500, loss[loss=0.1737, simple_loss=0.2647, pruned_loss=0.04135, over 16665.00 frames. ], tot_loss[loss=0.1894, simple_loss=0.2811, pruned_loss=0.04885, over 32058326.85 frames. ], batch size: 61, lr: 1.57e-03, grad_scale: 16.0 2023-10-13 11:47:50,345 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.95 vs. limit=15.0 2023-10-13 11:48:00,814 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.29 vs. limit=15.0 2023-10-13 11:48:03,282 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1373288.0, ans=0.09899494936611666 2023-10-13 11:48:05,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1373288.0, ans=0.1 2023-10-13 11:48:07,577 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.37 vs. limit=15.0 2023-10-13 11:48:22,297 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.325e+02 1.778e+02 1.957e+02 2.210e+02 3.249e+02, threshold=3.914e+02, percent-clipped=0.0 2023-10-13 11:48:26,854 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.79 vs. 
limit=15.0 2023-10-13 11:48:55,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1373474.6666666667, ans=6.0 2023-10-13 11:49:20,165 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1373568.0, ans=0.125 2023-10-13 11:49:36,310 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.21 vs. limit=15.0 2023-10-13 11:49:46,060 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.56 vs. limit=15.0 2023-10-13 11:50:05,378 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1373754.6666666667, ans=0.1 2023-10-13 11:50:30,353 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.507e+02 1.791e+02 1.951e+02 2.143e+02 2.932e+02, threshold=3.902e+02, percent-clipped=0.0 2023-10-13 11:51:06,308 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.51 vs. limit=6.0 2023-10-13 11:51:16,866 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1373988.0, ans=0.0 2023-10-13 11:51:17,802 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1373988.0, ans=0.035 2023-10-13 11:51:23,127 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.92 vs. limit=22.5 2023-10-13 11:52:05,299 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1374174.6666666667, ans=0.125 2023-10-13 11:52:28,037 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1374268.0, ans=0.0 2023-10-13 11:52:28,170 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1374268.0, ans=0.1 2023-10-13 11:52:29,696 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.380e+02 1.716e+02 1.956e+02 2.273e+02 4.144e+02, threshold=3.912e+02, percent-clipped=1.0 2023-10-13 11:52:44,579 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1374361.3333333333, ans=0.1 2023-10-13 11:52:55,021 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1374408.0, ans=0.07 2023-10-13 11:52:59,545 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1374408.0, ans=0.0 2023-10-13 11:53:00,554 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1374408.0, ans=0.0 2023-10-13 11:53:12,794 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1374454.6666666667, ans=0.0 2023-10-13 11:53:41,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1374594.6666666667, ans=0.125 2023-10-13 11:54:02,897 INFO [scaling.py:979] (0/4) Whitening: 
name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=12.0 2023-10-13 11:54:07,178 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1374688.0, ans=0.1 2023-10-13 11:54:13,167 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1374688.0, ans=0.125 2023-10-13 11:54:31,640 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1374734.6666666667, ans=0.125 2023-10-13 11:54:31,771 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1374734.6666666667, ans=0.0 2023-10-13 11:54:36,174 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 1.775e+02 1.947e+02 2.158e+02 2.896e+02, threshold=3.893e+02, percent-clipped=0.0 2023-10-13 11:54:42,345 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 11:54:42,447 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1374781.3333333333, ans=0.0 2023-10-13 11:54:45,109 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1374781.3333333333, ans=0.125 2023-10-13 11:54:56,234 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1374828.0, ans=0.125 2023-10-13 11:54:58,349 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1374828.0, ans=0.125 2023-10-13 11:55:13,419 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1374921.3333333333, ans=0.125 2023-10-13 11:55:13,687 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.58 vs. limit=10.0 2023-10-13 11:55:36,190 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1375014.6666666667, ans=0.125 2023-10-13 11:55:41,137 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1375014.6666666667, ans=0.125 2023-10-13 11:55:41,226 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1375014.6666666667, ans=0.1 2023-10-13 11:56:01,670 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1375108.0, ans=0.125 2023-10-13 11:56:12,867 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_abs, batch_count=1375154.6666666667, ans=0.5 2023-10-13 11:56:18,248 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.00 vs. limit=15.0 2023-10-13 11:56:28,828 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.86 vs. limit=10.0 2023-10-13 11:56:35,667 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.92 vs. 
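
The Whitening entries compare a "whiteness" metric of some activation's channel covariance against a scheduled limit (metric=X vs. limit=Y); values near 1.0 indicate an isotropic covariance, larger values indicate a few dominant directions, and a corrective penalty is presumably applied only when the metric exceeds the limit. One way to compute such a metric, as a sketch under that assumption (not necessarily the exact scaling.py formula):

    import torch

    def whitening_metric(x: torch.Tensor) -> torch.Tensor:
        # x: (num_frames, num_channels) activations.
        x = x - x.mean(dim=0)              # zero-mean per channel
        cov = (x.t() @ x) / x.shape[0]     # channel covariance, (C, C)
        eigs = torch.linalg.eigvalsh(cov)  # eigenvalue spectrum
        # Mean squared eigenvalue over squared mean eigenvalue: equals 1.0
        # iff all eigenvalues are equal (covariance proportional to the
        # identity), and grows with the spread of the spectrum.
        return (eigs ** 2).mean() / eigs.mean() ** 2

    x = torch.randn(1000, 256)
    print(whitening_metric(x))  # near 1.0 for white noise; larger when correlated
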
limit=15.0 2023-10-13 11:56:39,470 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.468e+02 1.764e+02 1.965e+02 2.377e+02 3.465e+02, threshold=3.929e+02, percent-clipped=0.0 2023-10-13 11:56:51,999 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1375248.0, ans=0.0 2023-10-13 11:57:12,024 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1375341.3333333333, ans=0.2 2023-10-13 11:57:14,701 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1375341.3333333333, ans=0.125 2023-10-13 11:57:46,250 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1375481.3333333333, ans=0.125 2023-10-13 11:57:55,382 INFO [train.py:1031] (0/4) Epoch 22, batch 8000, loss[loss=0.1936, simple_loss=0.2854, pruned_loss=0.05094, over 16675.00 frames. ], tot_loss[loss=0.1887, simple_loss=0.2806, pruned_loss=0.04842, over 32236472.47 frames. ], batch size: 202, lr: 1.57e-03, grad_scale: 32.0 2023-10-13 11:58:18,639 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=1375621.3333333333, ans=0.025 2023-10-13 11:58:42,152 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.425e+02 1.728e+02 1.885e+02 2.076e+02 3.337e+02, threshold=3.769e+02, percent-clipped=0.0 2023-10-13 11:58:57,890 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1375761.3333333333, ans=0.125 2023-10-13 11:59:00,607 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1375761.3333333333, ans=0.125 2023-10-13 11:59:58,975 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1375994.6666666667, ans=0.2 2023-10-13 12:00:05,400 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1376041.3333333333, ans=0.0 2023-10-13 12:00:15,672 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1376088.0, ans=0.1 2023-10-13 12:00:35,726 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.484e+02 1.832e+02 2.043e+02 2.283e+02 3.630e+02, threshold=4.087e+02, percent-clipped=0.0 2023-10-13 12:00:38,083 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1376181.3333333333, ans=0.1 2023-10-13 12:00:43,722 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1376181.3333333333, ans=0.035 2023-10-13 12:00:59,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1376228.0, ans=0.2 2023-10-13 12:01:32,074 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1376321.3333333333, ans=0.5 2023-10-13 12:01:32,091 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1376321.3333333333, ans=0.2 2023-10-13 12:02:12,303 INFO [scaling.py:979] (0/4) Whitening: 
name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.43 vs. limit=6.0 2023-10-13 12:02:13,032 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1376461.3333333333, ans=0.125 2023-10-13 12:02:14,050 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1376461.3333333333, ans=0.0 2023-10-13 12:02:22,546 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1376508.0, ans=0.125 2023-10-13 12:02:57,839 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.568e+02 1.775e+02 1.877e+02 2.043e+02 3.329e+02, threshold=3.754e+02, percent-clipped=0.0 2023-10-13 12:03:11,507 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1376694.6666666667, ans=0.125 2023-10-13 12:03:38,213 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 12:03:43,079 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1376834.6666666667, ans=0.125 2023-10-13 12:03:49,966 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1376834.6666666667, ans=0.125 2023-10-13 12:04:06,426 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1376881.3333333333, ans=0.125 2023-10-13 12:04:20,785 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1376928.0, ans=0.0 2023-10-13 12:04:58,279 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.743e+02 1.964e+02 2.153e+02 4.153e+02, threshold=3.927e+02, percent-clipped=1.0 2023-10-13 12:05:08,600 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1377161.3333333333, ans=0.125 2023-10-13 12:05:20,887 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1377208.0, ans=0.125 2023-10-13 12:05:24,913 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1377208.0, ans=0.1 2023-10-13 12:05:26,511 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.77 vs. limit=15.0 2023-10-13 12:05:38,054 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.54 vs. limit=15.0 2023-10-13 12:05:53,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1377348.0, ans=0.125 2023-10-13 12:05:59,442 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.76 vs. 
limit=10.0 2023-10-13 12:06:34,487 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1377488.0, ans=0.2 2023-10-13 12:06:52,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1377534.6666666667, ans=0.0 2023-10-13 12:06:54,288 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.86 vs. limit=6.0 2023-10-13 12:07:02,289 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.533e+02 1.875e+02 2.037e+02 2.268e+02 3.290e+02, threshold=4.074e+02, percent-clipped=0.0 2023-10-13 12:07:12,242 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1377628.0, ans=0.125 2023-10-13 12:07:13,188 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1377628.0, ans=0.0 2023-10-13 12:07:15,829 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1377628.0, ans=0.0 2023-10-13 12:07:19,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1377628.0, ans=0.1 2023-10-13 12:07:43,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1377721.3333333333, ans=0.125 2023-10-13 12:07:44,489 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1377721.3333333333, ans=0.0 2023-10-13 12:07:52,984 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1377768.0, ans=0.125 2023-10-13 12:08:03,317 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.36 vs. limit=15.0 2023-10-13 12:08:20,415 INFO [train.py:1031] (0/4) Epoch 22, batch 8500, loss[loss=0.2023, simple_loss=0.2855, pruned_loss=0.05953, over 16413.00 frames. ], tot_loss[loss=0.1889, simple_loss=0.281, pruned_loss=0.04846, over 32369757.21 frames. ], batch size: 44, lr: 1.56e-03, grad_scale: 16.0 2023-10-13 12:08:22,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1377861.3333333333, ans=0.0 2023-10-13 12:08:34,312 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1377908.0, ans=0.125 2023-10-13 12:08:35,404 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1377908.0, ans=0.0 2023-10-13 12:08:46,658 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1377954.6666666667, ans=0.1 2023-10-13 12:09:04,176 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.37 vs. 
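
The grad_scale reported in the batch summaries (moving between 16.0 and 32.0 in this region) is the dynamic loss scale used for fp16 training: the loss is multiplied by this factor before backward so that half-precision gradients stay representable, and the factor is grown after a run of stable steps and cut when overflows are detected. A generic PyTorch AMP loop of that shape (a sketch with placeholder names, not the project's train.py):

    import torch

    scaler = torch.cuda.amp.GradScaler()  # owns the dynamic grad_scale

    def training_step(model, optimizer, inputs, targets, criterion):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast(dtype=torch.float16):
            loss = criterion(model(inputs), targets)
        scaler.scale(loss).backward()  # backward on the scaled loss
        scaler.step(optimizer)         # unscales grads; skips step on overflow
        scaler.update()                # grow or shrink the scale factor
        return loss.detach()
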
limit=15.0 2023-10-13 12:09:10,809 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.507e+02 1.798e+02 1.973e+02 2.208e+02 3.254e+02, threshold=3.945e+02, percent-clipped=0.0 2023-10-13 12:09:18,689 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.21 vs. limit=15.0 2023-10-13 12:09:38,030 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1378141.3333333333, ans=0.125 2023-10-13 12:09:50,019 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1378188.0, ans=0.1 2023-10-13 12:09:52,055 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1378188.0, ans=0.125 2023-10-13 12:10:00,373 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.59 vs. limit=15.0 2023-10-13 12:10:32,059 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1378328.0, ans=0.125 2023-10-13 12:10:38,127 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1378328.0, ans=0.125 2023-10-13 12:10:43,505 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1378374.6666666667, ans=0.125 2023-10-13 12:10:44,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1378374.6666666667, ans=0.2 2023-10-13 12:10:50,120 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1378374.6666666667, ans=0.125 2023-10-13 12:11:20,093 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.384e+02 1.730e+02 1.932e+02 2.176e+02 3.111e+02, threshold=3.863e+02, percent-clipped=0.0 2023-10-13 12:11:27,230 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1378514.6666666667, ans=0.0 2023-10-13 12:11:38,947 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1378561.3333333333, ans=0.07 2023-10-13 12:12:27,195 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.26 vs. limit=15.0 2023-10-13 12:12:28,018 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1378748.0, ans=0.125 2023-10-13 12:12:41,500 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1378794.6666666667, ans=0.1 2023-10-13 12:13:00,281 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.05 vs. 
limit=6.0 2023-10-13 12:13:07,971 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1378888.0, ans=0.125 2023-10-13 12:13:33,342 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.704e+02 1.930e+02 2.181e+02 3.007e+02, threshold=3.861e+02, percent-clipped=0.0 2023-10-13 12:14:06,489 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1379074.6666666667, ans=0.04949747468305833 2023-10-13 12:14:11,917 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten.whitening_limit, batch_count=1379074.6666666667, ans=22.5 2023-10-13 12:14:14,817 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 12:14:31,480 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1379168.0, ans=0.125 2023-10-13 12:14:37,177 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1379168.0, ans=0.125 2023-10-13 12:14:37,217 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1379168.0, ans=0.125 2023-10-13 12:14:48,270 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1379214.6666666667, ans=0.0 2023-10-13 12:14:55,969 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1379261.3333333333, ans=0.2 2023-10-13 12:14:56,711 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1379261.3333333333, ans=0.125 2023-10-13 12:15:35,595 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1379354.6666666667, ans=0.0 2023-10-13 12:15:45,967 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1379401.3333333333, ans=0.07 2023-10-13 12:15:46,853 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1379401.3333333333, ans=0.0 2023-10-13 12:15:49,004 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1379448.0, ans=0.1 2023-10-13 12:15:52,086 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.436e+02 1.663e+02 1.862e+02 2.041e+02 2.848e+02, threshold=3.723e+02, percent-clipped=0.0 2023-10-13 12:15:56,902 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1379448.0, ans=0.0 2023-10-13 12:16:02,257 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.60 vs. 
limit=15.0 2023-10-13 12:16:10,148 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1379494.6666666667, ans=0.125 2023-10-13 12:16:12,949 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1379541.3333333333, ans=0.125 2023-10-13 12:16:16,626 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1379541.3333333333, ans=0.0 2023-10-13 12:16:23,567 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1379588.0, ans=15.0 2023-10-13 12:16:26,054 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1379588.0, ans=0.125 2023-10-13 12:16:31,977 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1379588.0, ans=0.1 2023-10-13 12:16:32,159 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1379588.0, ans=0.2 2023-10-13 12:16:49,429 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1379681.3333333333, ans=0.1 2023-10-13 12:16:55,210 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1379681.3333333333, ans=0.0 2023-10-13 12:17:11,568 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.14 vs. limit=22.5 2023-10-13 12:17:13,352 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1379774.6666666667, ans=0.0 2023-10-13 12:17:35,166 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1379868.0, ans=0.2 2023-10-13 12:17:49,514 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.494e+02 1.786e+02 1.951e+02 2.121e+02 2.852e+02, threshold=3.902e+02, percent-clipped=0.0 2023-10-13 12:18:22,696 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.42 vs. limit=15.0 2023-10-13 12:18:23,378 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1380054.6666666667, ans=0.1 2023-10-13 12:18:45,672 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.11 vs. limit=15.0 2023-10-13 12:18:56,733 INFO [train.py:1031] (0/4) Epoch 22, batch 9000, loss[loss=0.1745, simple_loss=0.2689, pruned_loss=0.04002, over 16479.00 frames. ], tot_loss[loss=0.1886, simple_loss=0.2805, pruned_loss=0.04837, over 32476796.12 frames. ], batch size: 50, lr: 1.56e-03, grad_scale: 32.0 2023-10-13 12:19:02,532 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.70 vs. 
limit=10.0 2023-10-13 12:19:12,253 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1380241.3333333333, ans=0.0 2023-10-13 12:19:22,750 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.77 vs. limit=15.0 2023-10-13 12:19:24,829 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1380288.0, ans=0.125 2023-10-13 12:19:39,724 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1380334.6666666667, ans=0.1 2023-10-13 12:19:47,026 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.791e+02 1.958e+02 2.140e+02 2.839e+02, threshold=3.917e+02, percent-clipped=0.0 2023-10-13 12:20:07,716 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.32 vs. limit=12.0 2023-10-13 12:20:08,936 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1380474.6666666667, ans=0.0 2023-10-13 12:20:16,016 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1380474.6666666667, ans=0.1 2023-10-13 12:20:26,677 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1380521.3333333333, ans=0.125 2023-10-13 12:20:27,686 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1380521.3333333333, ans=0.2 2023-10-13 12:20:33,197 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.35 vs. limit=6.0 2023-10-13 12:20:37,578 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.17 vs. limit=22.5 2023-10-13 12:20:49,426 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1380614.6666666667, ans=0.125 2023-10-13 12:21:25,875 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.47 vs. 
limit=22.5 2023-10-13 12:21:26,758 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1380801.3333333333, ans=0.1 2023-10-13 12:21:38,171 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1380848.0, ans=0.125 2023-10-13 12:21:38,831 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.380e+02 1.726e+02 1.899e+02 2.091e+02 2.641e+02, threshold=3.798e+02, percent-clipped=0.0 2023-10-13 12:21:58,372 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1380941.3333333333, ans=0.125 2023-10-13 12:22:06,603 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1380941.3333333333, ans=0.07 2023-10-13 12:22:06,694 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1380941.3333333333, ans=0.125 2023-10-13 12:22:07,470 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1380988.0, ans=0.125 2023-10-13 12:22:12,376 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.17 vs. limit=15.0 2023-10-13 12:22:18,431 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1381034.6666666667, ans=0.125 2023-10-13 12:22:22,001 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1381034.6666666667, ans=0.0 2023-10-13 12:22:45,502 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1381128.0, ans=0.1 2023-10-13 12:22:58,416 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1381174.6666666667, ans=0.0 2023-10-13 12:23:09,297 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1381221.3333333333, ans=0.125 2023-10-13 12:23:17,625 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1381268.0, ans=0.125 2023-10-13 12:23:17,690 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1381268.0, ans=0.0 2023-10-13 12:23:21,411 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1381268.0, ans=0.2 2023-10-13 12:23:23,727 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.37 vs. 
limit=12.0 2023-10-13 12:23:24,351 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1381314.6666666667, ans=0.0 2023-10-13 12:23:25,992 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.811e+02 2.014e+02 2.178e+02 3.000e+02, threshold=4.029e+02, percent-clipped=0.0 2023-10-13 12:23:26,364 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-296000.pt 2023-10-13 12:23:30,375 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.48 vs. limit=6.0 2023-10-13 12:23:31,993 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.04 vs. limit=22.5 2023-10-13 12:23:38,335 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1381361.3333333333, ans=0.0 2023-10-13 12:23:47,778 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1381408.0, ans=0.0 2023-10-13 12:23:48,823 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1381408.0, ans=0.2 2023-10-13 12:24:05,960 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.39 vs. limit=15.0 2023-10-13 12:24:31,393 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1381594.6666666667, ans=0.0 2023-10-13 12:24:34,377 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1381594.6666666667, ans=0.09899494936611666 2023-10-13 12:24:40,394 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1381594.6666666667, ans=0.0 2023-10-13 12:24:42,989 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1381641.3333333333, ans=0.125 2023-10-13 12:24:58,219 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1381688.0, ans=0.0 2023-10-13 12:25:13,192 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.68 vs. limit=15.0 2023-10-13 12:25:23,681 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.23 vs. limit=15.0 2023-10-13 12:25:27,548 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.573e+02 1.777e+02 1.930e+02 2.177e+02 2.869e+02, threshold=3.860e+02, percent-clipped=0.0 2023-10-13 12:25:36,270 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1381828.0, ans=0.1 2023-10-13 12:26:03,065 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 12:26:29,548 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.68 vs. 
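
The checkpoint saved above is named by the cumulative training-batch index (checkpoint-296000.pt), i.e. saving happens on a fixed batch interval rather than only at epoch boundaries. A sketch of that scheme (the interval of 8000 is an assumption that is merely consistent with the logged index, and the function name is hypothetical):

    from pathlib import Path
    import torch

    def maybe_save_checkpoint(model, optimizer, batch_idx_train,
                              exp_dir="zipformer/exp_XL_bpe", every_n=8000):
        # Save on a fixed batch interval, naming the file by the global batch
        # index, e.g. checkpoint-296000.pt as in the log line above.
        if batch_idx_train % every_n == 0:
            path = Path(exp_dir) / f"checkpoint-{batch_idx_train}.pt"
            torch.save({"model": model.state_dict(),
                        "optimizer": optimizer.state_dict(),
                        "batch_idx_train": batch_idx_train}, path)
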
limit=15.0 2023-10-13 12:26:39,838 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1382014.6666666667, ans=0.125 2023-10-13 12:26:55,192 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.88 vs. limit=6.0 2023-10-13 12:26:55,323 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.31 vs. limit=22.5 2023-10-13 12:26:58,363 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.66 vs. limit=15.0 2023-10-13 12:27:02,432 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1382108.0, ans=0.2 2023-10-13 12:27:12,154 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1382108.0, ans=0.125 2023-10-13 12:27:15,903 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1382154.6666666667, ans=0.125 2023-10-13 12:27:33,525 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1382201.3333333333, ans=0.125 2023-10-13 12:27:37,878 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 12:27:41,617 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 12:27:44,832 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.445e+02 1.828e+02 1.987e+02 2.304e+02 3.669e+02, threshold=3.974e+02, percent-clipped=0.0 2023-10-13 12:28:21,340 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.05 vs. limit=22.5 2023-10-13 12:28:24,539 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=5.11 vs. limit=12.0 2023-10-13 12:28:49,391 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1382481.3333333333, ans=0.125 2023-10-13 12:28:55,095 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. limit=6.0 2023-10-13 12:28:55,523 INFO [train.py:1031] (0/4) Epoch 22, batch 9500, loss[loss=0.1821, simple_loss=0.2902, pruned_loss=0.03701, over 16877.00 frames. ], tot_loss[loss=0.1892, simple_loss=0.2812, pruned_loss=0.04858, over 32555656.95 frames. ], batch size: 104, lr: 1.56e-03, grad_scale: 32.0 2023-10-13 12:29:01,202 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=5.86 vs. 
limit=12.0 2023-10-13 12:29:10,125 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 12:29:10,193 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1382574.6666666667, ans=0.1 2023-10-13 12:29:26,101 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1382621.3333333333, ans=0.125 2023-10-13 12:29:47,735 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.403e+02 1.862e+02 2.055e+02 2.288e+02 2.911e+02, threshold=4.109e+02, percent-clipped=0.0 2023-10-13 12:29:55,291 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.56 vs. limit=10.0 2023-10-13 12:30:05,579 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 12:30:25,209 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=1382854.6666666667, ans=0.05 2023-10-13 12:30:31,974 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1382901.3333333333, ans=0.0 2023-10-13 12:30:45,694 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1382948.0, ans=0.125 2023-10-13 12:30:46,862 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1382948.0, ans=0.125 2023-10-13 12:30:47,118 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.20 vs. limit=12.0 2023-10-13 12:31:05,794 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1383041.3333333333, ans=0.04949747468305833 2023-10-13 12:31:08,350 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1383041.3333333333, ans=0.0 2023-10-13 12:31:10,492 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1383041.3333333333, ans=0.125 2023-10-13 12:31:14,035 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1383041.3333333333, ans=0.125 2023-10-13 12:31:46,818 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.794e+02 1.928e+02 2.166e+02 3.044e+02, threshold=3.855e+02, percent-clipped=0.0 2023-10-13 12:31:49,256 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1383181.3333333333, ans=0.125 2023-10-13 12:32:04,779 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1383228.0, ans=0.125 2023-10-13 12:32:04,933 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.18 vs. 
limit=12.0 2023-10-13 12:32:35,836 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1383368.0, ans=0.0 2023-10-13 12:32:38,701 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.58 vs. limit=22.5 2023-10-13 12:33:01,532 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1383461.3333333333, ans=0.125 2023-10-13 12:33:07,321 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1383461.3333333333, ans=0.1 2023-10-13 12:33:12,884 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.99 vs. limit=15.0 2023-10-13 12:33:20,425 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.52 vs. limit=22.5 2023-10-13 12:33:39,032 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 12:33:46,425 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1383648.0, ans=0.0 2023-10-13 12:33:47,374 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1383648.0, ans=0.0 2023-10-13 12:33:51,948 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.434e+02 1.758e+02 1.946e+02 2.173e+02 3.220e+02, threshold=3.892e+02, percent-clipped=0.0 2023-10-13 12:34:04,768 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1383694.6666666667, ans=0.125 2023-10-13 12:34:05,531 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1383694.6666666667, ans=0.0 2023-10-13 12:34:13,976 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.78 vs. limit=5.0 2023-10-13 12:35:03,510 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1383928.0, ans=0.0 2023-10-13 12:35:05,626 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1383928.0, ans=0.125 2023-10-13 12:35:08,045 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1383928.0, ans=0.09899494936611666 2023-10-13 12:35:09,970 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1383974.6666666667, ans=0.0 2023-10-13 12:35:10,041 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1383974.6666666667, ans=0.0 2023-10-13 12:35:17,821 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.74 vs. limit=15.0 2023-10-13 12:35:20,617 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.27 vs. 
limit=6.0 2023-10-13 12:35:24,632 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1384021.3333333333, ans=0.2 2023-10-13 12:35:49,141 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.746e+02 1.893e+02 2.072e+02 2.744e+02, threshold=3.785e+02, percent-clipped=0.0 2023-10-13 12:36:15,886 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.18 vs. limit=12.0 2023-10-13 12:36:21,133 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1384254.6666666667, ans=0.5 2023-10-13 12:36:51,367 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.27 vs. limit=12.0 2023-10-13 12:36:53,835 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1384394.6666666667, ans=0.0 2023-10-13 12:37:01,856 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1384394.6666666667, ans=0.125 2023-10-13 12:37:06,867 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1384441.3333333333, ans=0.125 2023-10-13 12:37:09,429 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1384441.3333333333, ans=0.2 2023-10-13 12:37:37,006 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1384581.3333333333, ans=0.1 2023-10-13 12:37:38,877 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1384581.3333333333, ans=0.2 2023-10-13 12:37:41,641 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.715e+02 1.819e+02 2.000e+02 2.625e+02, threshold=3.637e+02, percent-clipped=0.0 2023-10-13 12:37:47,488 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1384628.0, ans=0.0 2023-10-13 12:38:42,948 INFO [train.py:1031] (0/4) Epoch 22, batch 10000, loss[loss=0.1764, simple_loss=0.2707, pruned_loss=0.04108, over 16261.00 frames. ], tot_loss[loss=0.1884, simple_loss=0.2802, pruned_loss=0.04828, over 32581260.82 frames. ], batch size: 50, lr: 1.56e-03, grad_scale: 16.0 2023-10-13 12:38:44,574 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.84 vs. 
limit=10.0 2023-10-13 12:38:46,778 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1384861.3333333333, ans=0.2 2023-10-13 12:38:48,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1384861.3333333333, ans=0.015 2023-10-13 12:38:53,021 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1384908.0, ans=10.0 2023-10-13 12:39:01,762 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1384908.0, ans=0.125 2023-10-13 12:39:06,323 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1384954.6666666667, ans=0.2 2023-10-13 12:39:36,203 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.546e+02 1.783e+02 1.937e+02 2.133e+02 3.265e+02, threshold=3.874e+02, percent-clipped=0.0 2023-10-13 12:39:40,283 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1385048.0, ans=0.0 2023-10-13 12:39:49,986 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1385094.6666666667, ans=0.125 2023-10-13 12:40:23,345 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.58 vs. limit=6.0 2023-10-13 12:40:25,855 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1385234.6666666667, ans=0.125 2023-10-13 12:40:27,312 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1385234.6666666667, ans=0.07 2023-10-13 12:40:37,704 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.82 vs. 
limit=5.0 2023-10-13 12:40:47,019 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1385281.3333333333, ans=0.0 2023-10-13 12:40:47,857 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1385281.3333333333, ans=0.2 2023-10-13 12:40:49,844 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1385328.0, ans=0.125 2023-10-13 12:40:57,590 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1385328.0, ans=0.125 2023-10-13 12:41:03,415 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1385374.6666666667, ans=0.0 2023-10-13 12:41:10,674 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1385374.6666666667, ans=0.125 2023-10-13 12:41:45,243 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.791e+02 1.981e+02 2.268e+02 3.504e+02, threshold=3.962e+02, percent-clipped=0.0 2023-10-13 12:41:47,949 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 12:41:49,079 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.89 vs. limit=15.0 2023-10-13 12:41:54,795 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1385561.3333333333, ans=0.125 2023-10-13 12:42:01,964 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1385608.0, ans=0.125 2023-10-13 12:42:02,938 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1385608.0, ans=0.0 2023-10-13 12:42:25,398 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1385701.3333333333, ans=0.125 2023-10-13 12:42:29,009 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1385701.3333333333, ans=0.125 2023-10-13 12:42:32,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1385701.3333333333, ans=0.125 2023-10-13 12:42:36,780 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1385748.0, ans=0.0 2023-10-13 12:42:37,798 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1385748.0, ans=0.0 2023-10-13 12:42:47,657 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1385794.6666666667, ans=0.0 2023-10-13 12:42:48,907 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1385794.6666666667, ans=0.0 2023-10-13 12:42:58,307 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1385794.6666666667, ans=0.125 2023-10-13 12:43:14,288 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1385841.3333333333, ans=0.1 2023-10-13 12:43:29,345 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.27 vs. limit=15.0 2023-10-13 12:43:38,603 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1385934.6666666667, ans=0.07 2023-10-13 12:43:47,798 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1385981.3333333333, ans=0.125 2023-10-13 12:43:49,454 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.850e+02 1.977e+02 2.167e+02 1.134e+03, threshold=3.954e+02, percent-clipped=1.0 2023-10-13 12:44:17,120 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.19 vs. limit=15.0 2023-10-13 12:44:53,181 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1386214.6666666667, ans=0.1 2023-10-13 12:44:57,341 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1386261.3333333333, ans=0.125 2023-10-13 12:45:09,333 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1386308.0, ans=0.09899494936611666 2023-10-13 12:45:56,346 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.562e+02 1.723e+02 1.882e+02 2.099e+02 3.050e+02, threshold=3.764e+02, percent-clipped=0.0 2023-10-13 12:46:04,156 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1386494.6666666667, ans=0.0 2023-10-13 12:46:27,022 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1386588.0, ans=0.125 2023-10-13 12:46:33,240 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.05 vs. limit=10.0 2023-10-13 12:46:54,768 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1386681.3333333333, ans=0.0 2023-10-13 12:46:56,541 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.81 vs. limit=15.0 2023-10-13 12:47:10,902 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1386728.0, ans=0.125 2023-10-13 12:47:37,365 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1386821.3333333333, ans=0.125 2023-10-13 12:47:50,160 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.42 vs. 
limit=6.0 2023-10-13 12:48:07,396 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 1.727e+02 1.862e+02 2.093e+02 2.878e+02, threshold=3.725e+02, percent-clipped=0.0 2023-10-13 12:48:23,012 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1387008.0, ans=0.125 2023-10-13 12:48:34,871 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1387054.6666666667, ans=0.05 2023-10-13 12:48:38,968 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.90 vs. limit=15.0 2023-10-13 12:48:43,323 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_positive, batch_count=1387054.6666666667, ans=0.05 2023-10-13 12:48:46,883 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1387101.3333333333, ans=0.125 2023-10-13 12:49:06,541 INFO [train.py:1031] (0/4) Epoch 22, batch 10500, loss[loss=0.1841, simple_loss=0.2781, pruned_loss=0.04503, over 16240.00 frames. ], tot_loss[loss=0.1888, simple_loss=0.2808, pruned_loss=0.04843, over 32634328.29 frames. ], batch size: 50, lr: 1.56e-03, grad_scale: 16.0 2023-10-13 12:49:22,563 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1387241.3333333333, ans=0.09899494936611666 2023-10-13 12:49:34,900 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.30 vs. limit=15.0 2023-10-13 12:49:37,553 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1387288.0, ans=0.0 2023-10-13 12:49:42,649 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1387334.6666666667, ans=0.1 2023-10-13 12:49:46,958 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.85 vs. limit=22.5 2023-10-13 12:49:57,457 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.727e+02 1.871e+02 2.053e+02 2.736e+02, threshold=3.741e+02, percent-clipped=0.0 2023-10-13 12:49:59,682 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1387381.3333333333, ans=0.125 2023-10-13 12:50:32,287 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1387521.3333333333, ans=0.125 2023-10-13 12:50:37,668 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1387521.3333333333, ans=0.0 2023-10-13 12:51:01,144 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1387614.6666666667, ans=0.0 2023-10-13 12:51:03,070 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1387614.6666666667, ans=0.125 2023-10-13 12:51:15,898 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.12 vs. 
limit=6.0 2023-10-13 12:51:29,170 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1387708.0, ans=0.125 2023-10-13 12:51:30,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1387708.0, ans=0.125 2023-10-13 12:51:32,151 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1387708.0, ans=0.0 2023-10-13 12:51:35,349 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1387754.6666666667, ans=0.125 2023-10-13 12:51:38,505 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.19 vs. limit=15.0 2023-10-13 12:51:41,994 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1387754.6666666667, ans=0.125 2023-10-13 12:52:06,423 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.477e+02 1.784e+02 1.885e+02 2.137e+02 2.709e+02, threshold=3.770e+02, percent-clipped=0.0 2023-10-13 12:52:06,818 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1387848.0, ans=0.0 2023-10-13 12:52:11,339 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=1387894.6666666667, ans=10.0 2023-10-13 12:52:47,934 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1388034.6666666667, ans=0.2 2023-10-13 12:53:20,461 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1388128.0, ans=0.125 2023-10-13 12:53:59,558 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1388268.0, ans=0.2 2023-10-13 12:54:06,462 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1388314.6666666667, ans=0.0 2023-10-13 12:54:10,690 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.388e+02 1.818e+02 2.018e+02 2.307e+02 3.203e+02, threshold=4.035e+02, percent-clipped=0.0 2023-10-13 12:54:12,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1388314.6666666667, ans=0.1 2023-10-13 12:54:18,440 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1388361.3333333333, ans=0.125 2023-10-13 12:54:59,421 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1388501.3333333333, ans=0.05 2023-10-13 12:55:07,320 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1388548.0, ans=0.0 2023-10-13 12:55:17,135 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.47 vs. 
limit=22.5 2023-10-13 12:55:18,707 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1388594.6666666667, ans=0.0 2023-10-13 12:55:19,774 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1388594.6666666667, ans=0.1 2023-10-13 12:55:24,344 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 12:55:27,783 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1388641.3333333333, ans=0.125 2023-10-13 12:55:29,679 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1388641.3333333333, ans=0.125 2023-10-13 12:55:45,954 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1388688.0, ans=0.125 2023-10-13 12:55:46,138 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.62 vs. limit=15.0 2023-10-13 12:56:08,657 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.472e+02 1.796e+02 2.061e+02 2.325e+02 3.179e+02, threshold=4.122e+02, percent-clipped=0.0 2023-10-13 12:56:49,693 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1388968.0, ans=0.125 2023-10-13 12:57:02,023 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.33 vs. limit=22.5 2023-10-13 12:57:10,000 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1389061.3333333333, ans=0.2 2023-10-13 12:57:40,289 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1389154.6666666667, ans=0.125 2023-10-13 12:58:05,744 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.372e+02 1.684e+02 1.875e+02 2.138e+02 2.961e+02, threshold=3.749e+02, percent-clipped=0.0 2023-10-13 12:58:10,208 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1389294.6666666667, ans=0.125 2023-10-13 12:58:23,657 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1389341.3333333333, ans=0.125 2023-10-13 12:58:38,785 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1389388.0, ans=0.2 2023-10-13 12:58:47,083 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1389434.6666666667, ans=0.1 2023-10-13 12:59:06,718 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1389481.3333333333, ans=0.1 2023-10-13 12:59:09,940 INFO [train.py:1031] (0/4) Epoch 22, batch 11000, loss[loss=0.2172, simple_loss=0.3008, pruned_loss=0.06677, over 16084.00 frames. ], tot_loss[loss=0.189, simple_loss=0.2808, pruned_loss=0.04857, over 32653394.26 frames. 
], batch size: 296, lr: 1.56e-03, grad_scale: 8.0 2023-10-13 12:59:18,466 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1389528.0, ans=0.0 2023-10-13 12:59:38,443 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1389621.3333333333, ans=0.0 2023-10-13 12:59:40,141 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1389621.3333333333, ans=0.125 2023-10-13 13:00:08,100 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.797e+02 1.991e+02 2.175e+02 2.656e+02, threshold=3.981e+02, percent-clipped=0.0 2023-10-13 13:00:17,041 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.61 vs. limit=15.0 2023-10-13 13:00:22,561 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1389808.0, ans=0.125 2023-10-13 13:00:42,080 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1389854.6666666667, ans=0.125 2023-10-13 13:00:49,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1389901.3333333333, ans=0.125 2023-10-13 13:01:51,876 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1390088.0, ans=0.125 2023-10-13 13:01:55,557 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1390134.6666666667, ans=0.2 2023-10-13 13:01:55,872 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.94 vs. limit=6.0 2023-10-13 13:01:59,681 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 13:02:07,537 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1390134.6666666667, ans=0.125 2023-10-13 13:02:17,400 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1390181.3333333333, ans=0.0 2023-10-13 13:02:17,459 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1390181.3333333333, ans=0.125 2023-10-13 13:02:19,100 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.451e+02 1.746e+02 1.993e+02 2.338e+02 3.731e+02, threshold=3.986e+02, percent-clipped=0.0 2023-10-13 13:02:41,829 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=1390274.6666666667, ans=0.5 2023-10-13 13:02:52,299 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.58 vs. 
limit=15.0 2023-10-13 13:03:02,751 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1390368.0, ans=0.05 2023-10-13 13:03:05,278 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1390368.0, ans=0.125 2023-10-13 13:03:18,973 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1390414.6666666667, ans=0.125 2023-10-13 13:03:26,574 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1390461.3333333333, ans=0.125 2023-10-13 13:03:35,469 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=1390461.3333333333, ans=10.0 2023-10-13 13:04:22,636 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.706e+02 1.875e+02 2.085e+02 2.870e+02, threshold=3.749e+02, percent-clipped=0.0 2023-10-13 13:04:26,449 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1390694.6666666667, ans=0.125 2023-10-13 13:04:56,464 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1390788.0, ans=0.0 2023-10-13 13:05:19,920 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1390881.3333333333, ans=0.05 2023-10-13 13:05:36,032 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1390881.3333333333, ans=0.125 2023-10-13 13:05:50,761 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1390974.6666666667, ans=0.1 2023-10-13 13:06:09,666 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1391021.3333333333, ans=0.2 2023-10-13 13:06:16,982 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1391068.0, ans=0.125 2023-10-13 13:06:37,980 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.464e+02 1.712e+02 1.890e+02 2.236e+02 3.127e+02, threshold=3.781e+02, percent-clipped=0.0 2023-10-13 13:06:41,866 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1391161.3333333333, ans=0.1 2023-10-13 13:06:41,879 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1391161.3333333333, ans=0.1 2023-10-13 13:06:55,887 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1391208.0, ans=0.125 2023-10-13 13:07:22,275 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.64 vs. 
limit=22.5 2023-10-13 13:07:53,153 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1391441.3333333333, ans=0.125 2023-10-13 13:08:20,372 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1391534.6666666667, ans=0.125 2023-10-13 13:08:22,994 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 13:08:25,961 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1391534.6666666667, ans=0.125 2023-10-13 13:08:37,572 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1391581.3333333333, ans=0.1 2023-10-13 13:08:44,167 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.518e+02 1.794e+02 1.956e+02 2.153e+02 2.781e+02, threshold=3.912e+02, percent-clipped=0.0 2023-10-13 13:08:49,956 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1391628.0, ans=0.1 2023-10-13 13:08:55,319 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.50 vs. limit=6.0 2023-10-13 13:09:32,518 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.24 vs. limit=15.0 2023-10-13 13:09:47,524 INFO [train.py:1031] (0/4) Epoch 22, batch 11500, loss[loss=0.1973, simple_loss=0.2898, pruned_loss=0.05237, over 16022.00 frames. ], tot_loss[loss=0.1886, simple_loss=0.2803, pruned_loss=0.04847, over 32659450.52 frames. ], batch size: 43, lr: 1.56e-03, grad_scale: 16.0 2023-10-13 13:09:58,012 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1391861.3333333333, ans=0.125 2023-10-13 13:09:58,402 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.38 vs. limit=6.0 2023-10-13 13:10:15,214 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1391954.6666666667, ans=0.09899494936611666 2023-10-13 13:10:44,268 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.768e+02 1.954e+02 2.200e+02 3.243e+02, threshold=3.907e+02, percent-clipped=0.0 2023-10-13 13:10:50,755 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1392094.6666666667, ans=0.125 2023-10-13 13:11:17,012 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1392188.0, ans=0.0 2023-10-13 13:11:30,427 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.73 vs. limit=15.0 2023-10-13 13:11:43,685 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.69 vs. limit=15.0 2023-10-13 13:12:04,047 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.74 vs. 
limit=12.0 2023-10-13 13:12:42,022 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1392514.6666666667, ans=0.125 2023-10-13 13:12:49,534 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.763e+02 2.002e+02 2.307e+02 3.774e+02, threshold=4.003e+02, percent-clipped=0.0 2023-10-13 13:13:00,073 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1392561.3333333333, ans=0.07 2023-10-13 13:13:04,391 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 13:13:10,159 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1392608.0, ans=0.1 2023-10-13 13:13:22,036 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.83 vs. limit=6.0 2023-10-13 13:14:00,304 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 13:14:05,755 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1392841.3333333333, ans=0.125 2023-10-13 13:14:48,144 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.343e+02 1.870e+02 2.107e+02 2.456e+02 3.402e+02, threshold=4.215e+02, percent-clipped=0.0 2023-10-13 13:14:55,807 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.37 vs. limit=15.0 2023-10-13 13:15:05,853 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1393028.0, ans=0.0 2023-10-13 13:15:07,585 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1393028.0, ans=0.125 2023-10-13 13:15:16,750 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1393074.6666666667, ans=0.1 2023-10-13 13:15:34,414 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1393121.3333333333, ans=0.1 2023-10-13 13:15:49,196 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1393214.6666666667, ans=0.0 2023-10-13 13:16:04,209 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1393261.3333333333, ans=10.0 2023-10-13 13:16:04,253 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1393261.3333333333, ans=0.1 2023-10-13 13:16:16,543 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.20 vs. limit=15.0 2023-10-13 13:16:45,479 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.30 vs. 
limit=6.0 2023-10-13 13:17:05,584 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 1.746e+02 1.911e+02 2.144e+02 3.347e+02, threshold=3.821e+02, percent-clipped=0.0 2023-10-13 13:17:14,771 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1393494.6666666667, ans=0.1 2023-10-13 13:17:35,847 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1393588.0, ans=0.125 2023-10-13 13:17:50,599 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1393634.6666666667, ans=0.125 2023-10-13 13:18:17,390 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1393728.0, ans=0.125 2023-10-13 13:18:21,839 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1393728.0, ans=0.0 2023-10-13 13:18:33,063 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1393774.6666666667, ans=0.125 2023-10-13 13:18:37,209 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1393821.3333333333, ans=0.125 2023-10-13 13:18:37,560 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.61 vs. limit=15.0 2023-10-13 13:18:38,334 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1393821.3333333333, ans=0.125 2023-10-13 13:18:41,055 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1393821.3333333333, ans=0.2 2023-10-13 13:18:59,144 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1393868.0, ans=0.2 2023-10-13 13:19:08,222 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1393914.6666666667, ans=0.125 2023-10-13 13:19:13,190 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1393914.6666666667, ans=0.0 2023-10-13 13:19:14,914 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.424e+02 1.854e+02 2.030e+02 2.223e+02 3.205e+02, threshold=4.059e+02, percent-clipped=0.0 2023-10-13 13:19:17,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1393961.3333333333, ans=0.125 2023-10-13 13:19:24,750 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1393961.3333333333, ans=0.125 2023-10-13 13:19:31,160 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1394008.0, ans=0.125 2023-10-13 13:19:41,675 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1394054.6666666667, ans=0.5 2023-10-13 13:19:47,013 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.25 vs. 
limit=15.0 2023-10-13 13:19:52,115 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1394101.3333333333, ans=0.1 2023-10-13 13:20:02,713 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1394101.3333333333, ans=0.0 2023-10-13 13:20:16,298 INFO [train.py:1031] (0/4) Epoch 22, batch 12000, loss[loss=0.1733, simple_loss=0.2746, pruned_loss=0.03596, over 16974.00 frames. ], tot_loss[loss=0.1884, simple_loss=0.2805, pruned_loss=0.04814, over 32715910.14 frames. ], batch size: 93, lr: 1.55e-03, grad_scale: 32.0 2023-10-13 13:20:34,506 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1394241.3333333333, ans=0.04949747468305833 2023-10-13 13:21:02,175 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1394334.6666666667, ans=0.0 2023-10-13 13:21:08,021 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1394381.3333333333, ans=0.1 2023-10-13 13:21:14,857 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.790e+02 1.932e+02 2.111e+02 2.818e+02, threshold=3.864e+02, percent-clipped=0.0 2023-10-13 13:21:39,528 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.21 vs. limit=15.0 2023-10-13 13:21:43,560 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1394521.3333333333, ans=0.0 2023-10-13 13:21:55,602 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1394568.0, ans=0.125 2023-10-13 13:22:01,773 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1394568.0, ans=0.2 2023-10-13 13:22:06,707 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.33 vs. limit=15.0 2023-10-13 13:22:13,499 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1394614.6666666667, ans=0.125 2023-10-13 13:22:22,556 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1394661.3333333333, ans=0.0 2023-10-13 13:22:34,403 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.76 vs. limit=15.0 2023-10-13 13:22:45,630 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1394754.6666666667, ans=0.0 2023-10-13 13:22:54,180 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1394801.3333333333, ans=0.125 2023-10-13 13:22:56,196 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.20 vs. 
limit=12.0 2023-10-13 13:23:07,937 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1394848.0, ans=0.125 2023-10-13 13:23:09,489 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1394848.0, ans=0.0 2023-10-13 13:23:10,230 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.422e+02 1.722e+02 1.945e+02 2.246e+02 3.327e+02, threshold=3.890e+02, percent-clipped=0.0 2023-10-13 13:23:12,411 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1394894.6666666667, ans=0.05 2023-10-13 13:23:16,721 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1394894.6666666667, ans=0.0 2023-10-13 13:23:26,836 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1394941.3333333333, ans=0.125 2023-10-13 13:23:27,810 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1394941.3333333333, ans=0.0 2023-10-13 13:23:46,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1394988.0, ans=0.125 2023-10-13 13:23:49,550 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1394988.0, ans=0.125 2023-10-13 13:24:29,580 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1395174.6666666667, ans=0.0 2023-10-13 13:24:32,123 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1395174.6666666667, ans=0.125 2023-10-13 13:24:33,911 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.24 vs. limit=6.0 2023-10-13 13:24:40,244 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.70 vs. limit=22.5 2023-10-13 13:24:53,170 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1395268.0, ans=0.125 2023-10-13 13:25:13,370 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1395314.6666666667, ans=0.125 2023-10-13 13:25:13,853 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.559e+02 1.920e+02 2.061e+02 2.293e+02 5.987e+02, threshold=4.122e+02, percent-clipped=1.0 2023-10-13 13:25:24,023 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1395408.0, ans=0.125 2023-10-13 13:25:30,285 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.43 vs. 
limit=6.0 2023-10-13 13:25:31,662 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1395408.0, ans=0.0 2023-10-13 13:25:40,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1395454.6666666667, ans=0.125 2023-10-13 13:26:23,469 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1395641.3333333333, ans=0.125 2023-10-13 13:27:07,982 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.556e+02 1.786e+02 2.021e+02 2.384e+02 3.580e+02, threshold=4.042e+02, percent-clipped=0.0 2023-10-13 13:27:10,236 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1395828.0, ans=0.125 2023-10-13 13:27:21,502 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1395874.6666666667, ans=0.125 2023-10-13 13:27:35,356 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1395874.6666666667, ans=0.125 2023-10-13 13:27:45,517 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1395921.3333333333, ans=0.1 2023-10-13 13:27:58,961 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1396014.6666666667, ans=0.125 2023-10-13 13:28:09,735 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1396014.6666666667, ans=0.0 2023-10-13 13:28:12,272 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1396061.3333333333, ans=0.125 2023-10-13 13:28:33,315 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1396108.0, ans=0.2 2023-10-13 13:28:56,656 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1396201.3333333333, ans=0.1 2023-10-13 13:28:56,708 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1396201.3333333333, ans=0.5 2023-10-13 13:29:14,353 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.523e+02 1.823e+02 2.014e+02 2.328e+02 3.441e+02, threshold=4.027e+02, percent-clipped=0.0 2023-10-13 13:29:19,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1396294.6666666667, ans=0.1 2023-10-13 13:29:34,498 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1396341.3333333333, ans=0.0 2023-10-13 13:29:51,268 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.73 vs. limit=15.0 2023-10-13 13:30:16,223 INFO [train.py:1031] (0/4) Epoch 22, batch 12500, loss[loss=0.1781, simple_loss=0.2736, pruned_loss=0.04128, over 16905.00 frames. ], tot_loss[loss=0.1882, simple_loss=0.2801, pruned_loss=0.04812, over 32729992.41 frames. 
], batch size: 138, lr: 1.55e-03, grad_scale: 32.0 2023-10-13 13:30:21,237 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1396528.0, ans=0.1 2023-10-13 13:30:37,758 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.12 vs. limit=15.0 2023-10-13 13:30:37,783 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.59 vs. limit=22.5 2023-10-13 13:30:49,152 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1396621.3333333333, ans=10.0 2023-10-13 13:30:56,352 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1396668.0, ans=0.125 2023-10-13 13:31:01,401 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.80 vs. limit=15.0 2023-10-13 13:31:03,909 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.09 vs. limit=22.5 2023-10-13 13:31:06,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1396714.6666666667, ans=0.0 2023-10-13 13:31:17,117 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.506e+02 1.752e+02 1.864e+02 2.137e+02 2.634e+02, threshold=3.727e+02, percent-clipped=0.0 2023-10-13 13:31:54,166 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1396901.3333333333, ans=0.0 2023-10-13 13:31:54,227 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1396901.3333333333, ans=0.1 2023-10-13 13:32:07,823 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1396948.0, ans=0.125 2023-10-13 13:32:08,327 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.96 vs. limit=22.5 2023-10-13 13:32:10,722 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.73 vs. limit=10.0 2023-10-13 13:32:46,049 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1397088.0, ans=0.0 2023-10-13 13:32:58,477 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1397134.6666666667, ans=0.0 2023-10-13 13:33:09,133 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1397181.3333333333, ans=0.125 2023-10-13 13:33:16,850 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.753e+02 1.913e+02 2.099e+02 2.465e+02, threshold=3.827e+02, percent-clipped=0.0 2023-10-13 13:33:36,983 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.30 vs. 
limit=15.0 2023-10-13 13:33:49,749 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1397321.3333333333, ans=0.125 2023-10-13 13:34:28,569 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 13:34:49,160 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1397554.6666666667, ans=0.125 2023-10-13 13:35:20,129 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.516e+02 1.748e+02 1.921e+02 2.149e+02 3.259e+02, threshold=3.841e+02, percent-clipped=0.0 2023-10-13 13:35:24,790 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1397694.6666666667, ans=0.125 2023-10-13 13:35:28,926 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1397694.6666666667, ans=0.05 2023-10-13 13:35:42,249 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1397788.0, ans=0.0 2023-10-13 13:35:48,922 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1397788.0, ans=0.0 2023-10-13 13:35:49,016 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1397788.0, ans=0.0 2023-10-13 13:35:56,571 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1397834.6666666667, ans=0.2 2023-10-13 13:35:57,830 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1397834.6666666667, ans=0.125 2023-10-13 13:36:17,859 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1397928.0, ans=0.0 2023-10-13 13:36:27,404 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1397928.0, ans=0.035 2023-10-13 13:36:38,033 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1397974.6666666667, ans=0.125 2023-10-13 13:36:39,828 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1397974.6666666667, ans=0.125 2023-10-13 13:36:58,028 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1398068.0, ans=0.0 2023-10-13 13:37:01,559 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.21 vs. limit=15.0 2023-10-13 13:37:20,845 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.468e+02 1.805e+02 2.046e+02 2.246e+02 3.044e+02, threshold=4.092e+02, percent-clipped=0.0 2023-10-13 13:37:30,888 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1398161.3333333333, ans=10.0 2023-10-13 13:37:36,044 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.92 vs. 
limit=15.0 2023-10-13 13:37:49,222 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1398254.6666666667, ans=0.125 2023-10-13 13:37:56,904 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1398301.3333333333, ans=0.0 2023-10-13 13:38:02,297 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1398301.3333333333, ans=0.125 2023-10-13 13:38:22,710 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.24 vs. limit=15.0 2023-10-13 13:38:25,062 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1398394.6666666667, ans=0.0 2023-10-13 13:38:36,871 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 13:38:49,641 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1398488.0, ans=0.1 2023-10-13 13:39:08,276 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1398581.3333333333, ans=0.1 2023-10-13 13:39:20,680 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.434e+02 1.727e+02 1.864e+02 2.167e+02 2.849e+02, threshold=3.729e+02, percent-clipped=0.0 2023-10-13 13:39:29,110 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1398628.0, ans=0.0 2023-10-13 13:39:59,047 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1398768.0, ans=0.2 2023-10-13 13:40:12,680 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1398814.6666666667, ans=0.2 2023-10-13 13:40:14,483 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1398814.6666666667, ans=0.125 2023-10-13 13:40:17,185 INFO [train.py:1031] (0/4) Epoch 22, batch 13000, loss[loss=0.1954, simple_loss=0.2873, pruned_loss=0.0517, over 16905.00 frames. ], tot_loss[loss=0.1885, simple_loss=0.2808, pruned_loss=0.04811, over 32769009.50 frames. ], batch size: 77, lr: 1.55e-03, grad_scale: 16.0 2023-10-13 13:40:25,198 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1398861.3333333333, ans=0.1 2023-10-13 13:40:49,190 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1398954.6666666667, ans=0.125 2023-10-13 13:40:49,200 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1398954.6666666667, ans=0.125 2023-10-13 13:40:57,208 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.41 vs. 
limit=12.0 2023-10-13 13:41:01,614 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1399001.3333333333, ans=0.0 2023-10-13 13:41:10,876 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1399001.3333333333, ans=0.2 2023-10-13 13:41:23,966 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1399048.0, ans=0.5 2023-10-13 13:41:30,123 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.414e+02 1.800e+02 1.954e+02 2.071e+02 3.098e+02, threshold=3.907e+02, percent-clipped=0.0 2023-10-13 13:41:36,977 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.58 vs. limit=12.0 2023-10-13 13:41:38,318 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1399094.6666666667, ans=0.125 2023-10-13 13:41:55,553 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=1399188.0, ans=0.95 2023-10-13 13:41:58,997 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1399188.0, ans=0.2 2023-10-13 13:42:06,472 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1399234.6666666667, ans=0.125 2023-10-13 13:42:41,800 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1399374.6666666667, ans=0.1 2023-10-13 13:42:46,330 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1399374.6666666667, ans=0.125 2023-10-13 13:42:52,689 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1399421.3333333333, ans=0.0 2023-10-13 13:42:54,170 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.87 vs. limit=15.0 2023-10-13 13:43:02,159 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1399421.3333333333, ans=0.0 2023-10-13 13:43:05,114 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.75 vs. limit=15.0 2023-10-13 13:43:27,544 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1399561.3333333333, ans=0.125 2023-10-13 13:43:30,751 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.518e+02 1.806e+02 1.982e+02 2.182e+02 3.367e+02, threshold=3.964e+02, percent-clipped=0.0 2023-10-13 13:43:32,587 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.13 vs. limit=6.0 2023-10-13 13:43:33,176 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1399561.3333333333, ans=0.0 2023-10-13 13:43:35,525 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=5.50 vs. 
limit=15.0 2023-10-13 13:44:07,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1399701.3333333333, ans=0.125 2023-10-13 13:44:32,200 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.46 vs. limit=15.0 2023-10-13 13:44:46,800 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1399794.6666666667, ans=0.1 2023-10-13 13:44:52,862 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1399841.3333333333, ans=0.07 2023-10-13 13:44:57,349 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.31 vs. limit=10.0 2023-10-13 13:45:04,724 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1399888.0, ans=0.125 2023-10-13 13:45:37,448 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.452e+02 1.755e+02 1.956e+02 2.155e+02 2.813e+02, threshold=3.912e+02, percent-clipped=0.0 2023-10-13 13:45:38,823 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1400028.0, ans=0.125 2023-10-13 13:46:08,809 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-13 13:46:12,512 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1400168.0, ans=0.125 2023-10-13 13:46:15,241 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1400168.0, ans=0.1 2023-10-13 13:46:40,829 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.79 vs. limit=6.0 2023-10-13 13:46:41,667 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1400261.3333333333, ans=0.09899494936611666 2023-10-13 13:46:55,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1400308.0, ans=0.0 2023-10-13 13:46:58,698 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1400354.6666666667, ans=0.0 2023-10-13 13:46:58,709 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1400354.6666666667, ans=0.0 2023-10-13 13:47:15,351 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1400401.3333333333, ans=0.1 2023-10-13 13:47:35,517 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.518e+02 1.785e+02 1.961e+02 2.123e+02 2.663e+02, threshold=3.922e+02, percent-clipped=0.0 2023-10-13 13:48:14,235 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1400634.6666666667, ans=0.035 2023-10-13 13:49:07,936 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.21 vs. 
limit=15.0 2023-10-13 13:49:28,625 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1400961.3333333333, ans=0.2 2023-10-13 13:49:30,914 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.708e+02 1.866e+02 2.086e+02 2.738e+02, threshold=3.731e+02, percent-clipped=0.0 2023-10-13 13:49:48,574 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1401008.0, ans=0.0 2023-10-13 13:49:48,636 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1401008.0, ans=0.1 2023-10-13 13:49:49,751 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1401008.0, ans=0.125 2023-10-13 13:49:52,671 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1401054.6666666667, ans=0.125 2023-10-13 13:50:08,226 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1401101.3333333333, ans=0.125 2023-10-13 13:50:23,986 INFO [train.py:1031] (0/4) Epoch 22, batch 13500, loss[loss=0.1922, simple_loss=0.2824, pruned_loss=0.05101, over 16893.00 frames. ], tot_loss[loss=0.1882, simple_loss=0.2802, pruned_loss=0.04807, over 32768892.50 frames. ], batch size: 165, lr: 1.55e-03, grad_scale: 32.0 2023-10-13 13:50:26,914 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1401194.6666666667, ans=0.0 2023-10-13 13:50:37,397 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1401241.3333333333, ans=0.125 2023-10-13 13:50:44,331 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1401288.0, ans=0.125 2023-10-13 13:50:46,627 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.91 vs. limit=15.0 2023-10-13 13:50:49,212 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=1401288.0, ans=0.5 2023-10-13 13:50:51,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1401288.0, ans=0.0 2023-10-13 13:50:53,327 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=5.07 vs. limit=12.0 2023-10-13 13:51:07,209 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1401334.6666666667, ans=0.125 2023-10-13 13:51:09,153 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1401334.6666666667, ans=0.0 2023-10-13 13:51:22,644 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.472e+02 1.796e+02 1.980e+02 2.159e+02 2.705e+02, threshold=3.959e+02, percent-clipped=0.0 2023-10-13 13:51:31,402 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=4.78 vs. 
limit=15.0 2023-10-13 13:51:38,428 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1401474.6666666667, ans=0.125 2023-10-13 13:51:48,253 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.21 vs. limit=15.0 2023-10-13 13:52:32,447 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1401708.0, ans=0.125 2023-10-13 13:52:34,276 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff3.min_abs, batch_count=1401708.0, ans=0.2 2023-10-13 13:53:04,476 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1401848.0, ans=0.125 2023-10-13 13:53:12,574 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.755e+02 1.915e+02 2.199e+02 3.264e+02, threshold=3.831e+02, percent-clipped=0.0 2023-10-13 13:53:14,413 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 13:53:18,648 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/epoch-22.pt 2023-10-13 13:53:53,229 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=5.56 vs. limit=15.0 2023-10-13 13:53:56,062 INFO [train.py:1031] (0/4) Epoch 23, batch 0, loss[loss=0.1756, simple_loss=0.26, pruned_loss=0.04558, over 15349.00 frames. ], tot_loss[loss=0.1756, simple_loss=0.26, pruned_loss=0.04558, over 15349.00 frames. ], batch size: 35, lr: 1.51e-03, grad_scale: 32.0 2023-10-13 13:53:56,064 INFO [train.py:1054] (0/4) Computing validation loss 2023-10-13 13:54:05,486 INFO [train.py:1063] (0/4) Epoch 23, validation: loss=0.2135, simple_loss=0.3003, pruned_loss=0.06333, over 1020973.00 frames. 2023-10-13 13:54:05,488 INFO [train.py:1064] (0/4) Maximum memory allocated so far is 17165MB 2023-10-13 13:54:08,528 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.35 vs. limit=15.0 2023-10-13 13:54:29,112 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1401964.6666666667, ans=0.125 2023-10-13 13:54:50,346 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.67 vs. 
limit=15.0 2023-10-13 13:55:23,339 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1402198.0, ans=0.0 2023-10-13 13:55:24,727 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1402198.0, ans=0.125 2023-10-13 13:55:30,822 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1402198.0, ans=0.125 2023-10-13 13:55:33,178 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1402198.0, ans=0.125 2023-10-13 13:55:37,675 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1402244.6666666667, ans=0.0 2023-10-13 13:55:51,660 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.00 vs. limit=22.5 2023-10-13 13:56:05,083 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.713e+02 1.876e+02 2.108e+02 3.132e+02, threshold=3.751e+02, percent-clipped=0.0 2023-10-13 13:56:13,318 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=1402384.6666666667, ans=0.05 2023-10-13 13:56:20,362 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1402431.3333333333, ans=0.125 2023-10-13 13:56:21,385 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1402431.3333333333, ans=0.125 2023-10-13 13:56:58,128 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1402571.3333333333, ans=0.125 2023-10-13 13:57:14,643 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1402618.0, ans=0.2 2023-10-13 13:57:14,927 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.15 vs. limit=15.0 2023-10-13 13:57:42,175 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.48 vs. limit=15.0 2023-10-13 13:57:44,411 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1402758.0, ans=0.125 2023-10-13 13:57:44,483 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1402758.0, ans=0.125 2023-10-13 13:57:57,135 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.458e+02 1.828e+02 1.987e+02 2.140e+02 3.875e+02, threshold=3.974e+02, percent-clipped=1.0 2023-10-13 13:58:02,268 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.14 vs. limit=22.5 2023-10-13 13:58:03,677 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.64 vs. 
limit=15.0 2023-10-13 13:58:27,915 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1402944.6666666667, ans=0.2 2023-10-13 13:58:30,222 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.78 vs. limit=15.0 2023-10-13 13:58:36,035 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1402991.3333333333, ans=0.0 2023-10-13 13:58:49,742 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1403038.0, ans=0.0 2023-10-13 13:58:53,577 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1403038.0, ans=0.0 2023-10-13 13:59:09,193 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1403084.6666666667, ans=0.125 2023-10-13 13:59:17,396 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1403131.3333333333, ans=0.0 2023-10-13 13:59:36,917 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1403224.6666666667, ans=0.2 2023-10-13 13:59:40,039 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1403224.6666666667, ans=0.2 2023-10-13 13:59:43,805 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1403271.3333333333, ans=0.125 2023-10-13 13:59:53,263 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.484e+02 1.756e+02 1.940e+02 2.111e+02 2.645e+02, threshold=3.880e+02, percent-clipped=0.0 2023-10-13 14:00:00,588 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1403318.0, ans=0.125 2023-10-13 14:00:08,460 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1403364.6666666667, ans=0.0 2023-10-13 14:00:10,186 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 14:00:17,318 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.43 vs. limit=15.0 2023-10-13 14:00:22,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1403411.3333333333, ans=0.2 2023-10-13 14:00:34,132 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1403458.0, ans=0.125 2023-10-13 14:00:37,187 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1403458.0, ans=0.1 2023-10-13 14:00:38,868 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.65 vs. 
limit=15.0 2023-10-13 14:00:57,611 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1403551.3333333333, ans=0.1 2023-10-13 14:01:01,301 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1403551.3333333333, ans=0.125 2023-10-13 14:01:17,547 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1403644.6666666667, ans=0.125 2023-10-13 14:01:34,549 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1403691.3333333333, ans=0.2 2023-10-13 14:01:37,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1403691.3333333333, ans=0.125 2023-10-13 14:01:39,872 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.96 vs. limit=15.0 2023-10-13 14:01:45,675 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.458e+02 1.881e+02 2.038e+02 2.202e+02 2.818e+02, threshold=4.075e+02, percent-clipped=0.0 2023-10-13 14:01:56,430 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1403784.6666666667, ans=0.125 2023-10-13 14:01:58,667 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1403784.6666666667, ans=0.1 2023-10-13 14:02:02,947 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1403831.3333333333, ans=0.1 2023-10-13 14:02:14,616 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1403878.0, ans=0.125 2023-10-13 14:02:17,685 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1403878.0, ans=0.0 2023-10-13 14:02:42,156 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1403971.3333333333, ans=0.0 2023-10-13 14:02:54,358 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1404018.0, ans=0.09899494936611666 2023-10-13 14:03:11,297 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.92 vs. limit=15.0 2023-10-13 14:03:15,762 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1404111.3333333333, ans=0.125 2023-10-13 14:03:35,996 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1404158.0, ans=0.0 2023-10-13 14:03:36,014 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1404158.0, ans=0.1 2023-10-13 14:03:39,488 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.01 vs. 
limit=15.0 2023-10-13 14:03:43,337 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1404204.6666666667, ans=0.1 2023-10-13 14:03:48,936 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.561e+02 1.807e+02 1.926e+02 2.078e+02 3.017e+02, threshold=3.851e+02, percent-clipped=0.0 2023-10-13 14:03:53,063 INFO [train.py:1031] (0/4) Epoch 23, batch 500, loss[loss=0.1563, simple_loss=0.2554, pruned_loss=0.02858, over 16841.00 frames. ], tot_loss[loss=0.1867, simple_loss=0.2789, pruned_loss=0.04727, over 7283458.02 frames. ], batch size: 72, lr: 1.51e-03, grad_scale: 32.0 2023-10-13 14:04:15,137 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1404344.6666666667, ans=0.125 2023-10-13 14:04:36,890 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1404391.3333333333, ans=0.05 2023-10-13 14:04:43,618 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.50 vs. limit=15.0 2023-10-13 14:04:48,313 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1404438.0, ans=0.125 2023-10-13 14:04:51,645 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 14:05:00,302 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.80 vs. limit=12.0 2023-10-13 14:05:21,268 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1404578.0, ans=0.125 2023-10-13 14:05:29,951 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1404624.6666666667, ans=0.0 2023-10-13 14:05:38,414 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1404624.6666666667, ans=0.0 2023-10-13 14:05:40,532 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1404671.3333333333, ans=0.125 2023-10-13 14:05:48,551 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.771e+02 1.937e+02 2.125e+02 2.637e+02, threshold=3.874e+02, percent-clipped=0.0 2023-10-13 14:05:51,679 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1404718.0, ans=0.1 2023-10-13 14:05:57,828 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1404718.0, ans=0.125 2023-10-13 14:05:58,167 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.67 vs. 
limit=12.0 2023-10-13 14:05:58,946 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1404718.0, ans=0.125 2023-10-13 14:06:04,327 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1404764.6666666667, ans=0.125 2023-10-13 14:06:10,988 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1404764.6666666667, ans=0.1 2023-10-13 14:06:15,592 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1404811.3333333333, ans=0.1 2023-10-13 14:06:19,781 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1404811.3333333333, ans=0.0 2023-10-13 14:06:46,176 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 14:07:00,645 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1404998.0, ans=0.1 2023-10-13 14:07:27,706 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=8.34 vs. limit=22.5 2023-10-13 14:07:40,188 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.801e+02 2.009e+02 2.314e+02 3.512e+02, threshold=4.019e+02, percent-clipped=0.0 2023-10-13 14:07:46,923 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1405184.6666666667, ans=0.0 2023-10-13 14:08:01,498 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.73 vs. limit=15.0 2023-10-13 14:08:14,241 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1405324.6666666667, ans=0.125 2023-10-13 14:09:15,069 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1405558.0, ans=0.0 2023-10-13 14:09:34,874 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.728e+02 1.872e+02 2.111e+02 2.709e+02, threshold=3.745e+02, percent-clipped=0.0 2023-10-13 14:09:40,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1405651.3333333333, ans=0.125 2023-10-13 14:09:43,947 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1405651.3333333333, ans=0.125 2023-10-13 14:09:56,947 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1405698.0, ans=0.2 2023-10-13 14:09:56,954 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1405698.0, ans=0.1 2023-10-13 14:10:02,458 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.25 vs. 
limit=12.0 2023-10-13 14:10:08,537 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 14:10:28,341 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.25 vs. limit=15.0 2023-10-13 14:10:47,519 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=1405884.6666666667, ans=0.95 2023-10-13 14:10:53,967 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1405884.6666666667, ans=0.1 2023-10-13 14:11:36,974 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 14:11:38,023 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1406071.3333333333, ans=0.125 2023-10-13 14:11:41,574 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.549e+02 1.793e+02 1.943e+02 2.050e+02 2.533e+02, threshold=3.887e+02, percent-clipped=0.0 2023-10-13 14:11:59,084 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1406164.6666666667, ans=0.2 2023-10-13 14:12:03,277 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1406164.6666666667, ans=0.125 2023-10-13 14:12:12,575 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.21 vs. limit=6.0 2023-10-13 14:12:18,186 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1406211.3333333333, ans=0.125 2023-10-13 14:12:19,343 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1406258.0, ans=0.1 2023-10-13 14:12:42,022 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1406304.6666666667, ans=0.1 2023-10-13 14:12:43,783 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1406304.6666666667, ans=0.125 2023-10-13 14:12:49,396 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1406351.3333333333, ans=0.125 2023-10-13 14:13:44,080 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.358e+02 1.685e+02 1.837e+02 2.058e+02 2.519e+02, threshold=3.674e+02, percent-clipped=0.0 2023-10-13 14:13:45,971 INFO [train.py:1031] (0/4) Epoch 23, batch 1000, loss[loss=0.1862, simple_loss=0.281, pruned_loss=0.04568, over 16704.00 frames. ], tot_loss[loss=0.1885, simple_loss=0.2807, pruned_loss=0.04815, over 12941463.85 frames. 
], batch size: 202, lr: 1.51e-03, grad_scale: 16.0 2023-10-13 14:13:49,290 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1406584.6666666667, ans=0.125 2023-10-13 14:14:04,695 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1406631.3333333333, ans=0.125 2023-10-13 14:14:06,927 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.79 vs. limit=15.0 2023-10-13 14:14:19,135 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1406724.6666666667, ans=0.0 2023-10-13 14:14:36,323 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1406771.3333333333, ans=0.125 2023-10-13 14:15:05,089 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1406911.3333333333, ans=0.125 2023-10-13 14:15:14,928 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1406958.0, ans=0.1 2023-10-13 14:15:16,160 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1406958.0, ans=0.0 2023-10-13 14:15:17,763 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1406958.0, ans=0.125 2023-10-13 14:15:32,045 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.438e+02 1.755e+02 1.983e+02 2.176e+02 2.671e+02, threshold=3.966e+02, percent-clipped=0.0 2023-10-13 14:15:32,805 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.05 vs. limit=15.0 2023-10-13 14:16:02,464 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1407144.6666666667, ans=0.1 2023-10-13 14:16:10,247 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1407144.6666666667, ans=0.0 2023-10-13 14:16:11,540 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1407144.6666666667, ans=0.125 2023-10-13 14:16:14,571 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1407191.3333333333, ans=0.125 2023-10-13 14:16:21,546 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1407191.3333333333, ans=0.1 2023-10-13 14:16:38,278 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1407284.6666666667, ans=0.05 2023-10-13 14:16:39,352 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1407284.6666666667, ans=0.07 2023-10-13 14:17:11,530 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.66 vs. 
limit=6.0 2023-10-13 14:17:21,327 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1407424.6666666667, ans=0.125 2023-10-13 14:17:22,214 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1407424.6666666667, ans=0.125 2023-10-13 14:17:24,376 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1407424.6666666667, ans=0.125 2023-10-13 14:17:25,915 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1407424.6666666667, ans=0.1 2023-10-13 14:17:28,510 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1407424.6666666667, ans=0.1 2023-10-13 14:17:41,377 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.428e+02 1.722e+02 1.946e+02 2.218e+02 2.802e+02, threshold=3.893e+02, percent-clipped=0.0 2023-10-13 14:17:46,173 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1407518.0, ans=0.2 2023-10-13 14:17:52,865 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1407518.0, ans=0.0 2023-10-13 14:18:06,676 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1407564.6666666667, ans=0.2 2023-10-13 14:18:35,372 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.85 vs. limit=15.0 2023-10-13 14:18:47,551 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1407751.3333333333, ans=0.5 2023-10-13 14:18:52,333 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1407798.0, ans=0.125 2023-10-13 14:19:24,541 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1407938.0, ans=0.125 2023-10-13 14:19:27,439 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1407938.0, ans=0.0 2023-10-13 14:19:28,529 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1407938.0, ans=0.0 2023-10-13 14:19:31,948 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=1407938.0, ans=0.5 2023-10-13 14:19:34,014 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.738e+02 1.885e+02 2.137e+02 3.505e+02, threshold=3.770e+02, percent-clipped=0.0 2023-10-13 14:19:34,332 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1407938.0, ans=0.0 2023-10-13 14:19:37,781 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1407984.6666666667, ans=0.125 2023-10-13 14:19:55,671 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.86 vs. 
limit=10.0 2023-10-13 14:20:01,431 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1408078.0, ans=0.0 2023-10-13 14:20:32,495 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.79 vs. limit=10.0 2023-10-13 14:20:40,236 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1408264.6666666667, ans=0.125 2023-10-13 14:20:42,658 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1408264.6666666667, ans=0.2 2023-10-13 14:20:53,220 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1408311.3333333333, ans=0.1 2023-10-13 14:21:02,989 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.36 vs. limit=6.0 2023-10-13 14:21:12,593 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_ff3.min_abs, batch_count=1408358.0, ans=0.2 2023-10-13 14:21:24,606 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.473e+02 1.766e+02 1.943e+02 2.105e+02 3.134e+02, threshold=3.886e+02, percent-clipped=0.0 2023-10-13 14:21:32,041 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1408451.3333333333, ans=0.125 2023-10-13 14:21:38,614 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.60 vs. limit=15.0 2023-10-13 14:21:43,661 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1408498.0, ans=0.1 2023-10-13 14:21:52,328 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1408544.6666666667, ans=0.125 2023-10-13 14:22:02,907 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1408591.3333333333, ans=0.125 2023-10-13 14:22:28,593 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1408684.6666666667, ans=15.0 2023-10-13 14:22:45,581 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1408731.3333333333, ans=0.0 2023-10-13 14:22:54,773 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1408778.0, ans=0.125 2023-10-13 14:22:57,440 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1408778.0, ans=0.125 2023-10-13 14:22:58,468 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=1408778.0, ans=0.5 2023-10-13 14:23:09,810 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1408824.6666666667, ans=0.125 2023-10-13 14:23:13,634 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.35 vs. 
limit=15.0 2023-10-13 14:23:19,896 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1408871.3333333333, ans=0.0 2023-10-13 14:23:29,862 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.402e+02 1.687e+02 1.879e+02 2.091e+02 2.743e+02, threshold=3.757e+02, percent-clipped=0.0 2023-10-13 14:23:31,001 INFO [train.py:1031] (0/4) Epoch 23, batch 1500, loss[loss=0.176, simple_loss=0.2665, pruned_loss=0.04277, over 15380.00 frames. ], tot_loss[loss=0.1872, simple_loss=0.2794, pruned_loss=0.04755, over 17363714.13 frames. ], batch size: 35, lr: 1.51e-03, grad_scale: 16.0 2023-10-13 14:23:33,167 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1408918.0, ans=0.1 2023-10-13 14:23:33,453 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.59 vs. limit=15.0 2023-10-13 14:23:47,107 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=1408964.6666666667, ans=0.5 2023-10-13 14:23:51,598 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1408964.6666666667, ans=0.0 2023-10-13 14:23:56,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1409011.3333333333, ans=0.0 2023-10-13 14:24:14,580 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1409058.0, ans=0.0 2023-10-13 14:24:20,323 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.26 vs. limit=15.0 2023-10-13 14:24:23,904 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1409104.6666666667, ans=0.2 2023-10-13 14:24:24,915 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1409104.6666666667, ans=0.0 2023-10-13 14:24:32,668 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1409151.3333333333, ans=0.07 2023-10-13 14:24:46,019 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.70 vs. limit=12.0 2023-10-13 14:25:01,991 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1409244.6666666667, ans=0.0 2023-10-13 14:25:07,582 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.80 vs. 
limit=15.0 2023-10-13 14:25:21,246 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1409338.0, ans=0.0 2023-10-13 14:25:30,093 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.734e+02 1.871e+02 2.078e+02 2.945e+02, threshold=3.741e+02, percent-clipped=0.0 2023-10-13 14:25:44,160 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1409431.3333333333, ans=0.125 2023-10-13 14:25:50,278 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1409431.3333333333, ans=0.2 2023-10-13 14:25:56,410 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1409478.0, ans=0.0 2023-10-13 14:26:01,655 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 14:26:19,739 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1409524.6666666667, ans=0.07 2023-10-13 14:26:30,322 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.45 vs. limit=15.0 2023-10-13 14:26:33,154 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1409618.0, ans=0.125 2023-10-13 14:26:33,903 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1409618.0, ans=0.125 2023-10-13 14:26:49,206 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1409664.6666666667, ans=0.1 2023-10-13 14:26:50,230 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1409664.6666666667, ans=0.125 2023-10-13 14:26:56,112 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1409664.6666666667, ans=0.0 2023-10-13 14:27:07,256 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1409758.0, ans=0.07 2023-10-13 14:27:11,375 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1409758.0, ans=0.1 2023-10-13 14:27:14,291 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1409758.0, ans=0.125 2023-10-13 14:27:29,266 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.499e+02 1.858e+02 2.034e+02 2.316e+02 3.427e+02, threshold=4.069e+02, percent-clipped=0.0 2023-10-13 14:27:34,510 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1409851.3333333333, ans=0.0 2023-10-13 14:27:42,574 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=15.45 vs. 
limit=22.5 2023-10-13 14:28:11,042 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1409991.3333333333, ans=0.2 2023-10-13 14:28:19,392 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1410038.0, ans=0.0 2023-10-13 14:28:19,440 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1410038.0, ans=0.1 2023-10-13 14:28:34,779 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1410084.6666666667, ans=0.125 2023-10-13 14:28:37,589 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1410131.3333333333, ans=0.0 2023-10-13 14:28:39,858 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1410131.3333333333, ans=0.1 2023-10-13 14:28:51,869 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1410178.0, ans=0.125 2023-10-13 14:28:53,662 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1410178.0, ans=0.125 2023-10-13 14:29:23,263 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.61 vs. limit=15.0 2023-10-13 14:29:27,309 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.527e+02 1.797e+02 1.978e+02 2.189e+02 3.307e+02, threshold=3.956e+02, percent-clipped=0.0 2023-10-13 14:29:42,082 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1410364.6666666667, ans=15.0 2023-10-13 14:29:44,341 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1410364.6666666667, ans=0.05 2023-10-13 14:30:02,409 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1410458.0, ans=0.125 2023-10-13 14:30:03,603 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1410458.0, ans=0.125 2023-10-13 14:30:12,144 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1410504.6666666667, ans=0.125 2023-10-13 14:30:20,156 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=1410504.6666666667, ans=0.025 2023-10-13 14:30:50,481 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1410644.6666666667, ans=0.0 2023-10-13 14:31:00,072 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.18 vs. limit=15.0 2023-10-13 14:31:00,307 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.73 vs. 
limit=15.0 2023-10-13 14:31:02,317 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1410691.3333333333, ans=0.125 2023-10-13 14:31:02,515 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.31 vs. limit=15.0 2023-10-13 14:31:06,955 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1410691.3333333333, ans=0.5 2023-10-13 14:31:19,780 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.484e+02 1.721e+02 1.939e+02 2.209e+02 3.132e+02, threshold=3.878e+02, percent-clipped=0.0 2023-10-13 14:31:26,121 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1410784.6666666667, ans=0.125 2023-10-13 14:31:29,047 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1410784.6666666667, ans=0.0 2023-10-13 14:31:45,259 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1410878.0, ans=0.0 2023-10-13 14:31:45,382 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1410878.0, ans=0.125 2023-10-13 14:32:20,422 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1410971.3333333333, ans=0.1 2023-10-13 14:32:30,936 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1411018.0, ans=0.125 2023-10-13 14:32:39,206 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1411064.6666666667, ans=0.2 2023-10-13 14:32:49,506 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 14:32:51,397 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.67 vs. limit=6.0 2023-10-13 14:32:57,225 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=5.10 vs. limit=15.0 2023-10-13 14:33:14,733 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.91 vs. limit=15.0 2023-10-13 14:33:18,529 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1411204.6666666667, ans=0.95 2023-10-13 14:33:31,336 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.768e+02 1.960e+02 2.183e+02 2.979e+02, threshold=3.920e+02, percent-clipped=0.0 2023-10-13 14:33:31,388 INFO [train.py:1031] (0/4) Epoch 23, batch 2000, loss[loss=0.1901, simple_loss=0.2862, pruned_loss=0.047, over 16970.00 frames. ], tot_loss[loss=0.1874, simple_loss=0.2798, pruned_loss=0.04755, over 20782004.53 frames. 
], batch size: 123, lr: 1.51e-03, grad_scale: 32.0 2023-10-13 14:33:36,385 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1411251.3333333333, ans=0.025 2023-10-13 14:33:50,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1411298.0, ans=0.125 2023-10-13 14:33:51,565 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1411298.0, ans=0.0 2023-10-13 14:34:03,449 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=1411298.0, ans=0.02 2023-10-13 14:34:58,196 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 14:35:11,244 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1411578.0, ans=0.07 2023-10-13 14:35:24,349 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1411624.6666666667, ans=0.0 2023-10-13 14:35:37,215 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1411671.3333333333, ans=0.0 2023-10-13 14:35:53,557 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.460e+02 1.763e+02 1.946e+02 2.151e+02 2.790e+02, threshold=3.893e+02, percent-clipped=0.0 2023-10-13 14:35:54,749 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1411718.0, ans=0.0 2023-10-13 14:36:14,071 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1411764.6666666667, ans=0.125 2023-10-13 14:36:39,062 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1411811.3333333333, ans=0.125 2023-10-13 14:36:48,014 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1411858.0, ans=0.1 2023-10-13 14:36:54,992 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1411858.0, ans=0.0 2023-10-13 14:36:59,234 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1411904.6666666667, ans=0.0 2023-10-13 14:37:04,457 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.07 vs. 
limit=10.0 2023-10-13 14:37:08,303 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1411904.6666666667, ans=0.125 2023-10-13 14:37:35,492 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1411998.0, ans=0.04949747468305833 2023-10-13 14:37:40,373 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1412044.6666666667, ans=0.125 2023-10-13 14:37:46,166 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1412044.6666666667, ans=0.0 2023-10-13 14:38:01,485 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1412138.0, ans=0.125 2023-10-13 14:38:10,395 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1412138.0, ans=0.125 2023-10-13 14:38:14,013 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.517e+02 1.852e+02 1.976e+02 2.212e+02 3.114e+02, threshold=3.951e+02, percent-clipped=0.0 2023-10-13 14:38:15,377 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.37 vs. limit=10.0 2023-10-13 14:38:46,576 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1412278.0, ans=0.2 2023-10-13 14:38:47,675 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1412278.0, ans=0.125 2023-10-13 14:39:30,407 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1412464.6666666667, ans=0.2 2023-10-13 14:39:30,749 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.62 vs. 
limit=22.5 2023-10-13 14:39:40,144 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1412511.3333333333, ans=0.125 2023-10-13 14:39:51,134 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1412558.0, ans=0.125 2023-10-13 14:39:57,636 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1412558.0, ans=0.125 2023-10-13 14:40:01,067 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1412558.0, ans=0.035 2023-10-13 14:40:14,188 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.788e+02 1.970e+02 2.119e+02 3.374e+02, threshold=3.939e+02, percent-clipped=0.0 2023-10-13 14:40:21,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1412651.3333333333, ans=0.1 2023-10-13 14:40:28,970 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1412698.0, ans=0.125 2023-10-13 14:40:29,733 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1412698.0, ans=0.0 2023-10-13 14:40:37,802 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1412744.6666666667, ans=0.0 2023-10-13 14:40:41,320 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1412744.6666666667, ans=0.125 2023-10-13 14:40:58,829 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 14:41:08,352 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1412884.6666666667, ans=0.0 2023-10-13 14:41:16,591 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1412884.6666666667, ans=0.0 2023-10-13 14:41:21,824 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1412931.3333333333, ans=0.09899494936611666 2023-10-13 14:41:38,045 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1412978.0, ans=0.0 2023-10-13 14:41:50,109 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1413024.6666666667, ans=0.125 2023-10-13 14:42:05,917 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.533e+02 1.816e+02 2.001e+02 2.316e+02 2.970e+02, threshold=4.002e+02, percent-clipped=0.0 2023-10-13 14:42:28,966 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1413211.3333333333, ans=15.0 2023-10-13 14:42:40,520 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1413258.0, ans=0.0 2023-10-13 14:43:02,026 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1413351.3333333333, ans=0.1 2023-10-13 14:43:11,723 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1413398.0, ans=0.07 2023-10-13 14:43:20,948 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1413444.6666666667, ans=0.2 2023-10-13 14:43:27,192 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.max_abs, batch_count=1413444.6666666667, ans=10.0 2023-10-13 14:43:35,366 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1413491.3333333333, ans=0.04949747468305833 2023-10-13 14:43:35,616 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.60 vs. limit=10.0 2023-10-13 14:43:51,556 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1413538.0, ans=0.125 2023-10-13 14:43:52,372 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1413538.0, ans=0.09899494936611666 2023-10-13 14:43:54,017 INFO [train.py:1031] (0/4) Epoch 23, batch 2500, loss[loss=0.191, simple_loss=0.2816, pruned_loss=0.05021, over 16862.00 frames. ], tot_loss[loss=0.1881, simple_loss=0.2801, pruned_loss=0.04808, over 23413004.81 frames. ], batch size: 130, lr: 1.51e-03, grad_scale: 32.0 2023-10-13 14:43:54,931 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.813e+02 1.957e+02 2.130e+02 3.349e+02, threshold=3.914e+02, percent-clipped=0.0 2023-10-13 14:43:55,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1413584.6666666667, ans=0.1 2023-10-13 14:44:05,706 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1413631.3333333333, ans=0.125 2023-10-13 14:44:06,002 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.34 vs. 
limit=10.0 2023-10-13 14:44:27,388 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1413724.6666666667, ans=0.04949747468305833 2023-10-13 14:44:29,243 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1413724.6666666667, ans=0.125 2023-10-13 14:44:29,366 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1413724.6666666667, ans=0.1 2023-10-13 14:44:42,259 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1413771.3333333333, ans=0.1 2023-10-13 14:45:38,350 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1414004.6666666667, ans=0.125 2023-10-13 14:45:39,355 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1414004.6666666667, ans=0.1 2023-10-13 14:45:41,893 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1414051.3333333333, ans=0.125 2023-10-13 14:45:43,527 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.515e+02 1.827e+02 1.995e+02 2.238e+02 2.966e+02, threshold=3.989e+02, percent-clipped=0.0 2023-10-13 14:45:47,399 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1414051.3333333333, ans=0.125 2023-10-13 14:45:53,137 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1414098.0, ans=0.025 2023-10-13 14:45:53,461 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.13 vs. limit=15.0 2023-10-13 14:45:54,367 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1414098.0, ans=0.1 2023-10-13 14:46:09,771 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.73 vs. limit=10.0 2023-10-13 14:46:14,062 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1414144.6666666667, ans=0.125 2023-10-13 14:46:14,874 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1414144.6666666667, ans=0.2 2023-10-13 14:46:30,423 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1414238.0, ans=0.125 2023-10-13 14:46:37,440 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1414238.0, ans=0.0 2023-10-13 14:46:50,065 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1414331.3333333333, ans=0.0 2023-10-13 14:46:51,097 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.51 vs. 
limit=15.0 2023-10-13 14:47:06,557 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1414378.0, ans=0.0 2023-10-13 14:47:36,278 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.517e+02 1.791e+02 1.910e+02 2.109e+02 2.876e+02, threshold=3.819e+02, percent-clipped=0.0 2023-10-13 14:47:41,297 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-10-13 14:47:48,560 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1414564.6666666667, ans=0.2 2023-10-13 14:47:51,610 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1414564.6666666667, ans=0.125 2023-10-13 14:47:54,558 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1414564.6666666667, ans=0.0 2023-10-13 14:47:54,849 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.94 vs. limit=6.0 2023-10-13 14:48:01,702 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1414611.3333333333, ans=0.1 2023-10-13 14:48:18,350 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.75 vs. limit=22.5 2023-10-13 14:48:36,594 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.71 vs. limit=15.0 2023-10-13 14:48:39,670 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1414751.3333333333, ans=0.0 2023-10-13 14:48:48,904 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1414751.3333333333, ans=0.125 2023-10-13 14:49:22,560 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.03 vs. limit=15.0 2023-10-13 14:49:32,185 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.89 vs. limit=15.0 2023-10-13 14:49:32,633 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1414938.0, ans=0.1 2023-10-13 14:49:33,933 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1414938.0, ans=0.125 2023-10-13 14:49:44,038 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.456e+02 1.750e+02 1.869e+02 2.071e+02 2.548e+02, threshold=3.738e+02, percent-clipped=0.0 2023-10-13 14:50:01,741 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.min_positive, batch_count=1415031.3333333333, ans=0.025 2023-10-13 14:50:02,919 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.35 vs. 
limit=10.0 2023-10-13 14:50:06,269 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1415078.0, ans=0.09899494936611666 2023-10-13 14:50:12,463 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1415078.0, ans=0.125 2023-10-13 14:50:20,214 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1415124.6666666667, ans=0.2 2023-10-13 14:50:48,180 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1415218.0, ans=0.125 2023-10-13 14:51:08,612 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1415264.6666666667, ans=0.1 2023-10-13 14:51:15,928 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.10 vs. limit=15.0 2023-10-13 14:51:36,647 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1415358.0, ans=0.125 2023-10-13 14:51:55,174 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.787e+02 1.966e+02 2.219e+02 3.154e+02, threshold=3.932e+02, percent-clipped=0.0 2023-10-13 14:51:57,778 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.30 vs. limit=22.5 2023-10-13 14:51:58,866 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.90 vs. limit=15.0 2023-10-13 14:52:17,831 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1415544.6666666667, ans=0.0 2023-10-13 14:52:30,974 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1415591.3333333333, ans=0.1 2023-10-13 14:52:35,890 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1415591.3333333333, ans=0.0 2023-10-13 14:52:44,592 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff3.min_abs, batch_count=1415638.0, ans=0.2 2023-10-13 14:52:53,047 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1415684.6666666667, ans=0.0 2023-10-13 14:53:04,735 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1415731.3333333333, ans=0.125 2023-10-13 14:53:07,146 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1415731.3333333333, ans=0.0 2023-10-13 14:53:13,237 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.29 vs. 
limit=15.0 2023-10-13 14:53:14,541 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1415778.0, ans=0.2 2023-10-13 14:53:28,460 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1415824.6666666667, ans=0.125 2023-10-13 14:53:33,024 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.64 vs. limit=22.5 2023-10-13 14:53:45,565 INFO [train.py:1031] (0/4) Epoch 23, batch 3000, loss[loss=0.1719, simple_loss=0.2633, pruned_loss=0.0402, over 15279.00 frames. ], tot_loss[loss=0.1877, simple_loss=0.2795, pruned_loss=0.04799, over 25527329.50 frames. ], batch size: 35, lr: 1.51e-03, grad_scale: 32.0 2023-10-13 14:53:46,456 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.789e+02 1.927e+02 2.117e+02 2.963e+02, threshold=3.853e+02, percent-clipped=0.0 2023-10-13 14:53:47,777 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1415918.0, ans=0.125 2023-10-13 14:53:51,087 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1415918.0, ans=0.125 2023-10-13 14:53:56,372 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1415964.6666666667, ans=0.0 2023-10-13 14:54:13,435 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1416011.3333333333, ans=0.1 2023-10-13 14:54:24,652 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1416058.0, ans=0.0 2023-10-13 14:54:32,187 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1416104.6666666667, ans=0.1 2023-10-13 14:54:47,471 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1416151.3333333333, ans=0.2 2023-10-13 14:54:48,816 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1416151.3333333333, ans=0.1 2023-10-13 14:54:53,147 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1416198.0, ans=0.125 2023-10-13 14:55:12,186 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1416244.6666666667, ans=0.07 2023-10-13 14:55:24,048 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1416291.3333333333, ans=0.2 2023-10-13 14:55:38,119 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1416338.0, ans=0.125 2023-10-13 14:55:43,775 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.580e+02 1.951e+02 2.200e+02 2.557e+02 3.361e+02, threshold=4.399e+02, percent-clipped=0.0 2023-10-13 14:55:54,105 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1416384.6666666667, ans=0.0 2023-10-13 14:55:57,321 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1416431.3333333333, ans=0.125 
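
Two relationships in the surrounding records can be checked directly from the logged numbers. Each optim.py record reports "Clipping_scale=2.0, grad-norm quartiles a b c d e, threshold=t", and t consistently equals 2.0 times the median c (e.g. 2.0 * 1.976e+02 ≈ 3.951e+02 in the 14:38:14 record, and 2.0 * 1.939e+02 = 3.878e+02 in the first record of this stretch). The train.py summaries are likewise consistent with loss = 0.5 * simple_loss + pruned_loss (at batch 2000 above: 0.5 * 0.2798 + 0.04755 = 0.18745 ≈ 0.1874). The sketch below is a minimal reader's reconstruction of both mechanisms under those assumptions; the function names, window handling, and breakpoint schedule are illustrative, not the actual optim.py/scaling.py implementations.

# Reader's sketch, assuming threshold = clipping_scale * median of recent
# per-batch grad norms, and that the ScheduledFloat "ans" values are a
# piecewise-linear function of batch_count. All names here are hypothetical.
from collections import deque

import torch


def grad_norm_stats(recent_norms: deque, clipping_scale: float = 2.0):
    """Quartiles of a window of recent grad norms plus the clipping threshold.

    Reproduces the logged pattern: for quartiles 1.484e+02 1.721e+02
    1.939e+02 2.209e+02 3.132e+02, threshold = 2.0 * 1.939e+02 = 3.878e+02.
    """
    norms = torch.tensor(list(recent_norms), dtype=torch.float32)
    quartiles = torch.quantile(
        norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0])
    )
    threshold = clipping_scale * quartiles[2].item()  # 2 * median
    return quartiles, threshold


def scheduled_float(batch_count: float, schedule):
    """Piecewise-linear interpolation of a scheduled hyperparameter.

    `schedule` is a list of (batch_count, value) breakpoints; the logged
    `ans` would be the interpolated value at the current batch_count.
    """
    x0, y0 = schedule[0]
    if batch_count <= x0:
        return y0
    for x1, y1 in schedule[1:]:
        if batch_count <= x1:
            return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)
        x0, y0 = x1, y1
    return y0  # past the last breakpoint, hold the final value


# Example with a hypothetical schedule: a skip rate annealed from 0.2 to 0.0
# over the first 4000 batches has long since reached 0.0 by batch_count
# ~1.41e6, matching the many "...skip_rate, ..., ans=0.0" records above.
print(scheduled_float(1410878.0, [(0.0, 0.2), (4000.0, 0.0)]))  # -> 0.0

window = deque([148.4, 172.1, 193.9, 220.9, 313.2], maxlen=500)
print(grad_norm_stats(window, clipping_scale=2.0))  # threshold -> 387.8

With this reading, percent-clipped=0.0 in most records is consistent with the top quartile of recent norms sitting below the threshold, and the occasional percent-clipped=1.0 with a rare batch whose norm exceeded twice the running median.
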
2023-10-13 14:56:51,836 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1416618.0, ans=0.2 2023-10-13 14:56:55,203 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1416664.6666666667, ans=0.0 2023-10-13 14:57:02,034 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1416664.6666666667, ans=0.0 2023-10-13 14:57:02,416 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.04 vs. limit=10.0 2023-10-13 14:57:06,965 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1416711.3333333333, ans=0.125 2023-10-13 14:57:36,636 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.554e+02 1.744e+02 1.939e+02 2.110e+02 2.800e+02, threshold=3.877e+02, percent-clipped=0.0 2023-10-13 14:57:45,182 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1416851.3333333333, ans=0.125 2023-10-13 14:57:56,305 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1416898.0, ans=0.125 2023-10-13 14:58:30,123 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.91 vs. limit=15.0 2023-10-13 14:58:30,711 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1417038.0, ans=0.125 2023-10-13 14:58:34,594 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.77 vs. limit=6.0 2023-10-13 14:58:41,220 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1417084.6666666667, ans=0.0 2023-10-13 14:58:53,805 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.75 vs. 
limit=12.0 2023-10-13 14:58:59,665 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1417131.3333333333, ans=0.125 2023-10-13 14:59:07,304 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=1417131.3333333333, ans=0.2 2023-10-13 14:59:11,853 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.whiten.whitening_limit, batch_count=1417178.0, ans=12.0 2023-10-13 14:59:19,114 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1417178.0, ans=0.04949747468305833 2023-10-13 14:59:39,508 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1417271.3333333333, ans=0.1 2023-10-13 14:59:46,050 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.436e+02 1.821e+02 1.994e+02 2.176e+02 2.867e+02, threshold=3.988e+02, percent-clipped=0.0 2023-10-13 14:59:56,499 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1417364.6666666667, ans=0.125 2023-10-13 15:00:19,445 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1417458.0, ans=0.125 2023-10-13 15:00:26,301 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1417458.0, ans=0.0 2023-10-13 15:00:41,541 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1417551.3333333333, ans=0.1 2023-10-13 15:00:54,244 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 15:00:57,130 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1417598.0, ans=0.125 2023-10-13 15:01:22,686 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1417691.3333333333, ans=0.125 2023-10-13 15:01:31,972 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.87 vs. 
limit=22.5 2023-10-13 15:01:32,603 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1417738.0, ans=0.0 2023-10-13 15:01:43,246 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.588e+02 1.801e+02 1.965e+02 2.181e+02 3.016e+02, threshold=3.929e+02, percent-clipped=0.0 2023-10-13 15:01:45,761 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1417784.6666666667, ans=0.1 2023-10-13 15:01:47,349 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1417784.6666666667, ans=0.125 2023-10-13 15:01:50,419 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=1417784.6666666667, ans=0.1 2023-10-13 15:01:51,548 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1417784.6666666667, ans=0.0 2023-10-13 15:01:57,460 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 15:02:15,568 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1417878.0, ans=0.2 2023-10-13 15:02:17,612 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1417924.6666666667, ans=0.125 2023-10-13 15:02:19,551 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-13 15:02:46,007 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 15:03:02,023 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.39 vs. limit=22.5 2023-10-13 15:03:12,807 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1418158.0, ans=0.0 2023-10-13 15:03:27,848 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1418204.6666666667, ans=0.125 2023-10-13 15:03:35,510 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1418204.6666666667, ans=0.04949747468305833 2023-10-13 15:03:36,934 INFO [train.py:1031] (0/4) Epoch 23, batch 3500, loss[loss=0.2035, simple_loss=0.2926, pruned_loss=0.05715, over 16491.00 frames. ], tot_loss[loss=0.1881, simple_loss=0.2797, pruned_loss=0.04825, over 27128668.28 frames. 
], batch size: 267, lr: 1.51e-03, grad_scale: 32.0 2023-10-13 15:03:39,013 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.533e+02 1.867e+02 2.006e+02 2.213e+02 3.297e+02, threshold=4.011e+02, percent-clipped=0.0 2023-10-13 15:03:41,623 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1418251.3333333333, ans=0.0 2023-10-13 15:03:58,270 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1418298.0, ans=0.1 2023-10-13 15:04:01,721 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1418344.6666666667, ans=0.125 2023-10-13 15:04:02,990 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-10-13 15:04:07,987 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.25 vs. limit=12.0 2023-10-13 15:04:21,739 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1418391.3333333333, ans=0.125 2023-10-13 15:04:21,882 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.07 vs. limit=15.0 2023-10-13 15:04:27,968 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.73 vs. limit=22.5 2023-10-13 15:04:43,246 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1418484.6666666667, ans=0.05 2023-10-13 15:04:52,304 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1418531.3333333333, ans=0.0 2023-10-13 15:05:17,561 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1418578.0, ans=0.125 2023-10-13 15:05:18,642 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1418624.6666666667, ans=10.0 2023-10-13 15:05:19,617 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1418624.6666666667, ans=0.1 2023-10-13 15:05:28,696 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-304000.pt 2023-10-13 15:05:46,759 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.549e+02 1.771e+02 1.925e+02 2.185e+02 3.228e+02, threshold=3.850e+02, percent-clipped=0.0 2023-10-13 15:06:26,036 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1418858.0, ans=0.0 2023-10-13 15:06:42,037 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.75 vs. limit=15.0 2023-10-13 15:07:10,904 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.74 vs. 
limit=15.0 2023-10-13 15:07:12,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1419044.6666666667, ans=0.125 2023-10-13 15:07:16,298 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1419044.6666666667, ans=0.1 2023-10-13 15:07:43,222 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1419184.6666666667, ans=0.0 2023-10-13 15:07:44,953 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.705e+02 1.877e+02 2.098e+02 2.446e+02, threshold=3.754e+02, percent-clipped=0.0 2023-10-13 15:07:49,304 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1419184.6666666667, ans=0.125 2023-10-13 15:07:57,566 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1419231.3333333333, ans=0.125 2023-10-13 15:07:57,900 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.06 vs. limit=15.0 2023-10-13 15:08:30,966 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.71 vs. limit=22.5 2023-10-13 15:08:40,797 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1419371.3333333333, ans=0.0 2023-10-13 15:08:43,884 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.47 vs. limit=12.0 2023-10-13 15:08:44,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1419418.0, ans=0.0 2023-10-13 15:08:44,827 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.07 vs. limit=15.0 2023-10-13 15:08:49,017 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1419418.0, ans=0.0 2023-10-13 15:09:07,796 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.35 vs. limit=22.5 2023-10-13 15:09:17,936 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1419511.3333333333, ans=0.2 2023-10-13 15:09:21,817 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1419558.0, ans=0.125 2023-10-13 15:09:28,709 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.11 vs. limit=15.0 2023-10-13 15:09:30,860 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn1.whiten.whitening_limit, batch_count=1419558.0, ans=22.5 2023-10-13 15:09:31,463 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1419604.6666666667, ans=0.125 2023-10-13 15:09:39,512 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.05 vs. 
limit=15.0 2023-10-13 15:09:46,822 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.551e+02 1.758e+02 1.871e+02 2.118e+02 2.968e+02, threshold=3.743e+02, percent-clipped=0.0 2023-10-13 15:10:15,931 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1419744.6666666667, ans=0.125 2023-10-13 15:10:31,849 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1419838.0, ans=0.1 2023-10-13 15:10:38,869 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.32 vs. limit=15.0 2023-10-13 15:11:24,484 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1420024.6666666667, ans=0.125 2023-10-13 15:11:29,278 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1420071.3333333333, ans=0.125 2023-10-13 15:11:32,078 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1420071.3333333333, ans=0.125 2023-10-13 15:11:33,632 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.77 vs. limit=6.0 2023-10-13 15:11:42,613 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.415e+02 1.737e+02 1.956e+02 2.141e+02 2.948e+02, threshold=3.913e+02, percent-clipped=0.0 2023-10-13 15:11:53,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1420164.6666666667, ans=0.125 2023-10-13 15:12:19,422 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1420258.0, ans=0.125 2023-10-13 15:12:19,783 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.67 vs. limit=22.5 2023-10-13 15:12:20,284 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1420258.0, ans=0.05 2023-10-13 15:12:31,285 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1420304.6666666667, ans=0.1 2023-10-13 15:12:39,713 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1420351.3333333333, ans=0.0 2023-10-13 15:12:42,634 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1420351.3333333333, ans=0.0 2023-10-13 15:12:45,499 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.26 vs. 
limit=15.0 2023-10-13 15:12:50,996 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1420398.0, ans=0.0 2023-10-13 15:12:53,651 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1420398.0, ans=0.125 2023-10-13 15:12:53,682 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1420398.0, ans=0.0 2023-10-13 15:13:10,537 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1420491.3333333333, ans=0.125 2023-10-13 15:13:29,917 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.36 vs. limit=10.0 2023-10-13 15:13:32,704 INFO [train.py:1031] (0/4) Epoch 23, batch 4000, loss[loss=0.1831, simple_loss=0.2777, pruned_loss=0.04425, over 16593.00 frames. ], tot_loss[loss=0.1882, simple_loss=0.2795, pruned_loss=0.04846, over 28375464.54 frames. ], batch size: 66, lr: 1.50e-03, grad_scale: 32.0 2023-10-13 15:13:35,091 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.782e+02 1.996e+02 2.148e+02 3.690e+02, threshold=3.992e+02, percent-clipped=0.0 2023-10-13 15:13:44,221 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1420584.6666666667, ans=0.0 2023-10-13 15:13:52,092 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1420631.3333333333, ans=0.2 2023-10-13 15:13:56,507 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1420631.3333333333, ans=0.125 2023-10-13 15:13:57,633 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1420678.0, ans=0.2 2023-10-13 15:14:07,169 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1420678.0, ans=0.125 2023-10-13 15:14:14,399 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1420724.6666666667, ans=0.125 2023-10-13 15:14:14,541 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.93 vs. limit=10.0 2023-10-13 15:14:18,176 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1420724.6666666667, ans=0.0 2023-10-13 15:14:20,892 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.28 vs. 
limit=22.5 2023-10-13 15:14:36,543 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1420818.0, ans=0.125 2023-10-13 15:14:37,252 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1420818.0, ans=0.125 2023-10-13 15:14:45,848 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1420864.6666666667, ans=0.1 2023-10-13 15:14:48,270 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1420864.6666666667, ans=0.2 2023-10-13 15:14:49,224 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1420864.6666666667, ans=0.035 2023-10-13 15:15:16,101 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1420958.0, ans=6.0 2023-10-13 15:15:20,648 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.78 vs. limit=15.0 2023-10-13 15:15:23,827 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1421004.6666666667, ans=0.125 2023-10-13 15:15:31,435 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1421051.3333333333, ans=0.2 2023-10-13 15:15:32,716 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.477e+02 1.815e+02 1.954e+02 2.238e+02 2.920e+02, threshold=3.907e+02, percent-clipped=0.0 2023-10-13 15:15:39,692 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.11 vs. 
limit=15.0 2023-10-13 15:15:41,861 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1421098.0, ans=0.125 2023-10-13 15:16:26,276 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 15:16:56,418 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1421378.0, ans=0.125 2023-10-13 15:17:40,924 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.828e+02 2.022e+02 2.247e+02 3.126e+02, threshold=4.045e+02, percent-clipped=0.0 2023-10-13 15:17:46,869 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1421518.0, ans=0.0 2023-10-13 15:18:42,697 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1421751.3333333333, ans=0.2 2023-10-13 15:19:21,504 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1421891.3333333333, ans=0.035 2023-10-13 15:19:37,211 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1421938.0, ans=0.125 2023-10-13 15:19:41,534 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1421984.6666666667, ans=0.125 2023-10-13 15:19:42,195 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.775e+02 1.955e+02 2.168e+02 3.437e+02, threshold=3.910e+02, percent-clipped=0.0 2023-10-13 15:19:58,366 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1422031.3333333333, ans=0.0 2023-10-13 15:19:59,125 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1422031.3333333333, ans=0.0 2023-10-13 15:20:04,334 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.35 vs. limit=15.0 2023-10-13 15:20:10,060 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1422078.0, ans=0.125 2023-10-13 15:20:12,118 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.00 vs. 
limit=15.0 2023-10-13 15:20:17,409 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1422124.6666666667, ans=0.1 2023-10-13 15:20:21,975 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1422124.6666666667, ans=0.125 2023-10-13 15:20:30,068 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1422171.3333333333, ans=0.0 2023-10-13 15:20:34,929 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1422171.3333333333, ans=0.0 2023-10-13 15:20:38,181 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1422171.3333333333, ans=0.0 2023-10-13 15:20:39,223 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.84 vs. limit=10.0 2023-10-13 15:20:43,420 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1422218.0, ans=0.1 2023-10-13 15:20:51,241 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1422264.6666666667, ans=0.1 2023-10-13 15:20:53,328 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.33 vs. limit=15.0 2023-10-13 15:20:57,022 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1422264.6666666667, ans=0.0 2023-10-13 15:21:09,731 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1422311.3333333333, ans=0.125 2023-10-13 15:21:29,024 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1422358.0, ans=0.125 2023-10-13 15:21:40,779 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 15:21:45,107 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.684e+02 1.982e+02 2.189e+02 2.540e+02 3.990e+02, threshold=4.379e+02, percent-clipped=1.0 2023-10-13 15:21:55,522 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 15:22:04,601 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1422498.0, ans=0.2 2023-10-13 15:22:06,650 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 15:22:59,350 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1422684.6666666667, ans=0.125 2023-10-13 15:23:01,116 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1422684.6666666667, ans=0.95 2023-10-13 15:23:54,770 INFO [train.py:1031] (0/4) Epoch 23, batch 4500, loss[loss=0.1674, simple_loss=0.267, pruned_loss=0.03395, over 16826.00 frames. ], tot_loss[loss=0.1885, simple_loss=0.2801, pruned_loss=0.04843, over 29365073.57 frames. 
], batch size: 175, lr: 1.50e-03, grad_scale: 32.0 2023-10-13 15:23:55,187 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1422918.0, ans=0.125 2023-10-13 15:23:55,214 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1422918.0, ans=0.125 2023-10-13 15:23:59,054 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.399e+02 1.821e+02 1.951e+02 2.134e+02 2.875e+02, threshold=3.901e+02, percent-clipped=0.0 2023-10-13 15:24:00,289 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1422918.0, ans=0.0 2023-10-13 15:24:20,040 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1423011.3333333333, ans=0.125 2023-10-13 15:24:21,246 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.74 vs. limit=6.0 2023-10-13 15:24:30,448 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1423058.0, ans=0.125 2023-10-13 15:24:36,335 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1423058.0, ans=0.04949747468305833 2023-10-13 15:24:47,624 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1423104.6666666667, ans=0.1 2023-10-13 15:24:55,770 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1423151.3333333333, ans=0.1 2023-10-13 15:25:05,327 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 15:25:17,001 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=1423244.6666666667, ans=0.2 2023-10-13 15:25:37,986 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1423338.0, ans=0.5 2023-10-13 15:25:49,290 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.552e+02 1.807e+02 1.963e+02 2.156e+02 3.122e+02, threshold=3.925e+02, percent-clipped=0.0 2023-10-13 15:26:03,996 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1423431.3333333333, ans=10.0 2023-10-13 15:26:04,106 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1423431.3333333333, ans=0.125 2023-10-13 15:26:09,835 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1423478.0, ans=0.125 2023-10-13 15:26:15,421 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1423478.0, ans=0.0 2023-10-13 15:26:25,923 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1423524.6666666667, ans=0.125 2023-10-13 15:26:37,206 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1423571.3333333333, ans=0.1 2023-10-13 15:26:40,999 INFO [scaling.py:199] (0/4) 
ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1423618.0, ans=0.0 2023-10-13 15:27:02,724 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1423711.3333333333, ans=0.125 2023-10-13 15:27:05,550 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.20 vs. limit=10.0 2023-10-13 15:27:19,103 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1423758.0, ans=0.125 2023-10-13 15:27:21,000 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1423758.0, ans=0.125 2023-10-13 15:27:24,712 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.05 vs. limit=15.0 2023-10-13 15:27:25,292 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1423804.6666666667, ans=0.125 2023-10-13 15:27:25,587 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.47 vs. limit=22.5 2023-10-13 15:27:39,144 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.885e+02 2.091e+02 2.439e+02 4.031e+02, threshold=4.183e+02, percent-clipped=1.0 2023-10-13 15:28:06,264 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1423944.6666666667, ans=0.125 2023-10-13 15:28:25,260 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1424038.0, ans=0.125 2023-10-13 15:28:36,398 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1424084.6666666667, ans=0.125 2023-10-13 15:28:36,661 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.87 vs. limit=15.0 2023-10-13 15:28:38,631 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1424084.6666666667, ans=0.125 2023-10-13 15:28:43,980 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.60 vs. 
limit=15.0 2023-10-13 15:28:45,805 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1424131.3333333333, ans=0.1 2023-10-13 15:28:48,373 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1424131.3333333333, ans=0.025 2023-10-13 15:28:55,124 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1424178.0, ans=0.1 2023-10-13 15:29:07,911 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1424224.6666666667, ans=0.5 2023-10-13 15:29:12,706 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1424271.3333333333, ans=0.125 2023-10-13 15:29:17,778 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1424271.3333333333, ans=0.125 2023-10-13 15:29:23,791 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.78 vs. limit=15.0 2023-10-13 15:29:26,567 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.555e+02 1.829e+02 2.045e+02 2.258e+02 3.006e+02, threshold=4.090e+02, percent-clipped=0.0 2023-10-13 15:29:45,070 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1424364.6666666667, ans=0.1 2023-10-13 15:29:56,496 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1424411.3333333333, ans=0.125 2023-10-13 15:30:04,856 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1424458.0, ans=0.125 2023-10-13 15:30:14,773 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.23 vs. limit=15.0 2023-10-13 15:30:19,739 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.61 vs. limit=12.0 2023-10-13 15:30:27,032 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.62 vs. 
limit=15.0 2023-10-13 15:30:46,519 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 15:31:06,876 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1424691.3333333333, ans=0.0 2023-10-13 15:31:16,597 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1424738.0, ans=0.125 2023-10-13 15:31:24,869 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1424784.6666666667, ans=0.0 2023-10-13 15:31:27,282 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.727e+02 1.911e+02 2.120e+02 2.693e+02, threshold=3.823e+02, percent-clipped=0.0 2023-10-13 15:31:33,271 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1424784.6666666667, ans=0.2 2023-10-13 15:31:50,902 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 15:32:00,555 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1424924.6666666667, ans=0.1 2023-10-13 15:32:12,973 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.80 vs. limit=15.0 2023-10-13 15:32:16,567 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.38 vs. limit=22.5 2023-10-13 15:32:26,928 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.61 vs. limit=15.0 2023-10-13 15:32:29,135 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1425018.0, ans=0.0 2023-10-13 15:32:43,641 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1425064.6666666667, ans=0.1 2023-10-13 15:33:25,062 INFO [train.py:1031] (0/4) Epoch 23, batch 5000, loss[loss=0.1869, simple_loss=0.2786, pruned_loss=0.0476, over 15496.00 frames. ], tot_loss[loss=0.1883, simple_loss=0.2796, pruned_loss=0.04849, over 30084339.82 frames. 
], batch size: 35, lr: 1.50e-03, grad_scale: 32.0 2023-10-13 15:33:27,703 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.564e+02 1.874e+02 2.072e+02 2.282e+02 2.927e+02, threshold=4.144e+02, percent-clipped=0.0 2023-10-13 15:33:28,091 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1425251.3333333333, ans=0.125 2023-10-13 15:33:30,613 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1425251.3333333333, ans=0.125 2023-10-13 15:33:31,886 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1425251.3333333333, ans=0.2 2023-10-13 15:33:33,958 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1425251.3333333333, ans=0.125 2023-10-13 15:33:36,340 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1425251.3333333333, ans=0.125 2023-10-13 15:33:40,503 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1425298.0, ans=0.07 2023-10-13 15:33:40,794 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.15 vs. limit=15.0 2023-10-13 15:33:46,516 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1425298.0, ans=0.1 2023-10-13 15:34:04,354 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1425391.3333333333, ans=0.0 2023-10-13 15:34:31,201 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1425484.6666666667, ans=0.0 2023-10-13 15:34:58,544 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1425578.0, ans=0.1 2023-10-13 15:35:21,729 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1425671.3333333333, ans=0.1 2023-10-13 15:35:23,472 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1425671.3333333333, ans=0.2 2023-10-13 15:35:36,222 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.529e+02 1.789e+02 1.982e+02 2.203e+02 2.970e+02, threshold=3.964e+02, percent-clipped=0.0 2023-10-13 15:36:22,115 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1425858.0, ans=0.04949747468305833 2023-10-13 15:36:38,442 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.86 vs. 
limit=15.0 2023-10-13 15:36:40,833 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1425951.3333333333, ans=0.0 2023-10-13 15:36:57,309 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1425998.0, ans=0.125 2023-10-13 15:37:05,949 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1426044.6666666667, ans=0.0 2023-10-13 15:37:23,063 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.86 vs. limit=15.0 2023-10-13 15:37:38,659 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1426138.0, ans=0.125 2023-10-13 15:37:46,177 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.556e+02 1.865e+02 2.029e+02 2.239e+02 3.831e+02, threshold=4.057e+02, percent-clipped=0.0 2023-10-13 15:37:46,683 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.99 vs. limit=22.5 2023-10-13 15:38:01,512 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 15:38:34,754 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.73 vs. limit=15.0 2023-10-13 15:38:35,619 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1426371.3333333333, ans=0.2 2023-10-13 15:38:45,236 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1426418.0, ans=0.0 2023-10-13 15:38:50,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1426418.0, ans=0.0 2023-10-13 15:39:24,484 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1426558.0, ans=0.125 2023-10-13 15:39:31,793 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.55 vs. limit=15.0 2023-10-13 15:39:47,874 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.455e+02 1.694e+02 1.917e+02 2.185e+02 2.847e+02, threshold=3.834e+02, percent-clipped=0.0 2023-10-13 15:39:48,998 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1426651.3333333333, ans=0.0 2023-10-13 15:40:00,848 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.61 vs. limit=6.0 2023-10-13 15:40:01,499 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.28 vs. limit=22.5 2023-10-13 15:40:32,169 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1426791.3333333333, ans=0.1 2023-10-13 15:40:48,875 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.43 vs. 
limit=15.0 2023-10-13 15:41:04,573 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.69 vs. limit=15.0 2023-10-13 15:41:11,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1426978.0, ans=0.2 2023-10-13 15:41:18,657 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1426978.0, ans=0.125 2023-10-13 15:41:22,379 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1426978.0, ans=0.2 2023-10-13 15:41:23,313 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1427024.6666666667, ans=0.125 2023-10-13 15:41:38,798 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1427071.3333333333, ans=0.125 2023-10-13 15:41:49,586 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1427118.0, ans=0.0 2023-10-13 15:41:50,192 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.662e+02 1.859e+02 2.111e+02 2.676e+02, threshold=3.717e+02, percent-clipped=0.0 2023-10-13 15:42:07,295 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.23 vs. limit=10.0 2023-10-13 15:42:10,072 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1427211.3333333333, ans=0.125 2023-10-13 15:42:27,151 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0 2023-10-13 15:42:30,084 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.69 vs. limit=6.0 2023-10-13 15:43:34,455 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.81 vs. limit=15.0 2023-10-13 15:43:38,042 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1427538.0, ans=0.0 2023-10-13 15:43:45,961 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1427584.6666666667, ans=0.125 2023-10-13 15:43:46,653 INFO [train.py:1031] (0/4) Epoch 23, batch 5500, loss[loss=0.1855, simple_loss=0.2664, pruned_loss=0.0523, over 16689.00 frames. ], tot_loss[loss=0.188, simple_loss=0.2796, pruned_loss=0.04827, over 30710958.00 frames. 
], batch size: 56, lr: 1.50e-03, grad_scale: 32.0 2023-10-13 15:43:50,940 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.496e+02 1.795e+02 1.911e+02 2.137e+02 2.758e+02, threshold=3.822e+02, percent-clipped=0.0 2023-10-13 15:44:02,750 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1427631.3333333333, ans=0.1 2023-10-13 15:44:10,699 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1427678.0, ans=0.125 2023-10-13 15:44:13,290 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 15:44:20,557 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.09 vs. limit=12.0 2023-10-13 15:44:22,362 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.09 vs. limit=22.5 2023-10-13 15:44:36,791 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1427771.3333333333, ans=0.0 2023-10-13 15:44:40,504 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1427771.3333333333, ans=0.125 2023-10-13 15:45:02,144 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1427864.6666666667, ans=0.0 2023-10-13 15:45:30,488 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1428004.6666666667, ans=0.07 2023-10-13 15:45:35,751 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.51 vs. limit=12.0 2023-10-13 15:45:36,254 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1428004.6666666667, ans=0.1 2023-10-13 15:45:46,255 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.459e+02 1.780e+02 1.897e+02 2.131e+02 2.790e+02, threshold=3.794e+02, percent-clipped=0.0 2023-10-13 15:46:00,069 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1428098.0, ans=0.0 2023-10-13 15:46:02,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1428098.0, ans=0.125 2023-10-13 15:46:14,105 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1428191.3333333333, ans=0.125 2023-10-13 15:46:40,214 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.49 vs. 
limit=15.0 2023-10-13 15:47:17,776 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1428424.6666666667, ans=0.0 2023-10-13 15:47:40,185 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1428518.0, ans=0.0 2023-10-13 15:47:43,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1428518.0, ans=0.0 2023-10-13 15:47:45,849 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.601e+02 1.862e+02 2.120e+02 2.487e+02 3.238e+02, threshold=4.239e+02, percent-clipped=0.0 2023-10-13 15:47:58,753 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1428564.6666666667, ans=0.1 2023-10-13 15:48:14,136 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=1428658.0, ans=0.05 2023-10-13 15:48:16,613 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1428658.0, ans=0.0 2023-10-13 15:48:44,850 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1428751.3333333333, ans=0.125 2023-10-13 15:49:09,286 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1428891.3333333333, ans=0.125 2023-10-13 15:49:11,963 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.42 vs. limit=15.0 2023-10-13 15:49:25,139 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 15:49:29,639 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.78 vs. limit=15.0 2023-10-13 15:49:34,642 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1428984.6666666667, ans=0.125 2023-10-13 15:49:39,016 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 1.783e+02 1.977e+02 2.195e+02 3.165e+02, threshold=3.955e+02, percent-clipped=0.0 2023-10-13 15:49:47,908 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1429031.3333333333, ans=0.1 2023-10-13 15:49:54,121 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.05 vs. 
limit=22.5 2023-10-13 15:50:18,674 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1429124.6666666667, ans=0.125 2023-10-13 15:50:18,703 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1429124.6666666667, ans=0.2 2023-10-13 15:50:21,202 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1429124.6666666667, ans=0.1 2023-10-13 15:50:30,463 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1429171.3333333333, ans=0.1 2023-10-13 15:50:46,758 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1429218.0, ans=0.1 2023-10-13 15:50:46,841 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1429218.0, ans=0.05 2023-10-13 15:50:50,598 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1429218.0, ans=0.1 2023-10-13 15:50:55,774 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1429264.6666666667, ans=0.0 2023-10-13 15:50:57,408 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1429264.6666666667, ans=0.015 2023-10-13 15:50:57,591 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1429264.6666666667, ans=0.125 2023-10-13 15:51:16,209 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1429358.0, ans=0.0 2023-10-13 15:51:18,270 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1429358.0, ans=0.2 2023-10-13 15:51:30,538 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.11 vs. limit=22.5 2023-10-13 15:51:38,700 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 15:51:46,832 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.857e+02 1.999e+02 2.203e+02 3.065e+02, threshold=3.999e+02, percent-clipped=0.0 2023-10-13 15:52:10,607 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.25 vs. 
limit=15.0 2023-10-13 15:52:18,884 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1429591.3333333333, ans=0.0 2023-10-13 15:52:19,873 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1429591.3333333333, ans=0.125 2023-10-13 15:52:33,598 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1429638.0, ans=0.1 2023-10-13 15:52:47,154 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1429684.6666666667, ans=0.0 2023-10-13 15:52:47,183 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1429684.6666666667, ans=0.0 2023-10-13 15:53:01,906 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 15:53:02,796 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1429778.0, ans=0.125 2023-10-13 15:53:06,366 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.04 vs. limit=15.0 2023-10-13 15:53:09,759 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1429778.0, ans=0.0 2023-10-13 15:53:09,893 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1429778.0, ans=0.125 2023-10-13 15:53:15,259 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1429824.6666666667, ans=0.0 2023-10-13 15:53:18,993 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1429824.6666666667, ans=0.0 2023-10-13 15:53:20,441 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1429824.6666666667, ans=0.1 2023-10-13 15:53:41,361 INFO [train.py:1031] (0/4) Epoch 23, batch 6000, loss[loss=0.1959, simple_loss=0.2888, pruned_loss=0.05149, over 16821.00 frames. ], tot_loss[loss=0.1886, simple_loss=0.28, pruned_loss=0.04859, over 31156221.34 frames. 
], batch size: 175, lr: 1.50e-03, grad_scale: 32.0 2023-10-13 15:53:48,768 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.748e+02 1.983e+02 2.154e+02 2.929e+02, threshold=3.966e+02, percent-clipped=0.0 2023-10-13 15:53:54,710 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1429964.6666666667, ans=0.125 2023-10-13 15:54:17,256 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1430011.3333333333, ans=0.0 2023-10-13 15:54:26,240 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 15:54:39,509 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1430104.6666666667, ans=0.125 2023-10-13 15:54:59,974 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1430198.0, ans=0.1 2023-10-13 15:55:01,011 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1430198.0, ans=0.125 2023-10-13 15:55:17,322 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1430244.6666666667, ans=0.0 2023-10-13 15:55:26,299 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.60 vs. limit=15.0 2023-10-13 15:55:29,227 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1430291.3333333333, ans=0.0 2023-10-13 15:55:30,167 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1430291.3333333333, ans=0.125 2023-10-13 15:55:33,203 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.83 vs. limit=22.5 2023-10-13 15:55:35,195 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.26 vs. limit=15.0 2023-10-13 15:55:46,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1430384.6666666667, ans=0.0 2023-10-13 15:55:50,022 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1430384.6666666667, ans=0.05 2023-10-13 15:55:52,139 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.467e+02 1.777e+02 1.921e+02 2.075e+02 2.854e+02, threshold=3.841e+02, percent-clipped=0.0 2023-10-13 15:56:11,547 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.15 vs. limit=15.0 2023-10-13 15:56:13,046 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1430478.0, ans=0.0 2023-10-13 15:56:32,701 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.11 vs. 
limit=12.0 2023-10-13 15:56:51,273 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1430618.0, ans=0.125 2023-10-13 15:57:01,296 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1430664.6666666667, ans=0.0 2023-10-13 15:57:33,209 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1430804.6666666667, ans=0.0 2023-10-13 15:57:41,297 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1430804.6666666667, ans=0.125 2023-10-13 15:57:43,734 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1430804.6666666667, ans=0.125 2023-10-13 15:57:52,845 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.523e+02 1.822e+02 1.956e+02 2.154e+02 2.813e+02, threshold=3.911e+02, percent-clipped=0.0 2023-10-13 15:57:53,446 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.57 vs. limit=15.0 2023-10-13 15:58:02,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1430898.0, ans=0.125 2023-10-13 15:58:11,937 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1430944.6666666667, ans=0.125 2023-10-13 15:58:17,136 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1430944.6666666667, ans=0.0 2023-10-13 15:59:24,325 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1431178.0, ans=0.125 2023-10-13 16:00:07,842 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.521e+02 1.843e+02 2.036e+02 2.211e+02 3.149e+02, threshold=4.072e+02, percent-clipped=0.0 2023-10-13 16:00:17,896 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1431364.6666666667, ans=0.0 2023-10-13 16:00:39,649 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.22 vs. limit=15.0 2023-10-13 16:01:14,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1431551.3333333333, ans=0.2 2023-10-13 16:01:17,239 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.45 vs. 
limit=15.0 2023-10-13 16:01:24,925 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1431598.0, ans=0.1 2023-10-13 16:01:44,887 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1431644.6666666667, ans=0.125 2023-10-13 16:01:51,050 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1431691.3333333333, ans=0.0 2023-10-13 16:02:20,134 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.802e+02 1.953e+02 2.191e+02 3.655e+02, threshold=3.906e+02, percent-clipped=0.0 2023-10-13 16:02:23,534 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1431831.3333333333, ans=0.07 2023-10-13 16:03:05,260 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.39 vs. limit=15.0 2023-10-13 16:03:26,963 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1432064.6666666667, ans=0.0 2023-10-13 16:03:49,206 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1432111.3333333333, ans=0.025 2023-10-13 16:03:57,169 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.37 vs. limit=6.0 2023-10-13 16:04:16,368 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1432251.3333333333, ans=0.1 2023-10-13 16:04:17,697 INFO [train.py:1031] (0/4) Epoch 23, batch 6500, loss[loss=0.1621, simple_loss=0.2573, pruned_loss=0.0335, over 16447.00 frames. ], tot_loss[loss=0.1887, simple_loss=0.2803, pruned_loss=0.04858, over 31516749.66 frames. ], batch size: 50, lr: 1.50e-03, grad_scale: 16.0 2023-10-13 16:04:22,455 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1432251.3333333333, ans=0.0 2023-10-13 16:04:28,918 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.537e+02 1.901e+02 2.138e+02 2.399e+02 2.972e+02, threshold=4.275e+02, percent-clipped=0.0 2023-10-13 16:04:32,100 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.42 vs. limit=22.5 2023-10-13 16:04:36,105 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1432298.0, ans=0.125 2023-10-13 16:05:09,528 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1432391.3333333333, ans=0.0 2023-10-13 16:05:33,457 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.17 vs. 
limit=15.0 2023-10-13 16:06:03,080 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1432578.0, ans=0.2 2023-10-13 16:06:09,807 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1432624.6666666667, ans=0.125 2023-10-13 16:06:25,675 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1432671.3333333333, ans=0.025 2023-10-13 16:06:27,071 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.13 vs. limit=15.0 2023-10-13 16:06:42,679 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.369e+02 1.787e+02 1.902e+02 2.050e+02 2.910e+02, threshold=3.805e+02, percent-clipped=0.0 2023-10-13 16:06:56,312 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1432764.6666666667, ans=0.0 2023-10-13 16:06:58,957 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1432811.3333333333, ans=0.0 2023-10-13 16:07:19,810 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.68 vs. limit=22.5 2023-10-13 16:07:42,970 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1432951.3333333333, ans=0.07 2023-10-13 16:07:54,221 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1432998.0, ans=0.1 2023-10-13 16:08:06,686 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1433044.6666666667, ans=0.0 2023-10-13 16:08:41,154 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.517e+02 1.786e+02 1.996e+02 2.181e+02 3.146e+02, threshold=3.992e+02, percent-clipped=0.0 2023-10-13 16:09:24,479 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.10 vs. limit=22.5 2023-10-13 16:09:43,823 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1433418.0, ans=0.125 2023-10-13 16:09:45,566 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1433464.6666666667, ans=0.2 2023-10-13 16:09:48,221 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1433464.6666666667, ans=0.125 2023-10-13 16:09:55,918 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.38 vs. limit=15.0 2023-10-13 16:10:36,749 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1433604.6666666667, ans=0.05 2023-10-13 16:10:48,287 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1433651.3333333333, ans=0.125 2023-10-13 16:10:57,159 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.34 vs. 
limit=22.5 2023-10-13 16:10:57,454 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.339e+02 1.687e+02 1.878e+02 2.141e+02 3.146e+02, threshold=3.757e+02, percent-clipped=0.0 2023-10-13 16:11:11,044 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1433698.0, ans=0.1 2023-10-13 16:11:11,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1433698.0, ans=0.125 2023-10-13 16:11:22,312 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.92 vs. limit=15.0 2023-10-13 16:11:27,870 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1433791.3333333333, ans=0.125 2023-10-13 16:11:35,277 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.79 vs. limit=6.0 2023-10-13 16:11:53,643 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.46 vs. limit=6.0 2023-10-13 16:12:10,499 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.54 vs. limit=15.0 2023-10-13 16:12:48,179 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1434071.3333333333, ans=0.125 2023-10-13 16:12:53,959 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.63 vs. limit=15.0 2023-10-13 16:13:03,698 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.393e+02 1.694e+02 1.874e+02 2.041e+02 2.904e+02, threshold=3.749e+02, percent-clipped=0.0 2023-10-13 16:13:05,841 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.32 vs. limit=12.0 2023-10-13 16:13:36,148 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1434258.0, ans=0.0 2023-10-13 16:13:46,251 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1434304.6666666667, ans=0.125 2023-10-13 16:13:46,362 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1434304.6666666667, ans=0.125 2023-10-13 16:13:49,780 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=1434304.6666666667, ans=0.5 2023-10-13 16:13:54,823 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1434351.3333333333, ans=0.1 2023-10-13 16:13:57,365 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1434351.3333333333, ans=0.125 2023-10-13 16:14:06,952 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.56 vs. 
limit=15.0 2023-10-13 16:14:19,544 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1434444.6666666667, ans=0.0 2023-10-13 16:14:24,407 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.02 vs. limit=6.0 2023-10-13 16:14:37,322 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1434491.3333333333, ans=0.125 2023-10-13 16:14:47,090 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1434538.0, ans=0.125 2023-10-13 16:14:56,210 INFO [train.py:1031] (0/4) Epoch 23, batch 7000, loss[loss=0.1836, simple_loss=0.2807, pruned_loss=0.04331, over 16807.00 frames. ], tot_loss[loss=0.1889, simple_loss=0.2808, pruned_loss=0.04852, over 31789026.36 frames. ], batch size: 98, lr: 1.50e-03, grad_scale: 16.0 2023-10-13 16:14:56,381 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1434584.6666666667, ans=0.125 2023-10-13 16:14:57,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1434584.6666666667, ans=0.125 2023-10-13 16:14:59,599 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.99 vs. limit=15.0 2023-10-13 16:15:04,632 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.530e+02 1.834e+02 1.957e+02 2.282e+02 3.219e+02, threshold=3.914e+02, percent-clipped=0.0 2023-10-13 16:15:23,234 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1434678.0, ans=0.1 2023-10-13 16:15:24,606 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 16:16:05,102 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1434818.0, ans=0.2 2023-10-13 16:16:12,880 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.66 vs. 
limit=15.0 2023-10-13 16:16:15,664 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=1434864.6666666667, ans=22.5 2023-10-13 16:16:17,817 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1434864.6666666667, ans=0.125 2023-10-13 16:16:20,287 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1434864.6666666667, ans=0.125 2023-10-13 16:16:27,203 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1434911.3333333333, ans=0.1 2023-10-13 16:16:28,296 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1434911.3333333333, ans=0.0 2023-10-13 16:16:37,708 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1434958.0, ans=0.0 2023-10-13 16:16:37,862 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.32 vs. limit=15.0 2023-10-13 16:16:59,446 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1435051.3333333333, ans=0.0 2023-10-13 16:17:01,365 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1435051.3333333333, ans=0.125 2023-10-13 16:17:02,511 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1435051.3333333333, ans=0.2 2023-10-13 16:17:03,511 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.555e+02 1.828e+02 1.958e+02 2.290e+02 2.902e+02, threshold=3.916e+02, percent-clipped=0.0 2023-10-13 16:17:08,719 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1435098.0, ans=0.2 2023-10-13 16:17:09,560 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1435098.0, ans=0.125 2023-10-13 16:17:10,490 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1435098.0, ans=0.0 2023-10-13 16:17:10,551 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1435098.0, ans=0.125 2023-10-13 16:17:21,939 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1435144.6666666667, ans=0.125 2023-10-13 16:17:27,117 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.56 vs. 
limit=22.5 2023-10-13 16:17:30,462 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1435191.3333333333, ans=0.0 2023-10-13 16:17:34,501 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1435191.3333333333, ans=0.0 2023-10-13 16:17:39,932 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1435238.0, ans=0.1 2023-10-13 16:17:40,086 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1435238.0, ans=0.2 2023-10-13 16:17:56,836 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.84 vs. limit=10.0 2023-10-13 16:17:58,509 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1435284.6666666667, ans=0.125 2023-10-13 16:17:59,989 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1435284.6666666667, ans=0.025 2023-10-13 16:18:05,864 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 16:18:10,185 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1435331.3333333333, ans=0.125 2023-10-13 16:18:11,351 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.38 vs. limit=15.0 2023-10-13 16:18:22,991 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.11 vs. limit=6.0 2023-10-13 16:18:33,693 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1435424.6666666667, ans=0.125 2023-10-13 16:18:49,354 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.95 vs. 
limit=10.0 2023-10-13 16:19:01,914 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.494e+02 1.802e+02 1.986e+02 2.152e+02 2.844e+02, threshold=3.973e+02, percent-clipped=0.0 2023-10-13 16:19:02,863 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1435564.6666666667, ans=0.1 2023-10-13 16:19:41,181 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=1435658.0, ans=0.02 2023-10-13 16:19:46,224 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1435704.6666666667, ans=0.2 2023-10-13 16:19:50,918 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1435704.6666666667, ans=0.1 2023-10-13 16:19:56,468 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1435704.6666666667, ans=0.125 2023-10-13 16:20:04,265 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1435751.3333333333, ans=0.125 2023-10-13 16:20:24,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1435798.0, ans=0.125 2023-10-13 16:20:30,024 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1435844.6666666667, ans=0.125 2023-10-13 16:20:46,028 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.86 vs. limit=15.0 2023-10-13 16:20:59,629 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1435938.0, ans=0.125 2023-10-13 16:21:04,723 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1435984.6666666667, ans=0.1 2023-10-13 16:21:11,935 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1435984.6666666667, ans=0.05 2023-10-13 16:21:13,278 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.52 vs. 
limit=22.5 2023-10-13 16:21:17,550 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.564e+02 1.797e+02 2.032e+02 2.285e+02 3.283e+02, threshold=4.064e+02, percent-clipped=0.0 2023-10-13 16:21:35,535 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1436078.0, ans=0.2 2023-10-13 16:21:50,075 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1436124.6666666667, ans=0.125 2023-10-13 16:22:20,859 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1436264.6666666667, ans=0.1 2023-10-13 16:22:23,407 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1436264.6666666667, ans=0.0 2023-10-13 16:22:35,853 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1436311.3333333333, ans=0.0 2023-10-13 16:22:54,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1436404.6666666667, ans=0.125 2023-10-13 16:22:59,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1436404.6666666667, ans=0.0 2023-10-13 16:23:13,685 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1436451.3333333333, ans=0.1 2023-10-13 16:23:14,646 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.462e+02 1.751e+02 1.904e+02 2.128e+02 2.870e+02, threshold=3.808e+02, percent-clipped=0.0 2023-10-13 16:23:15,004 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1436498.0, ans=0.125 2023-10-13 16:23:20,394 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.39 vs. limit=15.0 2023-10-13 16:23:37,115 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1436544.6666666667, ans=0.0 2023-10-13 16:23:55,785 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1436638.0, ans=0.95 2023-10-13 16:24:03,921 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1436684.6666666667, ans=0.125 2023-10-13 16:24:37,924 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1436824.6666666667, ans=0.125 2023-10-13 16:24:38,782 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1436824.6666666667, ans=0.125 2023-10-13 16:25:00,690 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.63 vs. limit=15.0 2023-10-13 16:25:01,471 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1436918.0, ans=0.125 2023-10-13 16:25:01,620 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=16.55 vs. 
limit=15.0 2023-10-13 16:25:02,058 INFO [train.py:1031] (0/4) Epoch 23, batch 7500, loss[loss=0.1817, simple_loss=0.2531, pruned_loss=0.05516, over 12390.00 frames. ], tot_loss[loss=0.189, simple_loss=0.2806, pruned_loss=0.04868, over 31968436.41 frames. ], batch size: 440, lr: 1.50e-03, grad_scale: 32.0 2023-10-13 16:25:04,581 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.79 vs. limit=15.0 2023-10-13 16:25:07,632 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1436918.0, ans=0.125 2023-10-13 16:25:11,648 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.500e+02 1.763e+02 1.970e+02 2.127e+02 2.890e+02, threshold=3.941e+02, percent-clipped=0.0 2023-10-13 16:25:22,028 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.22 vs. limit=6.0 2023-10-13 16:25:37,711 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.57 vs. limit=15.0 2023-10-13 16:25:38,523 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1437058.0, ans=0.0 2023-10-13 16:25:50,416 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.39 vs. limit=15.0 2023-10-13 16:25:53,693 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1437104.6666666667, ans=0.125 2023-10-13 16:26:00,229 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.25 vs. limit=22.5 2023-10-13 16:26:01,296 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1437151.3333333333, ans=0.07 2023-10-13 16:26:03,339 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1437151.3333333333, ans=0.125 2023-10-13 16:26:19,919 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1437198.0, ans=0.125 2023-10-13 16:26:46,553 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1437291.3333333333, ans=0.125 2023-10-13 16:26:48,432 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1437291.3333333333, ans=0.125 2023-10-13 16:26:48,530 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1437291.3333333333, ans=0.0 2023-10-13 16:26:57,642 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1437338.0, ans=0.125 2023-10-13 16:27:04,895 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1437384.6666666667, ans=0.1 2023-10-13 16:27:08,240 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.19 vs. 
limit=15.0 2023-10-13 16:27:15,319 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.572e+02 1.770e+02 1.941e+02 2.174e+02 3.115e+02, threshold=3.882e+02, percent-clipped=0.0 2023-10-13 16:27:48,344 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1437524.6666666667, ans=0.0 2023-10-13 16:27:52,058 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1437524.6666666667, ans=0.125 2023-10-13 16:28:12,702 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten.whitening_limit, batch_count=1437571.3333333333, ans=15.0 2023-10-13 16:28:16,668 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1437618.0, ans=0.0 2023-10-13 16:28:21,627 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1437618.0, ans=0.035 2023-10-13 16:28:21,674 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1437618.0, ans=0.125 2023-10-13 16:28:23,361 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1437618.0, ans=0.125 2023-10-13 16:28:28,549 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.11 vs. limit=10.0 2023-10-13 16:29:03,200 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1437758.0, ans=0.1 2023-10-13 16:29:06,630 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.55 vs. limit=22.5 2023-10-13 16:29:20,258 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1437804.6666666667, ans=0.125 2023-10-13 16:29:40,069 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.519e+02 1.769e+02 1.911e+02 2.100e+02 3.111e+02, threshold=3.822e+02, percent-clipped=0.0 2023-10-13 16:29:49,863 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1437944.6666666667, ans=0.0 2023-10-13 16:30:01,991 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 16:30:09,597 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.05 vs. limit=15.0 2023-10-13 16:30:25,307 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1438038.0, ans=0.0 2023-10-13 16:30:26,645 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.36 vs. 
limit=15.0 2023-10-13 16:30:27,395 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1438038.0, ans=0.1 2023-10-13 16:30:39,848 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1438084.6666666667, ans=15.0 2023-10-13 16:30:40,909 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1438084.6666666667, ans=0.125 2023-10-13 16:30:44,861 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1438131.3333333333, ans=0.2 2023-10-13 16:30:56,087 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1438178.0, ans=0.5 2023-10-13 16:30:57,922 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1438178.0, ans=0.125 2023-10-13 16:31:04,026 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1438178.0, ans=0.125 2023-10-13 16:31:17,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1438271.3333333333, ans=0.1 2023-10-13 16:31:27,849 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1438318.0, ans=0.125 2023-10-13 16:31:31,611 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.41 vs. limit=15.0 2023-10-13 16:31:43,099 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.551e+02 1.851e+02 2.001e+02 2.200e+02 4.052e+02, threshold=4.001e+02, percent-clipped=1.0 2023-10-13 16:31:53,428 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.19 vs. limit=15.0 2023-10-13 16:32:09,807 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.43 vs. limit=10.0 2023-10-13 16:32:37,150 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1438551.3333333333, ans=0.0 2023-10-13 16:32:51,661 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1438598.0, ans=0.125 2023-10-13 16:32:54,228 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.94 vs. 
limit=12.0 2023-10-13 16:33:35,229 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1438738.0, ans=0.0 2023-10-13 16:34:00,379 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.777e+02 1.947e+02 2.096e+02 2.525e+02, threshold=3.893e+02, percent-clipped=0.0 2023-10-13 16:34:21,688 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1438878.0, ans=0.125 2023-10-13 16:34:23,918 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1438878.0, ans=0.125 2023-10-13 16:34:40,812 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1438971.3333333333, ans=0.035 2023-10-13 16:34:41,886 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1438971.3333333333, ans=0.125 2023-10-13 16:35:06,182 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.33 vs. limit=12.0 2023-10-13 16:35:11,616 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1439064.6666666667, ans=0.2 2023-10-13 16:35:21,444 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1439111.3333333333, ans=0.05 2023-10-13 16:35:24,373 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1439111.3333333333, ans=0.0 2023-10-13 16:35:33,857 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.36 vs. limit=15.0 2023-10-13 16:35:55,716 INFO [train.py:1031] (0/4) Epoch 23, batch 8000, loss[loss=0.1776, simple_loss=0.2735, pruned_loss=0.04082, over 16909.00 frames. ], tot_loss[loss=0.1882, simple_loss=0.2801, pruned_loss=0.04816, over 32177353.58 frames. ], batch size: 72, lr: 1.50e-03, grad_scale: 16.0 2023-10-13 16:36:06,756 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1439298.0, ans=0.125 2023-10-13 16:36:08,383 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.441e+02 1.666e+02 1.834e+02 2.012e+02 3.002e+02, threshold=3.668e+02, percent-clipped=0.0 2023-10-13 16:36:18,382 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1439344.6666666667, ans=0.1 2023-10-13 16:36:23,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1439344.6666666667, ans=0.0 2023-10-13 16:36:33,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1439391.3333333333, ans=0.025 2023-10-13 16:36:36,920 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.76 vs. 
limit=22.5 2023-10-13 16:37:12,278 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1439531.3333333333, ans=0.125 2023-10-13 16:37:35,215 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1439624.6666666667, ans=0.2 2023-10-13 16:38:00,931 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1439764.6666666667, ans=0.0 2023-10-13 16:38:02,947 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.421e+02 1.750e+02 1.900e+02 2.051e+02 2.511e+02, threshold=3.801e+02, percent-clipped=0.0 2023-10-13 16:38:07,470 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1439764.6666666667, ans=0.2 2023-10-13 16:38:22,871 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.91 vs. limit=15.0 2023-10-13 16:38:36,876 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1439904.6666666667, ans=0.0 2023-10-13 16:39:28,688 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.30 vs. limit=15.0 2023-10-13 16:39:50,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1440138.0, ans=0.0 2023-10-13 16:39:55,560 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1440138.0, ans=0.125 2023-10-13 16:40:14,605 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.825e+02 1.970e+02 2.326e+02 3.328e+02, threshold=3.941e+02, percent-clipped=0.0 2023-10-13 16:41:49,517 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1440511.3333333333, ans=0.1 2023-10-13 16:42:01,828 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1440558.0, ans=0.1 2023-10-13 16:42:14,500 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1440604.6666666667, ans=0.2 2023-10-13 16:42:33,907 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1440698.0, ans=0.125 2023-10-13 16:42:35,965 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.510e+02 1.792e+02 1.949e+02 2.201e+02 2.893e+02, threshold=3.898e+02, percent-clipped=0.0 2023-10-13 16:42:42,360 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1440698.0, ans=0.0 2023-10-13 16:42:49,505 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.17 vs. limit=8.0 2023-10-13 16:42:53,069 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.15 vs. limit=15.0 2023-10-13 16:42:53,826 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.38 vs. 
limit=15.0 2023-10-13 16:42:56,389 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1440744.6666666667, ans=0.2 2023-10-13 16:43:32,702 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1440884.6666666667, ans=0.0 2023-10-13 16:43:41,888 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.61 vs. limit=15.0 2023-10-13 16:44:32,355 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1441118.0, ans=0.0 2023-10-13 16:44:38,904 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.803e+02 1.996e+02 2.210e+02 3.797e+02, threshold=3.992e+02, percent-clipped=0.0 2023-10-13 16:45:34,834 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.30 vs. limit=15.0 2023-10-13 16:46:16,341 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1441491.3333333333, ans=0.0 2023-10-13 16:46:25,872 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1441538.0, ans=0.125 2023-10-13 16:46:27,951 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.11 vs. limit=15.0 2023-10-13 16:46:34,570 INFO [train.py:1031] (0/4) Epoch 23, batch 8500, loss[loss=0.1937, simple_loss=0.2866, pruned_loss=0.05034, over 16884.00 frames. ], tot_loss[loss=0.1884, simple_loss=0.2805, pruned_loss=0.04815, over 32335550.64 frames. 
], batch size: 116, lr: 1.49e-03, grad_scale: 32.0 2023-10-13 16:46:48,745 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.483e+02 1.800e+02 1.973e+02 2.175e+02 2.720e+02, threshold=3.946e+02, percent-clipped=0.0 2023-10-13 16:46:49,001 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1441631.3333333333, ans=0.0 2023-10-13 16:47:11,133 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1441678.0, ans=0.5 2023-10-13 16:47:28,953 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1441771.3333333333, ans=0.125 2023-10-13 16:47:30,177 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1441771.3333333333, ans=0.125 2023-10-13 16:47:38,854 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1441818.0, ans=0.2 2023-10-13 16:47:39,686 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1441818.0, ans=0.035 2023-10-13 16:47:46,447 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1441818.0, ans=0.1 2023-10-13 16:47:56,927 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1441864.6666666667, ans=0.125 2023-10-13 16:47:58,268 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1441864.6666666667, ans=0.1 2023-10-13 16:48:26,927 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1441958.0, ans=0.125 2023-10-13 16:48:34,168 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1442004.6666666667, ans=0.125 2023-10-13 16:48:52,112 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1442051.3333333333, ans=0.0 2023-10-13 16:49:02,573 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.568e+02 1.845e+02 2.071e+02 2.367e+02 3.411e+02, threshold=4.142e+02, percent-clipped=0.0 2023-10-13 16:49:28,079 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.37 vs. 
limit=15.0 2023-10-13 16:49:41,226 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=1442238.0, ans=0.025 2023-10-13 16:49:43,121 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1442238.0, ans=0.125 2023-10-13 16:50:00,705 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1442284.6666666667, ans=0.2 2023-10-13 16:50:12,512 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1442331.3333333333, ans=0.0 2023-10-13 16:50:14,223 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1442331.3333333333, ans=0.125 2023-10-13 16:50:29,021 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1442378.0, ans=0.125 2023-10-13 16:50:31,593 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1442378.0, ans=0.1 2023-10-13 16:50:38,137 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1442378.0, ans=0.1 2023-10-13 16:50:46,225 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1442424.6666666667, ans=0.2 2023-10-13 16:51:08,032 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.02 vs. limit=12.0 2023-10-13 16:51:12,028 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1442518.0, ans=0.2 2023-10-13 16:51:20,180 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1442564.6666666667, ans=0.125 2023-10-13 16:51:24,051 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1442564.6666666667, ans=0.2 2023-10-13 16:51:24,485 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.359e+02 1.672e+02 1.824e+02 1.947e+02 2.708e+02, threshold=3.648e+02, percent-clipped=0.0 2023-10-13 16:51:27,172 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.81 vs. limit=15.0 2023-10-13 16:51:36,973 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1442611.3333333333, ans=0.125 2023-10-13 16:51:43,594 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.09 vs. limit=15.0 2023-10-13 16:51:51,071 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1442658.0, ans=0.125 2023-10-13 16:51:58,111 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.35 vs. 
limit=15.0 2023-10-13 16:52:00,224 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1442704.6666666667, ans=0.0 2023-10-13 16:52:20,268 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.75 vs. limit=12.0 2023-10-13 16:52:34,759 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1442798.0, ans=0.125 2023-10-13 16:52:39,166 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1442798.0, ans=0.125 2023-10-13 16:53:02,466 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.09 vs. limit=12.0 2023-10-13 16:53:07,692 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1442938.0, ans=0.0 2023-10-13 16:53:34,302 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1443031.3333333333, ans=0.0 2023-10-13 16:53:36,005 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.715e+02 1.886e+02 2.042e+02 3.010e+02, threshold=3.772e+02, percent-clipped=0.0 2023-10-13 16:54:07,340 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1443124.6666666667, ans=0.125 2023-10-13 16:54:25,496 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.64 vs. limit=6.0 2023-10-13 16:54:42,312 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.28 vs. limit=22.5 2023-10-13 16:54:55,046 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1443311.3333333333, ans=0.0 2023-10-13 16:54:56,849 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1443311.3333333333, ans=0.125 2023-10-13 16:54:58,383 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1443311.3333333333, ans=0.125 2023-10-13 16:54:59,472 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1443358.0, ans=0.2 2023-10-13 16:55:38,160 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.496e+02 1.797e+02 1.951e+02 2.101e+02 2.636e+02, threshold=3.903e+02, percent-clipped=0.0 2023-10-13 16:55:46,705 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.36 vs. 
limit=15.0 2023-10-13 16:56:01,917 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1443544.6666666667, ans=15.0 2023-10-13 16:56:02,519 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1443591.3333333333, ans=0.125 2023-10-13 16:56:38,358 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1443684.6666666667, ans=0.0 2023-10-13 16:56:46,356 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.00 vs. limit=15.0 2023-10-13 16:56:56,226 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1443778.0, ans=0.0 2023-10-13 16:56:59,670 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1443778.0, ans=0.125 2023-10-13 16:57:02,771 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 16:57:02,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1443778.0, ans=0.125 2023-10-13 16:57:26,795 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.52 vs. limit=12.0 2023-10-13 16:57:32,160 INFO [train.py:1031] (0/4) Epoch 23, batch 9000, loss[loss=0.1847, simple_loss=0.273, pruned_loss=0.04818, over 15593.00 frames. ], tot_loss[loss=0.188, simple_loss=0.2799, pruned_loss=0.04798, over 32442614.12 frames. ], batch size: 35, lr: 1.49e-03, grad_scale: 16.0 2023-10-13 16:57:32,385 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1443918.0, ans=0.125 2023-10-13 16:57:36,490 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1443918.0, ans=0.1 2023-10-13 16:57:42,435 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1443964.6666666667, ans=0.0 2023-10-13 16:57:47,064 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.559e+02 1.798e+02 1.968e+02 2.304e+02 3.237e+02, threshold=3.935e+02, percent-clipped=0.0 2023-10-13 16:58:08,242 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1444058.0, ans=0.125 2023-10-13 16:58:18,220 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=7.63 vs. 
limit=15.0 2023-10-13 16:58:24,367 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1444104.6666666667, ans=0.125 2023-10-13 16:58:30,866 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1444104.6666666667, ans=0.1 2023-10-13 16:59:22,909 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1444338.0, ans=0.125 2023-10-13 16:59:47,098 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1444431.3333333333, ans=0.0 2023-10-13 16:59:48,781 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.415e+02 1.732e+02 1.938e+02 2.179e+02 3.070e+02, threshold=3.875e+02, percent-clipped=0.0 2023-10-13 16:59:57,176 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1444478.0, ans=0.0 2023-10-13 17:00:04,660 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1444478.0, ans=0.125 2023-10-13 17:00:09,355 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=1444524.6666666667, ans=0.02 2023-10-13 17:00:14,164 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1444524.6666666667, ans=0.125 2023-10-13 17:00:16,916 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1444524.6666666667, ans=0.0 2023-10-13 17:00:48,305 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.10 vs. limit=15.0 2023-10-13 17:01:15,001 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1444758.0, ans=0.0 2023-10-13 17:01:27,484 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1444804.6666666667, ans=0.0 2023-10-13 17:01:55,330 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.805e+02 1.966e+02 2.223e+02 3.098e+02, threshold=3.931e+02, percent-clipped=0.0 2023-10-13 17:02:08,232 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1444944.6666666667, ans=0.125 2023-10-13 17:02:28,608 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1445038.0, ans=0.09899494936611666 2023-10-13 17:02:38,964 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1445038.0, ans=0.04949747468305833 2023-10-13 17:02:39,228 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=13.95 vs. 
limit=15.0 2023-10-13 17:02:47,311 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 17:03:04,680 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1445178.0, ans=0.125 2023-10-13 17:03:19,323 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1445224.6666666667, ans=0.125 2023-10-13 17:03:37,727 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.90 vs. limit=15.0 2023-10-13 17:03:48,416 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1445364.6666666667, ans=0.0 2023-10-13 17:03:52,338 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1445364.6666666667, ans=0.95 2023-10-13 17:03:52,397 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1445364.6666666667, ans=0.1 2023-10-13 17:03:54,018 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.500e+02 1.797e+02 1.968e+02 2.169e+02 2.957e+02, threshold=3.935e+02, percent-clipped=0.0 2023-10-13 17:04:13,057 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1445458.0, ans=0.0 2023-10-13 17:04:18,698 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1445458.0, ans=0.1 2023-10-13 17:04:33,039 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1445504.6666666667, ans=0.125 2023-10-13 17:04:35,566 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.99 vs. limit=15.0 2023-10-13 17:04:41,999 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1445551.3333333333, ans=0.0 2023-10-13 17:04:45,666 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.52 vs. limit=22.5 2023-10-13 17:05:01,879 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.21 vs. limit=15.0 2023-10-13 17:05:05,984 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.27 vs. limit=15.0 2023-10-13 17:05:08,491 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.85 vs. limit=15.0 2023-10-13 17:05:08,583 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.08 vs. 
limit=15.0 2023-10-13 17:05:40,985 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1445738.0, ans=0.125 2023-10-13 17:06:12,401 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.579e+02 1.807e+02 1.980e+02 2.218e+02 3.235e+02, threshold=3.959e+02, percent-clipped=0.0 2023-10-13 17:06:57,411 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1445971.3333333333, ans=0.2 2023-10-13 17:07:24,932 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.88 vs. limit=15.0 2023-10-13 17:07:42,447 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1446111.3333333333, ans=0.125 2023-10-13 17:08:08,800 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=1.013e-02 2023-10-13 17:08:17,131 INFO [train.py:1031] (0/4) Epoch 23, batch 9500, loss[loss=0.1927, simple_loss=0.2976, pruned_loss=0.04391, over 16829.00 frames. ], tot_loss[loss=0.1885, simple_loss=0.2806, pruned_loss=0.04818, over 32560749.07 frames. ], batch size: 175, lr: 1.49e-03, grad_scale: 16.0 2023-10-13 17:08:33,186 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1446298.0, ans=0.2 2023-10-13 17:08:37,431 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.569e+02 1.889e+02 2.062e+02 2.205e+02 3.305e+02, threshold=4.124e+02, percent-clipped=0.0 2023-10-13 17:08:37,848 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1446298.0, ans=0.2 2023-10-13 17:08:49,571 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.56 vs. limit=12.0 2023-10-13 17:08:54,759 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.65 vs. limit=15.0 2023-10-13 17:09:00,532 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.00 vs. limit=15.0 2023-10-13 17:09:23,690 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1446484.6666666667, ans=0.2 2023-10-13 17:09:33,854 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=1446531.3333333333, ans=0.5 2023-10-13 17:09:45,652 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.14 vs. 
limit=15.0 2023-10-13 17:09:49,899 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1446578.0, ans=0.5 2023-10-13 17:09:58,385 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1446624.6666666667, ans=0.1 2023-10-13 17:10:24,426 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1446718.0, ans=0.2 2023-10-13 17:10:38,442 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.568e+02 1.832e+02 1.959e+02 2.197e+02 3.091e+02, threshold=3.917e+02, percent-clipped=0.0 2023-10-13 17:10:48,967 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1446811.3333333333, ans=0.1 2023-10-13 17:11:25,833 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1446904.6666666667, ans=0.125 2023-10-13 17:11:35,572 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1446951.3333333333, ans=0.125 2023-10-13 17:11:42,739 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1446998.0, ans=0.0 2023-10-13 17:11:43,931 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1446998.0, ans=0.125 2023-10-13 17:12:10,420 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1447091.3333333333, ans=10.0 2023-10-13 17:12:19,151 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1447091.3333333333, ans=0.125 2023-10-13 17:12:25,223 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.43 vs. 
limit=15.0 2023-10-13 17:12:50,750 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.467e+02 1.833e+02 1.997e+02 2.164e+02 3.370e+02, threshold=3.994e+02, percent-clipped=0.0 2023-10-13 17:12:52,046 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1447231.3333333333, ans=0.125 2023-10-13 17:13:14,334 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 17:13:23,501 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1447371.3333333333, ans=0.125 2023-10-13 17:13:24,332 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1447371.3333333333, ans=0.2 2023-10-13 17:13:33,787 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1447418.0, ans=0.1 2023-10-13 17:13:41,399 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1447464.6666666667, ans=0.125 2023-10-13 17:13:50,679 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1447464.6666666667, ans=0.125 2023-10-13 17:13:53,948 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1447464.6666666667, ans=0.125 2023-10-13 17:14:00,571 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1447511.3333333333, ans=0.125 2023-10-13 17:14:04,076 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1447511.3333333333, ans=0.125 2023-10-13 17:14:20,285 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1447558.0, ans=0.0 2023-10-13 17:14:20,335 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1447558.0, ans=0.0 2023-10-13 17:14:20,476 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.57 vs. limit=15.0 2023-10-13 17:14:22,300 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1447558.0, ans=0.125 2023-10-13 17:14:22,424 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1447558.0, ans=0.2 2023-10-13 17:14:23,412 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.81 vs. 
limit=12.0 2023-10-13 17:14:29,326 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1447604.6666666667, ans=0.0 2023-10-13 17:14:46,543 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1447698.0, ans=0.2 2023-10-13 17:14:53,598 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.508e+02 1.786e+02 1.944e+02 2.128e+02 3.330e+02, threshold=3.889e+02, percent-clipped=0.0 2023-10-13 17:15:26,273 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1447838.0, ans=0.2 2023-10-13 17:15:41,947 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=1447884.6666666667, ans=0.05 2023-10-13 17:16:00,468 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1447931.3333333333, ans=0.125 2023-10-13 17:16:10,971 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1447978.0, ans=0.025 2023-10-13 17:16:14,239 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=1447978.0, ans=0.025 2023-10-13 17:16:18,984 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1448024.6666666667, ans=0.1 2023-10-13 17:16:58,534 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.579e+02 1.753e+02 1.894e+02 2.074e+02 2.901e+02, threshold=3.788e+02, percent-clipped=0.0 2023-10-13 17:18:29,642 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.18 vs. limit=15.0 2023-10-13 17:18:37,171 INFO [train.py:1031] (0/4) Epoch 23, batch 10000, loss[loss=0.1809, simple_loss=0.266, pruned_loss=0.04795, over 15613.00 frames. ], tot_loss[loss=0.1876, simple_loss=0.2795, pruned_loss=0.04784, over 32575780.19 frames. 
], batch size: 35, lr: 1.49e-03, grad_scale: 32.0 2023-10-13 17:18:40,712 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1448584.6666666667, ans=0.125 2023-10-13 17:18:44,740 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1448584.6666666667, ans=0.125 2023-10-13 17:18:46,376 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1448584.6666666667, ans=0.1 2023-10-13 17:18:54,657 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.758e+02 1.928e+02 2.143e+02 3.649e+02, threshold=3.856e+02, percent-clipped=0.0 2023-10-13 17:18:57,858 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1448631.3333333333, ans=0.07 2023-10-13 17:18:57,947 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1448631.3333333333, ans=0.2 2023-10-13 17:19:15,895 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1448724.6666666667, ans=0.2 2023-10-13 17:19:41,883 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1448818.0, ans=15.0 2023-10-13 17:19:42,329 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1448818.0, ans=0.125 2023-10-13 17:20:02,759 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.56 vs. 
limit=15.0 2023-10-13 17:20:10,981 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1448911.3333333333, ans=0.125 2023-10-13 17:20:32,708 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1449004.6666666667, ans=0.125 2023-10-13 17:20:38,396 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1449051.3333333333, ans=0.1 2023-10-13 17:20:59,203 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.559e+02 1.838e+02 1.989e+02 2.270e+02 3.130e+02, threshold=3.977e+02, percent-clipped=0.0 2023-10-13 17:21:14,211 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1449144.6666666667, ans=0.1 2023-10-13 17:21:37,915 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1449238.0, ans=0.125 2023-10-13 17:21:51,361 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1449331.3333333333, ans=0.1 2023-10-13 17:22:16,598 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1449424.6666666667, ans=0.0 2023-10-13 17:22:21,720 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1449424.6666666667, ans=0.0 2023-10-13 17:22:24,061 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1449424.6666666667, ans=0.0 2023-10-13 17:22:30,996 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1449471.3333333333, ans=0.125 2023-10-13 17:22:59,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1449564.6666666667, ans=0.125 2023-10-13 17:23:00,737 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1449564.6666666667, ans=0.025 2023-10-13 17:23:02,537 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.829e+02 1.983e+02 2.185e+02 2.990e+02, threshold=3.966e+02, percent-clipped=0.0 2023-10-13 17:23:32,298 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.08 vs. limit=15.0 2023-10-13 17:23:34,092 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1449704.6666666667, ans=0.125 2023-10-13 17:24:08,151 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1449798.0, ans=0.0 2023-10-13 17:24:24,664 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.08 vs. 
limit=15.0 2023-10-13 17:24:28,159 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1449844.6666666667, ans=0.125 2023-10-13 17:24:32,094 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1449844.6666666667, ans=0.0 2023-10-13 17:24:39,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1449891.3333333333, ans=0.0 2023-10-13 17:24:55,855 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1449938.0, ans=0.0 2023-10-13 17:25:05,209 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1449984.6666666667, ans=0.125 2023-10-13 17:25:19,883 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1450031.3333333333, ans=0.1 2023-10-13 17:25:25,247 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.464e+02 1.855e+02 2.046e+02 2.260e+02 4.458e+02, threshold=4.092e+02, percent-clipped=1.0 2023-10-13 17:25:29,708 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.52 vs. limit=12.0 2023-10-13 17:25:30,520 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1450078.0, ans=0.0 2023-10-13 17:25:39,290 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.13 vs. limit=15.0 2023-10-13 17:25:56,112 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1450171.3333333333, ans=0.125 2023-10-13 17:26:03,416 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 17:26:05,534 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1450171.3333333333, ans=0.0 2023-10-13 17:26:16,588 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1450218.0, ans=0.0 2023-10-13 17:26:19,482 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1450264.6666666667, ans=0.0 2023-10-13 17:26:21,810 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten.whitening_limit, batch_count=1450264.6666666667, ans=15.0 2023-10-13 17:26:32,713 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1450311.3333333333, ans=0.125 2023-10-13 17:26:37,026 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1450311.3333333333, ans=0.125 2023-10-13 17:26:49,857 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.50 vs. 
limit=15.0 2023-10-13 17:26:53,397 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 17:26:55,188 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1450358.0, ans=0.125 2023-10-13 17:27:10,484 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1450451.3333333333, ans=0.125 2023-10-13 17:27:14,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1450451.3333333333, ans=0.5 2023-10-13 17:27:15,424 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1450451.3333333333, ans=0.2 2023-10-13 17:27:30,927 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 1.766e+02 1.929e+02 2.042e+02 2.650e+02, threshold=3.858e+02, percent-clipped=0.0 2023-10-13 17:27:44,539 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1450544.6666666667, ans=0.125 2023-10-13 17:27:51,568 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.80 vs. limit=10.0 2023-10-13 17:28:04,987 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1450638.0, ans=0.125 2023-10-13 17:28:08,083 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1450638.0, ans=0.125 2023-10-13 17:28:08,108 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1450638.0, ans=0.07 2023-10-13 17:28:11,285 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.72 vs. limit=15.0 2023-10-13 17:28:13,365 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1450684.6666666667, ans=0.1 2023-10-13 17:28:14,114 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=1450684.6666666667, ans=10.0 2023-10-13 17:28:17,799 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.35 vs. limit=15.0 2023-10-13 17:28:27,656 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1450731.3333333333, ans=0.125 2023-10-13 17:28:28,744 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1450731.3333333333, ans=0.2 2023-10-13 17:28:30,757 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1450731.3333333333, ans=0.125 2023-10-13 17:28:45,351 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1450778.0, ans=0.0 2023-10-13 17:28:49,158 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.65 vs. 
limit=15.0 2023-10-13 17:29:07,302 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.45 vs. limit=8.0 2023-10-13 17:29:08,826 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1450871.3333333333, ans=0.025 2023-10-13 17:29:18,820 INFO [train.py:1031] (0/4) Epoch 23, batch 10500, loss[loss=0.2282, simple_loss=0.308, pruned_loss=0.07423, over 16580.00 frames. ], tot_loss[loss=0.188, simple_loss=0.2801, pruned_loss=0.048, over 32624338.52 frames. ], batch size: 266, lr: 1.49e-03, grad_scale: 16.0 2023-10-13 17:29:37,040 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.537e+02 1.822e+02 2.028e+02 2.260e+02 3.465e+02, threshold=4.055e+02, percent-clipped=0.0 2023-10-13 17:29:43,908 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=1451011.3333333333, ans=0.05 2023-10-13 17:29:46,628 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1451011.3333333333, ans=0.125 2023-10-13 17:30:08,438 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1451058.0, ans=0.0 2023-10-13 17:30:20,249 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1451151.3333333333, ans=0.125 2023-10-13 17:30:25,485 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1451151.3333333333, ans=0.0 2023-10-13 17:30:26,503 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1451151.3333333333, ans=0.125 2023-10-13 17:30:32,424 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 17:30:32,543 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1451151.3333333333, ans=0.0 2023-10-13 17:30:35,661 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1451198.0, ans=0.0 2023-10-13 17:30:53,382 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1451244.6666666667, ans=0.125 2023-10-13 17:31:19,145 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.37 vs. limit=15.0 2023-10-13 17:31:29,528 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.85 vs. 
limit=22.5 2023-10-13 17:31:31,130 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1451338.0, ans=0.0 2023-10-13 17:31:34,720 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1451384.6666666667, ans=0.1 2023-10-13 17:31:52,099 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1451431.3333333333, ans=0.04949747468305833 2023-10-13 17:31:53,657 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.586e+02 1.808e+02 1.990e+02 2.117e+02 2.887e+02, threshold=3.980e+02, percent-clipped=0.0 2023-10-13 17:31:55,182 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.32 vs. limit=15.0 2023-10-13 17:31:56,837 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1451478.0, ans=0.07 2023-10-13 17:32:20,611 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1451524.6666666667, ans=0.0 2023-10-13 17:33:19,935 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1451758.0, ans=0.0 2023-10-13 17:33:22,066 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.92 vs. limit=22.5 2023-10-13 17:33:23,021 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.14 vs. limit=15.0 2023-10-13 17:33:40,864 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1451804.6666666667, ans=0.2 2023-10-13 17:33:42,689 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1451804.6666666667, ans=0.0 2023-10-13 17:33:43,713 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1451804.6666666667, ans=0.125 2023-10-13 17:33:54,098 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1451851.3333333333, ans=0.0 2023-10-13 17:34:13,672 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.447e+02 1.796e+02 1.902e+02 2.096e+02 2.558e+02, threshold=3.804e+02, percent-clipped=0.0 2023-10-13 17:34:32,046 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1451991.3333333333, ans=0.125 2023-10-13 17:34:34,005 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1451991.3333333333, ans=0.0 2023-10-13 17:35:10,941 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1452131.3333333333, ans=0.125 2023-10-13 17:35:24,149 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1452131.3333333333, ans=0.125 2023-10-13 17:35:31,319 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1452178.0, ans=0.125 2023-10-13 17:35:34,587 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1452178.0, ans=0.125 2023-10-13 17:35:37,863 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1452178.0, ans=0.125 2023-10-13 17:36:18,553 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1452364.6666666667, ans=0.125 2023-10-13 17:36:20,431 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1452364.6666666667, ans=0.2 2023-10-13 17:36:27,694 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.583e+02 1.870e+02 2.074e+02 2.380e+02 3.317e+02, threshold=4.149e+02, percent-clipped=0.0 2023-10-13 17:36:54,520 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1452504.6666666667, ans=0.1 2023-10-13 17:36:56,441 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1452504.6666666667, ans=0.0 2023-10-13 17:36:58,920 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.70 vs. limit=22.5 2023-10-13 17:37:06,781 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1452551.3333333333, ans=0.125 2023-10-13 17:37:08,670 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1452551.3333333333, ans=0.125 2023-10-13 17:37:12,503 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1452551.3333333333, ans=0.1 2023-10-13 17:37:16,751 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.49 vs. 
limit=15.0 2023-10-13 17:37:38,883 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1452644.6666666667, ans=0.0 2023-10-13 17:37:45,118 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 17:37:49,674 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1452691.3333333333, ans=0.125 2023-10-13 17:37:49,687 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1452691.3333333333, ans=0.1 2023-10-13 17:38:01,193 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1452738.0, ans=0.2 2023-10-13 17:38:14,746 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1452784.6666666667, ans=0.2 2023-10-13 17:38:27,153 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1452831.3333333333, ans=0.125 2023-10-13 17:38:28,186 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1452831.3333333333, ans=0.125 2023-10-13 17:38:32,341 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1452831.3333333333, ans=0.2 2023-10-13 17:38:34,610 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.713e+02 1.903e+02 2.158e+02 3.027e+02, threshold=3.805e+02, percent-clipped=0.0 2023-10-13 17:39:42,156 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1453111.3333333333, ans=0.0 2023-10-13 17:39:55,027 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1453158.0, ans=0.125 2023-10-13 17:39:59,552 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.66 vs. limit=15.0 2023-10-13 17:40:01,910 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1453158.0, ans=0.1 2023-10-13 17:40:19,918 INFO [train.py:1031] (0/4) Epoch 23, batch 11000, loss[loss=0.1756, simple_loss=0.2674, pruned_loss=0.04185, over 16605.00 frames. ], tot_loss[loss=0.188, simple_loss=0.2802, pruned_loss=0.04792, over 32687872.60 frames. 
], batch size: 61, lr: 1.49e-03, grad_scale: 16.0 2023-10-13 17:40:30,468 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1453298.0, ans=0.1 2023-10-13 17:40:40,766 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.542e+02 1.737e+02 1.956e+02 2.114e+02 2.751e+02, threshold=3.913e+02, percent-clipped=0.0 2023-10-13 17:40:43,234 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1453344.6666666667, ans=0.1 2023-10-13 17:40:50,486 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1453344.6666666667, ans=0.1 2023-10-13 17:41:19,559 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1453438.0, ans=0.1 2023-10-13 17:41:29,502 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1453484.6666666667, ans=0.125 2023-10-13 17:41:41,553 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1453531.3333333333, ans=0.125 2023-10-13 17:41:50,902 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1453578.0, ans=0.125 2023-10-13 17:42:05,331 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1453624.6666666667, ans=0.1 2023-10-13 17:42:45,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1453764.6666666667, ans=0.0 2023-10-13 17:42:54,963 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.468e+02 1.794e+02 1.906e+02 2.126e+02 2.639e+02, threshold=3.811e+02, percent-clipped=0.0 2023-10-13 17:42:59,364 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1453811.3333333333, ans=0.2 2023-10-13 17:43:12,895 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1453858.0, ans=0.1 2023-10-13 17:43:23,453 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.62 vs. 
limit=12.0
2023-10-13 17:43:34,497 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=1453951.3333333333, ans=0.025
2023-10-13 17:44:05,351 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1454044.6666666667, ans=0.125
2023-10-13 17:44:23,289 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1454091.3333333333, ans=0.1
2023-10-13 17:44:32,664 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1454138.0, ans=0.125
2023-10-13 17:44:50,014 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1454184.6666666667, ans=0.125
2023-10-13 17:44:53,690 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=1454184.6666666667, ans=0.05
2023-10-13 17:44:55,223 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1454184.6666666667, ans=0.125
2023-10-13 17:45:04,702 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1454231.3333333333, ans=0.125
2023-10-13 17:45:06,270 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.405e+02 1.722e+02 1.876e+02 2.078e+02 2.779e+02, threshold=3.751e+02, percent-clipped=0.0
2023-10-13 17:45:16,130 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1454278.0, ans=0.95
2023-10-13 17:45:16,953 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1454278.0, ans=0.2
2023-10-13 17:45:29,265 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1454324.6666666667, ans=0.0
2023-10-13 17:45:39,097 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1454371.3333333333, ans=0.125
2023-10-13 17:45:43,879 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1454418.0, ans=0.0
2023-10-13 17:45:58,888 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1454464.6666666667, ans=0.0
2023-10-13 17:46:03,465 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1454464.6666666667, ans=0.1
2023-10-13 17:46:06,492 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1454464.6666666667, ans=0.125
2023-10-13 17:46:20,548 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1454511.3333333333, ans=0.1
2023-10-13 17:47:10,664 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1454698.0, ans=0.1
2023-10-13 17:47:17,150 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1454698.0, ans=0.125
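[Note on the recurring scaling.py:199 records: each one reports a ScheduledFloat, a module hyperparameter whose current value ("ans") is looked up from a schedule keyed on the global batch_count. A minimal sketch of such a schedule follows; the piecewise-linear form and the breakpoints are illustrative assumptions, not values read from this run.]

```python
# Sketch of a batch-count-keyed schedule, matching the shape of the
# "ScheduledFloat: name=..., batch_count=..., ans=..." records above.
# The breakpoints below are illustrative assumptions, not from this run.

class PiecewiseLinearSchedule:
    """Interpolates a float linearly between (batch_count, value) points."""

    def __init__(self, *points):
        self.points = sorted(points)  # (batch_count, value) pairs

    def __call__(self, batch_count: float) -> float:
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        if batch_count >= pts[-1][0]:
            return pts[-1][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)

# e.g. a skip rate that decays from 0.2 to 0.0 over the first 4000 batches
conv_skip_rate = PiecewiseLinearSchedule((0.0, 0.2), (4000.0, 0.0))
print(conv_skip_rate(1454698.0))  # 0.0: far past the last breakpoint
```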
2023-10-13 17:47:24,942 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.440e+02 1.783e+02 1.972e+02 2.199e+02 2.949e+02, threshold=3.944e+02, percent-clipped=0.0
2023-10-13 17:47:27,391 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1454744.6666666667, ans=0.1
2023-10-13 17:47:50,728 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1454791.3333333333, ans=0.2
2023-10-13 17:47:51,742 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1454791.3333333333, ans=0.5
2023-10-13 17:48:27,046 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1454931.3333333333, ans=0.125
2023-10-13 17:48:30,016 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1454978.0, ans=10.0
2023-10-13 17:48:34,931 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1454978.0, ans=0.125
2023-10-13 17:48:37,323 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.87 vs. limit=22.5
2023-10-13 17:48:57,308 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1455071.3333333333, ans=0.125
2023-10-13 17:49:12,639 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1455118.0, ans=0.0
2023-10-13 17:49:28,006 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1455164.6666666667, ans=15.0
2023-10-13 17:49:31,659 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.480e+02 1.771e+02 1.956e+02 2.311e+02 3.174e+02, threshold=3.912e+02, percent-clipped=0.0
2023-10-13 17:49:36,513 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1455211.3333333333, ans=0.125
2023-10-13 17:49:44,544 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1455211.3333333333, ans=0.125
2023-10-13 17:49:47,150 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.83 vs. limit=15.0
2023-10-13 17:49:59,754 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.86 vs. limit=10.0
2023-10-13 17:50:03,343 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1455304.6666666667, ans=0.125
2023-10-13 17:50:22,031 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1455351.3333333333, ans=0.0
2023-10-13 17:51:19,827 INFO [train.py:1031] (0/4) Epoch 23, batch 11500, loss[loss=0.1769, simple_loss=0.2757, pruned_loss=0.03912, over 16289.00 frames. ], tot_loss[loss=0.1875, simple_loss=0.2797, pruned_loss=0.04769, over 32675725.94 frames. ], batch size: 50, lr: 1.49e-03, grad_scale: 32.0
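[Note on the optim.py:471 records: each reports the min/25%/50%/75%/max of recent gradient norms plus a clipping threshold, and in every record in this section the threshold equals Clipping_scale times the middle quartile (for example 2.0 * 1.972e+02 = 3.944e+02 just above). The sketch below reproduces that bookkeeping under the assumption that clipping rescales the gradient down to the threshold; it is an illustration, not the exact icefall implementation.]

```python
import torch

# Sketch of quartile-based clipping consistent with the optim.py:471 records:
# keep a window of recent gradient norms, report their 0/25/50/75/100
# percentiles, and clip at clipping_scale * median. In the records above,
# threshold = 2.0 * the middle quartile (e.g. 2.0 * 1.972e+02 = 3.944e+02).

def clip_grad_by_quartile(params, recent_norms, clipping_scale=2.0):
    grads = [p.grad.reshape(-1) for p in params if p.grad is not None]
    norm = torch.cat(grads).norm().item()
    recent_norms.append(norm)
    quartiles = torch.quantile(
        torch.tensor(recent_norms),
        torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * quartiles[2].item()
    if norm > threshold:  # rescale so the applied gradient norm == threshold
        for p in params:
            if p.grad is not None:
                p.grad.mul_(threshold / norm)
    return quartiles, threshold
```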
2023-10-13 17:51:32,579 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1455631.3333333333, ans=0.0
2023-10-13 17:51:41,765 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.588e+02 1.944e+02 2.132e+02 2.376e+02 3.898e+02, threshold=4.264e+02, percent-clipped=0.0
2023-10-13 17:51:43,034 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-13 17:51:44,859 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1455678.0, ans=0.0
2023-10-13 17:51:56,404 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1455724.6666666667, ans=0.1
2023-10-13 17:52:04,383 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1455724.6666666667, ans=0.125
2023-10-13 17:52:10,168 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1455771.3333333333, ans=0.2
2023-10-13 17:52:23,405 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1455818.0, ans=0.125
2023-10-13 17:52:41,792 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1455864.6666666667, ans=0.125
2023-10-13 17:52:46,031 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1455911.3333333333, ans=0.0
2023-10-13 17:53:11,706 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-312000.pt
2023-10-13 17:53:27,014 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1456004.6666666667, ans=0.1
2023-10-13 17:53:53,051 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.556e+02 1.812e+02 1.990e+02 2.231e+02 3.222e+02, threshold=3.980e+02, percent-clipped=0.0
2023-10-13 17:53:59,597 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1456144.6666666667, ans=0.1
2023-10-13 17:54:22,310 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1456238.0, ans=0.2
2023-10-13 17:54:24,386 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1456238.0, ans=0.0
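[Note on the scaling.py:979 Whitening records: each compares a per-module statistic against a limit, and a constraint presumably engages only when the metric exceeds the limit, which is why most records here (e.g. metric=3.37 vs. limit=15.0 below) look benign. The sketch below assumes the metric measures covariance anisotropy, equal to 1.0 for perfectly white features; the definition actually used in scaling.py may differ.]

```python
import torch

# Hedged sketch of a whitening diagnostic shaped like the scaling.py:979
# records. Assumption: the metric is an anisotropy ratio of the per-group
# feature covariance (1.0 when all eigenvalues are equal, i.e. the features
# are already white); the metric actually used in scaling.py may differ.

def whitening_metric(x: torch.Tensor, num_groups: int) -> float:
    num_frames, num_channels = x.shape
    x = x.reshape(num_frames, num_groups, num_channels // num_groups)
    x = x - x.mean(dim=0, keepdim=True)
    worst = 0.0
    for g in range(num_groups):
        feats = x[:, g, :]
        cov = feats.T @ feats / num_frames
        eigs = torch.linalg.eigvalsh(cov)
        # mean of squared eigenvalues over squared mean eigenvalue: >= 1.0,
        # larger when variance is concentrated in a few directions
        worst = max(worst, (eigs.pow(2).mean() / eigs.mean().pow(2)).item())
    return worst

# A penalty would engage only when this exceeds the configured limit.
x = torch.randn(1000, 384)
print(whitening_metric(x, num_groups=1))  # close to 1 for random features
```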
2023-10-13 17:54:56,462 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.37 vs. limit=15.0
2023-10-13 17:54:59,184 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1456378.0, ans=0.07
2023-10-13 17:55:15,360 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1456424.6666666667, ans=0.0
2023-10-13 17:55:28,637 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1456471.3333333333, ans=0.0
2023-10-13 17:55:38,127 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1456518.0, ans=0.09899494936611666
2023-10-13 17:55:44,179 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.04 vs. limit=15.0
2023-10-13 17:55:48,621 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1456564.6666666667, ans=0.125
2023-10-13 17:55:52,305 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.515e+02 1.817e+02 1.994e+02 2.287e+02 3.245e+02, threshold=3.988e+02, percent-clipped=0.0
2023-10-13 17:56:00,281 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1456611.3333333333, ans=0.125
2023-10-13 17:56:16,982 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1456658.0, ans=0.0
2023-10-13 17:56:24,562 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1456704.6666666667, ans=15.0
2023-10-13 17:56:41,628 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1456751.3333333333, ans=0.09899494936611666
2023-10-13 17:56:41,884 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.12 vs. limit=15.0
2023-10-13 17:56:44,093 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1456751.3333333333, ans=0.0
2023-10-13 17:57:02,389 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0
2023-10-13 17:57:46,863 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1456984.6666666667, ans=0.125
2023-10-13 17:57:47,208 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.55 vs.
limit=15.0 2023-10-13 17:57:47,652 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1456984.6666666667, ans=0.1 2023-10-13 17:57:51,563 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1456984.6666666667, ans=0.07 2023-10-13 17:57:53,521 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1456984.6666666667, ans=0.0 2023-10-13 17:58:02,263 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1457031.3333333333, ans=0.125 2023-10-13 17:58:11,264 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.726e+02 1.960e+02 2.165e+02 3.007e+02, threshold=3.919e+02, percent-clipped=0.0 2023-10-13 17:58:22,410 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1457078.0, ans=0.1 2023-10-13 17:58:30,246 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.67 vs. limit=22.5 2023-10-13 17:58:39,384 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1457171.3333333333, ans=0.0 2023-10-13 17:59:01,209 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1457218.0, ans=0.2 2023-10-13 17:59:07,758 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1457264.6666666667, ans=0.0 2023-10-13 17:59:55,576 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1457451.3333333333, ans=0.025 2023-10-13 18:00:20,836 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.399e+02 1.776e+02 1.885e+02 2.090e+02 2.844e+02, threshold=3.770e+02, percent-clipped=0.0 2023-10-13 18:00:55,757 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.65 vs. limit=15.0 2023-10-13 18:01:21,367 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1457731.3333333333, ans=0.125 2023-10-13 18:01:39,879 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1457778.0, ans=0.125 2023-10-13 18:01:56,273 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.44 vs. limit=15.0 2023-10-13 18:01:56,793 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1457824.6666666667, ans=0.125 2023-10-13 18:02:16,800 INFO [train.py:1031] (0/4) Epoch 23, batch 12000, loss[loss=0.1882, simple_loss=0.276, pruned_loss=0.05024, over 15570.00 frames. ], tot_loss[loss=0.1875, simple_loss=0.28, pruned_loss=0.04756, over 32713958.72 frames. 
], batch size: 35, lr: 1.49e-03, grad_scale: 32.0 2023-10-13 18:02:21,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1457918.0, ans=0.125 2023-10-13 18:02:22,520 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.23 vs. limit=22.5 2023-10-13 18:02:43,392 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.440e+02 1.821e+02 1.990e+02 2.270e+02 2.834e+02, threshold=3.980e+02, percent-clipped=0.0 2023-10-13 18:02:46,841 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1458011.3333333333, ans=0.125 2023-10-13 18:03:08,514 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.47 vs. limit=15.0 2023-10-13 18:03:11,541 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1458058.0, ans=0.125 2023-10-13 18:04:14,325 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1458291.3333333333, ans=0.0 2023-10-13 18:04:28,873 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.85 vs. limit=15.0 2023-10-13 18:04:31,069 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1458384.6666666667, ans=0.2 2023-10-13 18:04:38,820 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1458384.6666666667, ans=0.125 2023-10-13 18:04:44,246 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1458431.3333333333, ans=0.1 2023-10-13 18:04:46,990 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1458431.3333333333, ans=0.0 2023-10-13 18:04:50,072 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.74 vs. limit=12.0 2023-10-13 18:04:51,384 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.472e+02 1.690e+02 1.823e+02 2.028e+02 3.202e+02, threshold=3.646e+02, percent-clipped=0.0 2023-10-13 18:04:57,412 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.99 vs. 
limit=15.0 2023-10-13 18:04:58,921 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1458478.0, ans=0.125 2023-10-13 18:05:28,423 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1458571.3333333333, ans=15.0 2023-10-13 18:05:50,771 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1458664.6666666667, ans=0.1 2023-10-13 18:05:58,987 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1458711.3333333333, ans=0.125 2023-10-13 18:06:03,437 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1458711.3333333333, ans=0.125 2023-10-13 18:06:08,924 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1458758.0, ans=0.0 2023-10-13 18:06:09,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1458758.0, ans=0.1 2023-10-13 18:06:33,209 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1458851.3333333333, ans=0.05 2023-10-13 18:06:38,199 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.65 vs. limit=22.5 2023-10-13 18:06:47,595 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1458898.0, ans=0.025 2023-10-13 18:06:51,040 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1458898.0, ans=0.125 2023-10-13 18:06:54,564 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.459e+02 1.777e+02 1.943e+02 2.086e+02 2.933e+02, threshold=3.886e+02, percent-clipped=0.0 2023-10-13 18:06:58,994 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1458944.6666666667, ans=0.125 2023-10-13 18:07:02,713 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1458944.6666666667, ans=0.0 2023-10-13 18:07:06,064 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1458944.6666666667, ans=0.125 2023-10-13 18:07:13,617 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1458991.3333333333, ans=0.0 2023-10-13 18:07:17,378 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1458991.3333333333, ans=0.0 2023-10-13 18:07:43,706 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.87 vs. 
limit=12.0 2023-10-13 18:07:48,561 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1459084.6666666667, ans=0.1 2023-10-13 18:08:23,225 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1459224.6666666667, ans=0.125 2023-10-13 18:08:32,807 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1459271.3333333333, ans=0.125 2023-10-13 18:08:39,601 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1459318.0, ans=0.125 2023-10-13 18:08:41,002 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1459318.0, ans=0.0 2023-10-13 18:08:41,831 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1459318.0, ans=0.125 2023-10-13 18:08:41,901 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1459318.0, ans=0.125 2023-10-13 18:08:46,572 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1459318.0, ans=0.1 2023-10-13 18:09:00,128 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.488e+02 1.840e+02 2.088e+02 2.354e+02 3.662e+02, threshold=4.176e+02, percent-clipped=0.0 2023-10-13 18:09:09,564 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1459411.3333333333, ans=0.0 2023-10-13 18:09:13,867 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1459458.0, ans=0.125 2023-10-13 18:09:15,816 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1459458.0, ans=0.2 2023-10-13 18:09:41,449 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 18:09:42,684 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.84 vs. limit=15.0 2023-10-13 18:10:05,676 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1459644.6666666667, ans=0.2 2023-10-13 18:10:11,154 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1459644.6666666667, ans=0.125 2023-10-13 18:10:43,752 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1459784.6666666667, ans=0.125 2023-10-13 18:10:53,190 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1459784.6666666667, ans=0.125 2023-10-13 18:11:00,439 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1459831.3333333333, ans=0.0 2023-10-13 18:11:01,928 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.72 vs. 
limit=15.0 2023-10-13 18:11:03,280 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.76 vs. limit=15.0 2023-10-13 18:11:05,033 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.90 vs. limit=15.0 2023-10-13 18:11:07,078 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.565e+02 1.812e+02 1.961e+02 2.176e+02 3.430e+02, threshold=3.922e+02, percent-clipped=0.0 2023-10-13 18:11:17,916 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1459878.0, ans=0.125 2023-10-13 18:11:27,231 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1459924.6666666667, ans=0.125 2023-10-13 18:11:38,084 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1459971.3333333333, ans=0.125 2023-10-13 18:11:46,036 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1459971.3333333333, ans=0.2 2023-10-13 18:12:17,703 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1460111.3333333333, ans=0.1 2023-10-13 18:12:21,055 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1460111.3333333333, ans=0.0 2023-10-13 18:12:34,531 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1460158.0, ans=0.125 2023-10-13 18:12:52,063 INFO [train.py:1031] (0/4) Epoch 23, batch 12500, loss[loss=0.1804, simple_loss=0.2748, pruned_loss=0.04306, over 16735.00 frames. ], tot_loss[loss=0.1873, simple_loss=0.2797, pruned_loss=0.04752, over 32757783.31 frames. 
], batch size: 202, lr: 1.48e-03, grad_scale: 32.0 2023-10-13 18:13:15,336 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.462e+02 1.739e+02 1.875e+02 2.029e+02 2.568e+02, threshold=3.749e+02, percent-clipped=0.0 2023-10-13 18:13:17,044 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1460344.6666666667, ans=0.125 2023-10-13 18:13:17,119 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1460344.6666666667, ans=0.125 2023-10-13 18:13:42,274 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1460438.0, ans=0.0 2023-10-13 18:13:57,426 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1460484.6666666667, ans=0.125 2023-10-13 18:13:59,738 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1460484.6666666667, ans=0.0 2023-10-13 18:14:17,349 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1460578.0, ans=0.125 2023-10-13 18:14:42,799 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 18:15:00,508 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1460718.0, ans=0.0 2023-10-13 18:15:14,711 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.539e+02 1.810e+02 1.919e+02 2.200e+02 3.662e+02, threshold=3.838e+02, percent-clipped=0.0 2023-10-13 18:15:15,330 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1460811.3333333333, ans=0.1 2023-10-13 18:15:30,320 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1460858.0, ans=0.0 2023-10-13 18:15:31,725 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.40 vs. 
limit=22.5 2023-10-13 18:15:46,692 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1460904.6666666667, ans=0.125 2023-10-13 18:15:51,194 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1460904.6666666667, ans=0.2 2023-10-13 18:16:05,383 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1460998.0, ans=0.2 2023-10-13 18:16:44,003 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1461091.3333333333, ans=0.0 2023-10-13 18:16:46,460 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1461138.0, ans=0.125 2023-10-13 18:16:55,744 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 18:17:01,077 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1461184.6666666667, ans=0.0 2023-10-13 18:17:03,005 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1461184.6666666667, ans=0.0 2023-10-13 18:17:19,866 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.476e+02 1.782e+02 1.997e+02 2.272e+02 3.404e+02, threshold=3.993e+02, percent-clipped=0.0 2023-10-13 18:17:21,229 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1461278.0, ans=0.125 2023-10-13 18:17:24,422 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1461278.0, ans=0.125 2023-10-13 18:17:27,619 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1461278.0, ans=0.125 2023-10-13 18:17:39,086 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1461324.6666666667, ans=0.0 2023-10-13 18:17:39,160 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1461324.6666666667, ans=0.1 2023-10-13 18:17:43,669 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1461371.3333333333, ans=0.0 2023-10-13 18:17:47,799 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1461371.3333333333, ans=0.0 2023-10-13 18:17:52,829 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1461371.3333333333, ans=0.125 2023-10-13 18:18:41,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1461558.0, ans=0.125 2023-10-13 18:18:47,820 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1461604.6666666667, ans=0.0 2023-10-13 18:18:48,589 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1461604.6666666667, ans=0.1 2023-10-13 18:18:56,143 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, 
batch_count=1461651.3333333333, ans=0.0 2023-10-13 18:19:08,546 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1461698.0, ans=0.0 2023-10-13 18:19:18,917 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.780e+02 2.022e+02 2.250e+02 3.099e+02, threshold=4.044e+02, percent-clipped=0.0 2023-10-13 18:19:23,834 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.26 vs. limit=6.0 2023-10-13 18:19:26,663 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1461744.6666666667, ans=0.0 2023-10-13 18:20:07,248 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1461884.6666666667, ans=0.0 2023-10-13 18:20:16,863 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.63 vs. limit=22.5 2023-10-13 18:20:49,659 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1462071.3333333333, ans=0.125 2023-10-13 18:21:02,672 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.83 vs. limit=12.0 2023-10-13 18:21:02,751 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.63 vs. limit=10.0 2023-10-13 18:21:13,482 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1462164.6666666667, ans=0.125 2023-10-13 18:21:14,980 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1462164.6666666667, ans=0.0 2023-10-13 18:21:21,897 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.714e+02 1.898e+02 2.134e+02 3.154e+02, threshold=3.797e+02, percent-clipped=0.0 2023-10-13 18:21:36,552 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.40 vs. limit=15.0 2023-10-13 18:21:40,705 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1462258.0, ans=0.2 2023-10-13 18:21:53,915 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1462304.6666666667, ans=0.0 2023-10-13 18:22:02,963 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1462351.3333333333, ans=0.125 2023-10-13 18:22:07,084 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1462398.0, ans=0.1 2023-10-13 18:22:08,826 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1462398.0, ans=0.2 2023-10-13 18:22:14,387 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1462398.0, ans=0.125 2023-10-13 18:22:54,081 INFO [train.py:1031] (0/4) Epoch 23, batch 13000, loss[loss=0.1896, simple_loss=0.279, pruned_loss=0.0501, over 16835.00 frames. 
], tot_loss[loss=0.1879, simple_loss=0.2804, pruned_loss=0.0477, over 32781383.84 frames. ], batch size: 188, lr: 1.48e-03, grad_scale: 32.0 2023-10-13 18:23:04,497 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.11 vs. limit=6.0 2023-10-13 18:23:11,108 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1462631.3333333333, ans=0.125 2023-10-13 18:23:16,985 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.436e+02 1.745e+02 1.904e+02 2.101e+02 2.810e+02, threshold=3.808e+02, percent-clipped=0.0 2023-10-13 18:23:49,728 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1462771.3333333333, ans=0.125 2023-10-13 18:24:38,831 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.45 vs. limit=22.5 2023-10-13 18:24:59,342 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1463004.6666666667, ans=0.125 2023-10-13 18:25:02,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1463004.6666666667, ans=0.125 2023-10-13 18:25:05,155 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1463051.3333333333, ans=0.125 2023-10-13 18:25:14,336 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1463051.3333333333, ans=0.125 2023-10-13 18:25:15,341 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1463051.3333333333, ans=0.125 2023-10-13 18:25:27,740 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1463098.0, ans=0.0 2023-10-13 18:25:28,742 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.425e+02 1.826e+02 2.078e+02 2.393e+02 3.197e+02, threshold=4.156e+02, percent-clipped=0.0 2023-10-13 18:25:35,777 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1463144.6666666667, ans=0.07 2023-10-13 18:25:58,456 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.16 vs. 
limit=15.0 2023-10-13 18:26:07,327 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1463238.0, ans=0.125 2023-10-13 18:26:21,920 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1463331.3333333333, ans=0.125 2023-10-13 18:26:51,433 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1463424.6666666667, ans=10.0 2023-10-13 18:27:09,922 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 18:27:24,234 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=1463518.0, ans=10.0 2023-10-13 18:27:43,748 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.447e+02 1.774e+02 1.939e+02 2.173e+02 2.766e+02, threshold=3.879e+02, percent-clipped=0.0 2023-10-13 18:27:44,568 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.46 vs. limit=5.0 2023-10-13 18:27:53,989 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_ff3.min_abs, batch_count=1463611.3333333333, ans=0.2 2023-10-13 18:28:07,950 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.98 vs. limit=15.0 2023-10-13 18:28:26,770 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1463751.3333333333, ans=0.1 2023-10-13 18:28:28,373 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1463751.3333333333, ans=0.125 2023-10-13 18:28:28,482 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1463751.3333333333, ans=0.2 2023-10-13 18:28:34,551 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1463798.0, ans=0.0 2023-10-13 18:28:34,949 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.01 vs. 
limit=15.0 2023-10-13 18:28:37,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1463798.0, ans=0.2 2023-10-13 18:28:59,178 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=1463891.3333333333, ans=0.05 2023-10-13 18:29:06,713 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1463891.3333333333, ans=0.125 2023-10-13 18:29:22,521 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1463938.0, ans=0.05 2023-10-13 18:29:37,561 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1464031.3333333333, ans=0.0 2023-10-13 18:29:50,733 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.422e+02 1.736e+02 1.970e+02 2.207e+02 3.243e+02, threshold=3.940e+02, percent-clipped=0.0 2023-10-13 18:30:19,170 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1464171.3333333333, ans=0.125 2023-10-13 18:30:25,673 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1464218.0, ans=0.125 2023-10-13 18:30:35,832 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.41 vs. limit=22.5 2023-10-13 18:30:37,570 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=1464264.6666666667, ans=10.0 2023-10-13 18:30:54,031 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1464311.3333333333, ans=0.1 2023-10-13 18:31:22,064 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1464404.6666666667, ans=0.125 2023-10-13 18:31:24,571 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1464451.3333333333, ans=0.2 2023-10-13 18:31:49,081 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.520e+02 1.784e+02 1.992e+02 2.226e+02 2.997e+02, threshold=3.984e+02, percent-clipped=0.0 2023-10-13 18:31:49,319 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1464544.6666666667, ans=0.0 2023-10-13 18:32:00,424 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.29 vs. limit=12.0 2023-10-13 18:32:01,030 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1464591.3333333333, ans=0.125 2023-10-13 18:32:25,131 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1464684.6666666667, ans=0.1 2023-10-13 18:32:25,505 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.81 vs. 
limit=22.5 2023-10-13 18:32:40,816 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1464731.3333333333, ans=0.125 2023-10-13 18:32:45,784 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1464731.3333333333, ans=0.1 2023-10-13 18:33:04,036 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.29 vs. limit=15.0 2023-10-13 18:33:09,218 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1464871.3333333333, ans=0.07 2023-10-13 18:33:10,539 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.09 vs. limit=15.0 2023-10-13 18:33:13,954 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1464871.3333333333, ans=0.125 2023-10-13 18:33:21,677 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1464918.0, ans=0.125 2023-10-13 18:33:22,236 INFO [train.py:1031] (0/4) Epoch 23, batch 13500, loss[loss=0.193, simple_loss=0.2821, pruned_loss=0.05196, over 16469.00 frames. ], tot_loss[loss=0.1874, simple_loss=0.2797, pruned_loss=0.04751, over 32798463.25 frames. ], batch size: 266, lr: 1.48e-03, grad_scale: 16.0 2023-10-13 18:33:26,236 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.99 vs. limit=15.0 2023-10-13 18:33:43,727 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.90 vs. limit=22.5 2023-10-13 18:33:43,922 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.428e+02 1.793e+02 1.924e+02 2.202e+02 3.691e+02, threshold=3.848e+02, percent-clipped=0.0 2023-10-13 18:33:44,252 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1465011.3333333333, ans=0.1 2023-10-13 18:33:50,303 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1465011.3333333333, ans=0.0 2023-10-13 18:34:04,851 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. 
limit=6.0 2023-10-13 18:34:13,113 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1465104.6666666667, ans=0.125 2023-10-13 18:34:14,150 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1465104.6666666667, ans=0.2 2023-10-13 18:34:23,679 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1465151.3333333333, ans=0.125 2023-10-13 18:34:50,956 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1465244.6666666667, ans=0.035 2023-10-13 18:35:27,960 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1465384.6666666667, ans=0.0 2023-10-13 18:35:30,097 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1465384.6666666667, ans=0.0 2023-10-13 18:35:38,418 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1465431.3333333333, ans=0.125 2023-10-13 18:35:45,160 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.507e+02 1.799e+02 1.943e+02 2.187e+02 3.695e+02, threshold=3.886e+02, percent-clipped=0.0 2023-10-13 18:35:50,393 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1465478.0, ans=0.1 2023-10-13 18:36:01,106 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=1465524.6666666667, ans=0.05 2023-10-13 18:36:15,942 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1465618.0, ans=0.125 2023-10-13 18:36:21,044 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/epoch-23.pt 2023-10-13 18:36:50,100 INFO [train.py:1031] (0/4) Epoch 24, batch 0, loss[loss=0.1741, simple_loss=0.266, pruned_loss=0.04111, over 15301.00 frames. ], tot_loss[loss=0.1741, simple_loss=0.266, pruned_loss=0.04111, over 15301.00 frames. ], batch size: 35, lr: 1.45e-03, grad_scale: 32.0 2023-10-13 18:36:50,102 INFO [train.py:1054] (0/4) Computing validation loss 2023-10-13 18:36:58,757 INFO [train.py:1063] (0/4) Epoch 24, validation: loss=0.2142, simple_loss=0.3011, pruned_loss=0.06363, over 1020973.00 frames. 2023-10-13 18:36:58,758 INFO [train.py:1064] (0/4) Maximum memory allocated so far is 17165MB 2023-10-13 18:37:13,292 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 18:37:17,316 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1465688.0, ans=0.0 2023-10-13 18:37:19,462 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1465688.0, ans=0.1 2023-10-13 18:37:20,620 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.63 vs. 
limit=15.0 2023-10-13 18:37:22,069 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1465688.0, ans=0.2 2023-10-13 18:37:43,615 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1465781.3333333333, ans=0.125 2023-10-13 18:38:03,775 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1465874.6666666667, ans=0.125 2023-10-13 18:38:09,393 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1465874.6666666667, ans=0.125 2023-10-13 18:38:13,801 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.47 vs. limit=12.0 2023-10-13 18:38:17,823 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1465921.3333333333, ans=0.1 2023-10-13 18:38:17,827 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1465921.3333333333, ans=0.0 2023-10-13 18:38:20,858 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.806e+02 1.945e+02 2.264e+02 4.686e+02, threshold=3.890e+02, percent-clipped=3.0 2023-10-13 18:38:46,192 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.38 vs. limit=15.0 2023-10-13 18:38:46,922 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1466014.6666666667, ans=0.1 2023-10-13 18:39:20,386 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1466154.6666666667, ans=0.125 2023-10-13 18:39:34,538 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.39 vs. limit=15.0 2023-10-13 18:39:35,824 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1466201.3333333333, ans=0.125 2023-10-13 18:39:41,696 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1466248.0, ans=10.0 2023-10-13 18:39:44,401 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1466248.0, ans=0.125 2023-10-13 18:40:25,619 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.481e+02 1.727e+02 1.913e+02 2.109e+02 3.118e+02, threshold=3.826e+02, percent-clipped=0.0 2023-10-13 18:40:37,471 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1466434.6666666667, ans=0.125 2023-10-13 18:40:48,641 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1466481.3333333333, ans=0.125 2023-10-13 18:40:49,801 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.40 vs. 
limit=22.5 2023-10-13 18:41:01,619 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1466528.0, ans=6.0 2023-10-13 18:41:03,356 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1466528.0, ans=0.125 2023-10-13 18:41:19,494 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.47 vs. limit=15.0 2023-10-13 18:41:51,349 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1466714.6666666667, ans=0.1 2023-10-13 18:41:55,922 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1466714.6666666667, ans=0.0 2023-10-13 18:41:56,207 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.01 vs. limit=15.0 2023-10-13 18:42:09,413 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.14 vs. limit=15.0 2023-10-13 18:42:35,703 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1466854.6666666667, ans=0.2 2023-10-13 18:42:36,361 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.527e+02 1.868e+02 2.065e+02 2.377e+02 3.183e+02, threshold=4.129e+02, percent-clipped=0.0 2023-10-13 18:42:45,612 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1466901.3333333333, ans=0.1 2023-10-13 18:42:46,578 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1466901.3333333333, ans=0.0 2023-10-13 18:43:01,877 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff3.min_abs, batch_count=1466948.0, ans=0.2 2023-10-13 18:43:36,193 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 18:43:46,592 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1467088.0, ans=0.0 2023-10-13 18:44:17,616 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1467228.0, ans=0.07 2023-10-13 18:44:18,632 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1467228.0, ans=0.125 2023-10-13 18:44:31,344 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_ff3.min_abs, batch_count=1467274.6666666667, ans=0.2 2023-10-13 18:44:52,777 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1467321.3333333333, ans=0.1 2023-10-13 18:44:54,745 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.350e+02 1.781e+02 1.937e+02 2.175e+02 2.924e+02, threshold=3.875e+02, percent-clipped=0.0 2023-10-13 18:44:56,101 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1467321.3333333333, ans=0.0 2023-10-13 18:44:58,777 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1467368.0, ans=0.1 2023-10-13 18:45:04,393 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1467368.0, ans=0.0 2023-10-13 18:45:09,023 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1467414.6666666667, ans=0.125 2023-10-13 18:45:14,118 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.03 vs. limit=22.5 2023-10-13 18:45:20,384 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=13.60 vs. limit=15.0 2023-10-13 18:45:22,793 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1467461.3333333333, ans=0.0 2023-10-13 18:45:39,994 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1467508.0, ans=0.125 2023-10-13 18:45:41,819 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1467508.0, ans=0.125 2023-10-13 18:46:05,973 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1467648.0, ans=0.1 2023-10-13 18:46:07,471 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1467648.0, ans=0.1 2023-10-13 18:46:11,821 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1467648.0, ans=0.125 2023-10-13 18:46:41,909 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1467741.3333333333, ans=0.0 2023-10-13 18:46:57,788 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.787e+02 1.967e+02 2.229e+02 3.511e+02, threshold=3.935e+02, percent-clipped=0.0 2023-10-13 18:47:22,329 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1467881.3333333333, ans=0.125 2023-10-13 18:47:25,727 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=1467881.3333333333, ans=0.1 2023-10-13 18:47:35,233 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.63 vs. limit=15.0 2023-10-13 18:47:37,358 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1467928.0, ans=0.125 2023-10-13 18:47:40,024 INFO [train.py:1031] (0/4) Epoch 24, batch 500, loss[loss=0.1747, simple_loss=0.2672, pruned_loss=0.04112, over 16833.00 frames. ], tot_loss[loss=0.1884, simple_loss=0.2806, pruned_loss=0.04807, over 7292648.22 frames. 
], batch size: 146, lr: 1.45e-03, grad_scale: 16.0 2023-10-13 18:47:52,600 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1468021.3333333333, ans=0.0 2023-10-13 18:47:59,568 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1468021.3333333333, ans=0.1 2023-10-13 18:48:01,287 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=1468021.3333333333, ans=0.05 2023-10-13 18:48:08,407 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1468068.0, ans=0.125 2023-10-13 18:48:22,080 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=5.08 vs. limit=12.0 2023-10-13 18:48:30,460 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1468161.3333333333, ans=0.125 2023-10-13 18:48:35,452 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1468161.3333333333, ans=0.04949747468305833 2023-10-13 18:48:55,549 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1468254.6666666667, ans=0.0 2023-10-13 18:49:01,969 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.26 vs. limit=15.0 2023-10-13 18:49:02,503 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.467e+02 1.802e+02 2.023e+02 2.284e+02 3.778e+02, threshold=4.047e+02, percent-clipped=0.0 2023-10-13 18:49:39,317 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1468394.6666666667, ans=0.2 2023-10-13 18:49:49,244 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=8.82 vs. limit=12.0 2023-10-13 18:50:18,484 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1468581.3333333333, ans=0.0 2023-10-13 18:50:24,331 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1468581.3333333333, ans=0.125 2023-10-13 18:50:33,476 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1468628.0, ans=0.1 2023-10-13 18:50:45,895 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.90 vs. limit=15.0 2023-10-13 18:50:49,350 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.60 vs. 
limit=15.0 2023-10-13 18:50:57,033 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1468721.3333333333, ans=0.2 2023-10-13 18:50:59,411 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1468721.3333333333, ans=0.0 2023-10-13 18:51:01,892 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.614e+02 1.816e+02 1.990e+02 2.224e+02 3.389e+02, threshold=3.980e+02, percent-clipped=0.0 2023-10-13 18:51:33,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1468861.3333333333, ans=0.125 2023-10-13 18:51:34,798 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1468861.3333333333, ans=0.125 2023-10-13 18:51:39,847 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1468861.3333333333, ans=0.125 2023-10-13 18:52:01,721 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1468954.6666666667, ans=0.025 2023-10-13 18:52:26,634 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1469048.0, ans=0.125 2023-10-13 18:52:30,775 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1469048.0, ans=0.125 2023-10-13 18:52:40,261 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.60 vs. limit=15.0 2023-10-13 18:53:00,416 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.85 vs. limit=15.0 2023-10-13 18:53:07,038 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.816e+02 1.990e+02 2.236e+02 2.946e+02, threshold=3.980e+02, percent-clipped=0.0 2023-10-13 18:53:16,576 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1469234.6666666667, ans=0.1 2023-10-13 18:53:18,810 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.89 vs. limit=15.0 2023-10-13 18:53:25,098 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1469281.3333333333, ans=0.2 2023-10-13 18:53:33,231 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.36 vs. 
limit=15.0 2023-10-13 18:53:41,929 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1469328.0, ans=0.125 2023-10-13 18:53:43,758 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=1469328.0, ans=0.1 2023-10-13 18:53:52,664 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1469374.6666666667, ans=0.1 2023-10-13 18:54:01,027 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1469374.6666666667, ans=0.125 2023-10-13 18:54:36,361 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1469514.6666666667, ans=0.125 2023-10-13 18:54:36,488 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1469514.6666666667, ans=0.0 2023-10-13 18:54:47,165 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1469561.3333333333, ans=0.0 2023-10-13 18:54:58,035 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1469608.0, ans=0.2 2023-10-13 18:55:00,751 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1469608.0, ans=0.0 2023-10-13 18:55:06,677 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1469608.0, ans=0.125 2023-10-13 18:55:14,216 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=15.22 vs. 
limit=22.5 2023-10-13 18:55:16,861 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1469654.6666666667, ans=0.2 2023-10-13 18:55:21,540 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1469654.6666666667, ans=0.0 2023-10-13 18:55:24,562 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.539e+02 1.782e+02 1.952e+02 2.079e+02 2.989e+02, threshold=3.904e+02, percent-clipped=0.0 2023-10-13 18:55:24,901 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1469654.6666666667, ans=0.035 2023-10-13 18:55:28,881 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1469701.3333333333, ans=0.125 2023-10-13 18:55:31,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1469701.3333333333, ans=0.1 2023-10-13 18:55:38,645 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1469748.0, ans=0.125 2023-10-13 18:55:50,523 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1469748.0, ans=0.125 2023-10-13 18:56:06,351 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1469841.3333333333, ans=0.125 2023-10-13 18:56:11,774 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1469841.3333333333, ans=0.0 2023-10-13 18:56:24,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1469888.0, ans=0.125 2023-10-13 18:56:35,337 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.61 vs. limit=15.0 2023-10-13 18:56:40,949 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1469981.3333333333, ans=0.125 2023-10-13 18:57:11,265 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1470074.6666666667, ans=0.2 2023-10-13 18:57:18,059 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.20 vs. limit=22.5 2023-10-13 18:57:26,341 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.783e+02 1.926e+02 2.116e+02 2.800e+02, threshold=3.851e+02, percent-clipped=0.0 2023-10-13 18:57:58,955 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1470261.3333333333, ans=0.0 2023-10-13 18:58:06,519 INFO [train.py:1031] (0/4) Epoch 24, batch 1000, loss[loss=0.1843, simple_loss=0.2498, pruned_loss=0.05938, over 12599.00 frames. ], tot_loss[loss=0.1889, simple_loss=0.2812, pruned_loss=0.0483, over 12949971.03 frames. ], batch size: 440, lr: 1.45e-03, grad_scale: 16.0 2023-10-13 18:58:23,880 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.41 vs. limit=22.5 2023-10-13 18:58:59,368 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.94 vs. 
limit=15.0 2023-10-13 18:59:05,125 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1470541.3333333333, ans=0.5 2023-10-13 18:59:15,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1470588.0, ans=0.125 2023-10-13 18:59:20,495 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.496e+02 1.824e+02 2.058e+02 2.355e+02 3.143e+02, threshold=4.115e+02, percent-clipped=0.0 2023-10-13 18:59:21,613 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1470634.6666666667, ans=0.015 2023-10-13 18:59:23,816 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1470634.6666666667, ans=0.125 2023-10-13 18:59:25,918 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1470634.6666666667, ans=0.1 2023-10-13 18:59:29,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1470634.6666666667, ans=0.125 2023-10-13 18:59:30,247 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1470634.6666666667, ans=0.0 2023-10-13 18:59:34,908 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1470681.3333333333, ans=0.0 2023-10-13 18:59:35,158 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.87 vs. limit=15.0 2023-10-13 18:59:37,741 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1470681.3333333333, ans=0.0 2023-10-13 19:00:10,043 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1470821.3333333333, ans=0.1 2023-10-13 19:00:27,538 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.69 vs. limit=22.5 2023-10-13 19:00:33,973 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.21 vs. 
limit=22.5 2023-10-13 19:00:43,345 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1470961.3333333333, ans=0.0 2023-10-13 19:01:03,977 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1471008.0, ans=0.2 2023-10-13 19:01:03,988 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1471008.0, ans=0.0 2023-10-13 19:01:21,039 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.441e+02 1.814e+02 2.002e+02 2.279e+02 3.571e+02, threshold=4.005e+02, percent-clipped=0.0 2023-10-13 19:01:35,927 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1471101.3333333333, ans=0.125 2023-10-13 19:01:52,482 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 19:02:58,484 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.94 vs. limit=12.0 2023-10-13 19:03:10,446 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1471474.6666666667, ans=0.1 2023-10-13 19:03:22,455 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.440e+02 1.750e+02 1.969e+02 2.177e+02 2.925e+02, threshold=3.938e+02, percent-clipped=0.0 2023-10-13 19:03:30,375 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1471568.0, ans=0.1 2023-10-13 19:03:57,611 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.79 vs. limit=6.0 2023-10-13 19:04:49,015 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1471941.3333333333, ans=0.2 2023-10-13 19:05:08,931 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.532e+02 1.803e+02 2.027e+02 2.199e+02 3.657e+02, threshold=4.055e+02, percent-clipped=0.0 2023-10-13 19:05:20,381 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1472034.6666666667, ans=0.125 2023-10-13 19:05:24,768 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1472081.3333333333, ans=0.2 2023-10-13 19:05:25,833 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1472081.3333333333, ans=0.2 2023-10-13 19:05:31,174 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.11 vs. limit=10.0 2023-10-13 19:05:40,244 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1472128.0, ans=0.125 2023-10-13 19:05:43,139 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.12 vs. 
limit=15.0 2023-10-13 19:05:47,041 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1472174.6666666667, ans=0.125 2023-10-13 19:06:00,178 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1472174.6666666667, ans=0.125 2023-10-13 19:06:04,895 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1472221.3333333333, ans=0.2 2023-10-13 19:06:14,174 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1472221.3333333333, ans=0.125 2023-10-13 19:06:24,303 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1472268.0, ans=0.1 2023-10-13 19:06:27,066 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-10-13 19:06:41,734 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.18 vs. limit=15.0 2023-10-13 19:06:48,527 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.73 vs. limit=15.0 2023-10-13 19:07:11,935 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1472454.6666666667, ans=0.2 2023-10-13 19:07:14,595 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.785e+02 1.963e+02 2.244e+02 3.189e+02, threshold=3.927e+02, percent-clipped=0.0 2023-10-13 19:07:26,418 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1472501.3333333333, ans=0.125 2023-10-13 19:07:31,358 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1472548.0, ans=0.125 2023-10-13 19:07:43,521 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1472594.6666666667, ans=0.0 2023-10-13 19:07:43,528 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1472594.6666666667, ans=0.125 2023-10-13 19:07:53,605 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1472641.3333333333, ans=0.2 2023-10-13 19:07:55,287 INFO [train.py:1031] (0/4) Epoch 24, batch 1500, loss[loss=0.1984, simple_loss=0.2867, pruned_loss=0.05508, over 16781.00 frames. ], tot_loss[loss=0.187, simple_loss=0.279, pruned_loss=0.04752, over 17330766.10 frames. 
], batch size: 175, lr: 1.45e-03, grad_scale: 32.0 2023-10-13 19:08:08,800 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1472688.0, ans=0.2 2023-10-13 19:08:08,860 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1472688.0, ans=0.125 2023-10-13 19:08:14,662 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1472688.0, ans=0.1 2023-10-13 19:08:14,921 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.81 vs. limit=22.5 2023-10-13 19:08:30,785 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1472781.3333333333, ans=0.125 2023-10-13 19:08:51,392 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.23 vs. limit=10.0 2023-10-13 19:08:52,702 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1472874.6666666667, ans=0.0 2023-10-13 19:08:59,688 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1472874.6666666667, ans=0.0 2023-10-13 19:09:15,006 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.860e+02 1.973e+02 2.294e+02 2.782e+02, threshold=3.947e+02, percent-clipped=0.0 2023-10-13 19:09:21,662 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1472968.0, ans=0.1 2023-10-13 19:09:22,667 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1472968.0, ans=0.2 2023-10-13 19:09:40,834 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.96 vs. limit=15.0 2023-10-13 19:10:26,650 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.12 vs. limit=22.5 2023-10-13 19:10:37,912 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1473248.0, ans=0.125 2023-10-13 19:10:52,091 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1473294.6666666667, ans=0.125 2023-10-13 19:10:57,846 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1473341.3333333333, ans=0.0 2023-10-13 19:11:16,805 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1473388.0, ans=0.0 2023-10-13 19:11:20,419 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 1.772e+02 1.963e+02 2.196e+02 3.044e+02, threshold=3.926e+02, percent-clipped=0.0 2023-10-13 19:11:22,107 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.60 vs. 
limit=15.0 2023-10-13 19:12:06,877 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1473621.3333333333, ans=0.125 2023-10-13 19:12:09,369 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1473621.3333333333, ans=0.125 2023-10-13 19:12:15,876 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=9.97 vs. limit=22.5 2023-10-13 19:12:37,164 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1473761.3333333333, ans=0.125 2023-10-13 19:13:01,884 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1473854.6666666667, ans=0.09899494936611666 2023-10-13 19:13:09,382 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.838e+02 1.969e+02 2.181e+02 3.183e+02, threshold=3.939e+02, percent-clipped=0.0 2023-10-13 19:13:16,500 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1473901.3333333333, ans=0.125 2023-10-13 19:13:21,515 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.66 vs. limit=22.5 2023-10-13 19:13:53,520 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1474041.3333333333, ans=0.5 2023-10-13 19:14:11,318 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.19 vs. limit=15.0 2023-10-13 19:14:12,561 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1474134.6666666667, ans=0.0 2023-10-13 19:14:16,870 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1474134.6666666667, ans=0.5 2023-10-13 19:14:20,407 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1474134.6666666667, ans=0.0 2023-10-13 19:14:28,571 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1474181.3333333333, ans=10.0 2023-10-13 19:14:39,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1474228.0, ans=0.015 2023-10-13 19:14:40,284 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1474228.0, ans=0.0 2023-10-13 19:15:03,359 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1474274.6666666667, ans=0.025 2023-10-13 19:15:18,105 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.507e+02 1.785e+02 2.024e+02 2.351e+02 3.117e+02, threshold=4.048e+02, percent-clipped=0.0 2023-10-13 19:15:41,080 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.39 vs. 
limit=15.0 2023-10-13 19:16:03,979 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.55 vs. limit=12.0 2023-10-13 19:16:04,062 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-10-13 19:16:14,608 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.51 vs. limit=22.5 2023-10-13 19:16:18,510 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1474601.3333333333, ans=0.1 2023-10-13 19:16:29,986 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1474648.0, ans=0.125 2023-10-13 19:16:31,410 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1474648.0, ans=0.125 2023-10-13 19:16:39,132 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=1474648.0, ans=10.0 2023-10-13 19:16:42,555 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.45 vs. limit=12.0 2023-10-13 19:16:50,986 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1474694.6666666667, ans=0.125 2023-10-13 19:17:25,130 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.530e+02 1.794e+02 2.001e+02 2.229e+02 3.169e+02, threshold=4.002e+02, percent-clipped=0.0 2023-10-13 19:17:27,599 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1474834.6666666667, ans=0.95 2023-10-13 19:17:37,704 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 19:17:41,726 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 19:17:50,871 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1474928.0, ans=0.125 2023-10-13 19:17:52,344 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1474928.0, ans=0.125 2023-10-13 19:18:03,004 INFO [train.py:1031] (0/4) Epoch 24, batch 2000, loss[loss=0.1934, simple_loss=0.2908, pruned_loss=0.04805, over 16495.00 frames. ], tot_loss[loss=0.1873, simple_loss=0.2796, pruned_loss=0.04745, over 20770251.34 frames. ], batch size: 266, lr: 1.44e-03, grad_scale: 32.0 2023-10-13 19:18:09,615 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1474974.6666666667, ans=0.1 2023-10-13 19:18:11,940 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.46 vs. 
limit=15.0 2023-10-13 19:18:15,032 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1475021.3333333333, ans=0.07 2023-10-13 19:18:51,562 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 19:19:02,553 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1475161.3333333333, ans=0.1 2023-10-13 19:19:07,472 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1475161.3333333333, ans=0.2 2023-10-13 19:19:08,787 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.55 vs. limit=22.5 2023-10-13 19:19:16,070 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=1475208.0, ans=0.95 2023-10-13 19:19:37,910 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1475301.3333333333, ans=0.1 2023-10-13 19:19:40,048 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.703e+02 1.840e+02 2.148e+02 2.927e+02, threshold=3.680e+02, percent-clipped=0.0 2023-10-13 19:19:51,926 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1475348.0, ans=0.1 2023-10-13 19:19:55,451 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1475348.0, ans=0.0 2023-10-13 19:20:03,800 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1475394.6666666667, ans=0.0 2023-10-13 19:20:11,409 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.07 vs. limit=15.0 2023-10-13 19:20:37,787 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1475488.0, ans=0.125 2023-10-13 19:21:31,146 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1475628.0, ans=0.125 2023-10-13 19:22:17,235 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.866e+02 2.031e+02 2.269e+02 3.413e+02, threshold=4.062e+02, percent-clipped=0.0 2023-10-13 19:22:18,742 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.50 vs. limit=15.0 2023-10-13 19:22:24,027 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 19:22:26,464 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1475814.6666666667, ans=0.125 2023-10-13 19:22:31,566 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.66 vs. 
limit=15.0 2023-10-13 19:22:36,571 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1475814.6666666667, ans=0.0 2023-10-13 19:23:21,617 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1476001.3333333333, ans=0.2 2023-10-13 19:23:21,624 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=1476001.3333333333, ans=0.025 2023-10-13 19:23:23,443 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1476001.3333333333, ans=0.1 2023-10-13 19:23:33,248 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.11 vs. limit=22.5 2023-10-13 19:23:51,134 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1476141.3333333333, ans=0.2 2023-10-13 19:24:16,198 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.536e+02 1.848e+02 1.982e+02 2.300e+02 3.427e+02, threshold=3.964e+02, percent-clipped=0.0 2023-10-13 19:24:39,071 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1476328.0, ans=0.0 2023-10-13 19:24:39,264 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=6.88 vs. limit=15.0 2023-10-13 19:24:59,272 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1476374.6666666667, ans=0.125 2023-10-13 19:25:04,195 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.63 vs. 
limit=15.0 2023-10-13 19:25:10,239 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1476421.3333333333, ans=0.0 2023-10-13 19:25:37,375 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1476561.3333333333, ans=0.125 2023-10-13 19:25:42,712 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1476561.3333333333, ans=0.04949747468305833 2023-10-13 19:25:45,520 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1476561.3333333333, ans=0.125 2023-10-13 19:25:48,126 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1476608.0, ans=0.125 2023-10-13 19:26:09,157 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1476654.6666666667, ans=0.5 2023-10-13 19:26:15,122 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.604e+02 1.818e+02 1.942e+02 2.194e+02 3.470e+02, threshold=3.884e+02, percent-clipped=0.0 2023-10-13 19:26:31,901 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1476748.0, ans=0.0 2023-10-13 19:26:57,831 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1476841.3333333333, ans=0.1 2023-10-13 19:27:00,493 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1476841.3333333333, ans=0.125 2023-10-13 19:27:08,220 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.10 vs. limit=6.0 2023-10-13 19:27:16,251 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1476934.6666666667, ans=0.125 2023-10-13 19:27:31,684 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1476981.3333333333, ans=0.2 2023-10-13 19:27:56,739 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1477074.6666666667, ans=0.1 2023-10-13 19:28:05,464 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.86 vs. limit=15.0 2023-10-13 19:28:10,822 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.905e+02 2.042e+02 2.240e+02 3.212e+02, threshold=4.084e+02, percent-clipped=0.0 2023-10-13 19:28:37,846 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1477261.3333333333, ans=0.0 2023-10-13 19:28:42,189 INFO [train.py:1031] (0/4) Epoch 24, batch 2500, loss[loss=0.1698, simple_loss=0.2585, pruned_loss=0.04057, over 16760.00 frames. ], tot_loss[loss=0.1879, simple_loss=0.2801, pruned_loss=0.04788, over 23434946.66 frames. 
], batch size: 67, lr: 1.44e-03, grad_scale: 32.0 2023-10-13 19:28:56,654 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1477354.6666666667, ans=0.0 2023-10-13 19:29:26,125 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1477448.0, ans=0.125 2023-10-13 19:29:28,221 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1477448.0, ans=0.0 2023-10-13 19:29:29,092 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1477494.6666666667, ans=0.125 2023-10-13 19:30:00,682 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1477588.0, ans=0.0 2023-10-13 19:30:08,543 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.568e+02 1.861e+02 1.964e+02 2.131e+02 3.068e+02, threshold=3.929e+02, percent-clipped=0.0 2023-10-13 19:30:11,642 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1477634.6666666667, ans=0.2 2023-10-13 19:30:26,156 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.10 vs. limit=15.0 2023-10-13 19:30:28,526 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.63 vs. limit=15.0 2023-10-13 19:30:43,352 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1477774.6666666667, ans=0.035 2023-10-13 19:30:51,221 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1477774.6666666667, ans=0.125 2023-10-13 19:30:54,393 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1477821.3333333333, ans=0.125 2023-10-13 19:31:00,220 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1477821.3333333333, ans=0.0 2023-10-13 19:31:03,624 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=13.27 vs. limit=15.0 2023-10-13 19:31:06,871 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1477868.0, ans=0.125 2023-10-13 19:31:07,077 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.60 vs. 
limit=15.0 2023-10-13 19:31:20,194 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1477914.6666666667, ans=0.125 2023-10-13 19:31:52,509 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1478054.6666666667, ans=0.2 2023-10-13 19:31:58,816 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1478054.6666666667, ans=0.125 2023-10-13 19:32:02,433 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1478054.6666666667, ans=0.1 2023-10-13 19:32:03,400 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1478054.6666666667, ans=0.125 2023-10-13 19:32:04,107 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1478101.3333333333, ans=0.125 2023-10-13 19:32:06,686 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.564e+02 1.824e+02 1.999e+02 2.180e+02 2.959e+02, threshold=3.997e+02, percent-clipped=0.0 2023-10-13 19:32:20,632 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1478148.0, ans=0.09899494936611666 2023-10-13 19:32:28,959 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.24 vs. limit=15.0 2023-10-13 19:32:34,310 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1478194.6666666667, ans=0.125 2023-10-13 19:32:49,731 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1478241.3333333333, ans=0.0 2023-10-13 19:32:58,268 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1478288.0, ans=0.125 2023-10-13 19:33:03,679 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1478288.0, ans=0.0 2023-10-13 19:33:16,568 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.40 vs. limit=15.0 2023-10-13 19:33:28,483 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1478381.3333333333, ans=0.125 2023-10-13 19:34:39,711 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1478521.3333333333, ans=0.1 2023-10-13 19:34:44,212 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1478568.0, ans=0.125 2023-10-13 19:34:44,899 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.785e+02 1.940e+02 2.134e+02 2.848e+02, threshold=3.881e+02, percent-clipped=0.0 2023-10-13 19:35:15,002 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1478661.3333333333, ans=0.125 2023-10-13 19:35:17,912 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.54 vs. 
limit=22.5 2023-10-13 19:35:32,657 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.52 vs. limit=15.0 2023-10-13 19:36:56,752 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1478988.0, ans=0.5 2023-10-13 19:37:04,624 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1478988.0, ans=0.05 2023-10-13 19:37:11,602 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1479034.6666666667, ans=0.1 2023-10-13 19:37:12,851 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.471e+02 1.775e+02 1.921e+02 2.192e+02 2.868e+02, threshold=3.842e+02, percent-clipped=0.0 2023-10-13 19:37:17,953 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.68 vs. limit=10.0 2023-10-13 19:37:40,291 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1479128.0, ans=0.125 2023-10-13 19:37:46,309 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.12 vs. limit=15.0 2023-10-13 19:37:51,310 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1479174.6666666667, ans=0.125 2023-10-13 19:37:51,391 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1479174.6666666667, ans=0.125 2023-10-13 19:38:08,990 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1479221.3333333333, ans=0.0 2023-10-13 19:38:09,810 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1479221.3333333333, ans=0.0 2023-10-13 19:38:20,872 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1479268.0, ans=0.0 2023-10-13 19:39:03,688 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1479454.6666666667, ans=0.125 2023-10-13 19:39:11,558 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1479454.6666666667, ans=0.0 2023-10-13 19:39:19,385 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.325e+02 1.765e+02 1.985e+02 2.204e+02 3.013e+02, threshold=3.970e+02, percent-clipped=0.0 2023-10-13 19:39:19,721 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1479501.3333333333, ans=0.125 2023-10-13 19:39:22,334 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 19:39:31,272 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1479548.0, ans=0.0 2023-10-13 19:39:31,377 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1479548.0, ans=0.125 2023-10-13 19:39:51,645 INFO [train.py:1031] (0/4) Epoch 24, batch 3000, loss[loss=0.179, 
simple_loss=0.2733, pruned_loss=0.04233, over 16957.00 frames. ], tot_loss[loss=0.1874, simple_loss=0.2792, pruned_loss=0.04775, over 25487587.68 frames. ], batch size: 165, lr: 1.44e-03, grad_scale: 16.0 2023-10-13 19:40:02,662 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1479688.0, ans=0.125 2023-10-13 19:40:09,167 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1479688.0, ans=0.0 2023-10-13 19:40:26,706 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1479781.3333333333, ans=0.2 2023-10-13 19:40:27,554 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1479781.3333333333, ans=0.0 2023-10-13 19:40:46,759 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.16 vs. limit=12.0 2023-10-13 19:40:51,425 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1479874.6666666667, ans=10.0 2023-10-13 19:41:11,909 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1479921.3333333333, ans=0.1 2023-10-13 19:41:18,991 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.592e+02 1.879e+02 2.110e+02 2.472e+02 3.194e+02, threshold=4.220e+02, percent-clipped=0.0 2023-10-13 19:41:28,302 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1480014.6666666667, ans=0.1 2023-10-13 19:41:28,423 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1480014.6666666667, ans=0.125 2023-10-13 19:41:44,199 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1480061.3333333333, ans=0.125 2023-10-13 19:42:25,316 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1480201.3333333333, ans=10.0 2023-10-13 19:42:38,359 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1480248.0, ans=0.0 2023-10-13 19:42:54,187 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.72 vs. 
limit=15.0 2023-10-13 19:43:04,480 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1480388.0, ans=0.125 2023-10-13 19:43:16,669 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1480434.6666666667, ans=0.0 2023-10-13 19:43:17,245 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.424e+02 1.837e+02 2.003e+02 2.305e+02 3.554e+02, threshold=4.006e+02, percent-clipped=0.0 2023-10-13 19:43:26,281 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1480481.3333333333, ans=0.1 2023-10-13 19:43:28,026 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1480481.3333333333, ans=0.125 2023-10-13 19:43:30,769 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.79 vs. limit=15.0 2023-10-13 19:43:43,498 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1480528.0, ans=0.125 2023-10-13 19:43:54,838 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1480574.6666666667, ans=0.1 2023-10-13 19:44:05,206 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.75 vs. limit=15.0 2023-10-13 19:44:37,077 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 19:44:41,226 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1480761.3333333333, ans=0.0 2023-10-13 19:44:50,534 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1480761.3333333333, ans=0.0 2023-10-13 19:45:10,126 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.11 vs. 
limit=15.0 2023-10-13 19:45:14,022 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1480854.6666666667, ans=0.0 2023-10-13 19:45:15,456 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1480854.6666666667, ans=0.2 2023-10-13 19:45:24,032 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1480901.3333333333, ans=0.0 2023-10-13 19:45:28,337 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.575e+02 1.886e+02 2.019e+02 2.408e+02 3.585e+02, threshold=4.037e+02, percent-clipped=0.0 2023-10-13 19:45:48,889 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1480948.0, ans=0.2 2023-10-13 19:45:52,456 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1480948.0, ans=0.0 2023-10-13 19:46:07,186 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1480994.6666666667, ans=0.0 2023-10-13 19:46:23,671 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.45 vs. limit=15.0 2023-10-13 19:46:28,946 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.41 vs. limit=22.5 2023-10-13 19:46:33,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1481134.6666666667, ans=0.0 2023-10-13 19:46:33,321 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1481134.6666666667, ans=0.0 2023-10-13 19:47:07,676 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1481274.6666666667, ans=0.0 2023-10-13 19:47:38,349 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.824e+02 1.979e+02 2.191e+02 3.024e+02, threshold=3.959e+02, percent-clipped=0.0 2023-10-13 19:48:12,477 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 19:48:20,174 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=5.28 vs. limit=15.0 2023-10-13 19:48:52,315 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1481648.0, ans=0.0 2023-10-13 19:49:02,672 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.87 vs. limit=15.0 2023-10-13 19:49:02,747 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-10-13 19:49:28,673 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.75 vs. 
limit=15.0 2023-10-13 19:49:46,792 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.390e+02 1.802e+02 1.967e+02 2.219e+02 3.023e+02, threshold=3.935e+02, percent-clipped=0.0 2023-10-13 19:50:08,559 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.82 vs. limit=15.0 2023-10-13 19:50:15,709 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.70 vs. limit=22.5 2023-10-13 19:50:18,818 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1481974.6666666667, ans=0.1 2023-10-13 19:50:19,468 INFO [train.py:1031] (0/4) Epoch 24, batch 3500, loss[loss=0.1936, simple_loss=0.2877, pruned_loss=0.0497, over 16869.00 frames. ], tot_loss[loss=0.1876, simple_loss=0.2792, pruned_loss=0.04797, over 27093086.41 frames. ], batch size: 87, lr: 1.44e-03, grad_scale: 32.0 2023-10-13 19:50:32,838 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1482021.3333333333, ans=0.125 2023-10-13 19:50:36,150 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.75 vs. limit=15.0 2023-10-13 19:50:40,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1482021.3333333333, ans=0.2 2023-10-13 19:50:44,659 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.95 vs. limit=15.0 2023-10-13 19:50:54,767 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.38 vs. limit=15.0 2023-10-13 19:51:18,539 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1482161.3333333333, ans=0.125 2023-10-13 19:51:18,588 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1482161.3333333333, ans=0.1 2023-10-13 19:51:18,759 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.59 vs. limit=6.0 2023-10-13 19:51:18,769 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.78 vs. limit=10.0 2023-10-13 19:51:32,648 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1482208.0, ans=0.125 2023-10-13 19:52:00,110 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.512e+02 1.908e+02 2.055e+02 2.316e+02 3.688e+02, threshold=4.109e+02, percent-clipped=0.0 2023-10-13 19:52:31,989 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1482394.6666666667, ans=0.125 2023-10-13 19:53:09,754 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.36 vs. 
limit=10.0 2023-10-13 19:53:42,850 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1482628.0, ans=0.125 2023-10-13 19:53:48,687 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1482674.6666666667, ans=0.0 2023-10-13 19:53:53,973 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1482674.6666666667, ans=0.1 2023-10-13 19:53:56,991 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1482674.6666666667, ans=0.0 2023-10-13 19:54:15,146 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.734e+02 1.847e+02 2.041e+02 2.770e+02, threshold=3.694e+02, percent-clipped=0.0 2023-10-13 19:54:45,758 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1482861.3333333333, ans=0.5 2023-10-13 19:54:59,713 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=7.90 vs. limit=22.5 2023-10-13 19:55:10,727 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1482954.6666666667, ans=0.0 2023-10-13 19:55:38,419 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.12 vs. limit=22.5 2023-10-13 19:56:08,271 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1483141.3333333333, ans=0.1 2023-10-13 19:56:29,505 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1483234.6666666667, ans=0.0 2023-10-13 19:56:31,413 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.434e+02 1.747e+02 1.864e+02 2.092e+02 2.829e+02, threshold=3.729e+02, percent-clipped=0.0 2023-10-13 19:57:07,471 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1483374.6666666667, ans=0.1 2023-10-13 19:57:09,560 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1483374.6666666667, ans=0.2 2023-10-13 19:57:22,578 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1483421.3333333333, ans=0.125 2023-10-13 19:57:39,127 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.61 vs. limit=6.0 2023-10-13 19:57:46,334 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.67 vs. limit=22.5 2023-10-13 19:57:47,804 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1483514.6666666667, ans=0.2 2023-10-13 19:57:50,520 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.88 vs. 
limit=15.0 2023-10-13 19:57:52,640 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1483514.6666666667, ans=0.1 2023-10-13 19:58:07,565 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.77 vs. limit=22.5 2023-10-13 19:58:43,080 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.480e+02 1.728e+02 1.851e+02 2.050e+02 3.109e+02, threshold=3.702e+02, percent-clipped=0.0 2023-10-13 19:58:43,429 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1483701.3333333333, ans=0.0 2023-10-13 19:58:51,280 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1483748.0, ans=0.2 2023-10-13 19:58:56,510 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=13.16 vs. limit=15.0 2023-10-13 19:59:44,628 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1483934.6666666667, ans=0.125 2023-10-13 19:59:47,944 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1483934.6666666667, ans=0.125 2023-10-13 19:59:58,095 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.64 vs. limit=12.0 2023-10-13 20:00:02,162 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1484028.0, ans=0.1 2023-10-13 20:00:06,727 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1484028.0, ans=0.125 2023-10-13 20:00:43,037 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.735e+02 1.880e+02 2.082e+02 3.056e+02, threshold=3.761e+02, percent-clipped=0.0 2023-10-13 20:00:44,854 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1484168.0, ans=0.0 2023-10-13 20:00:57,825 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1484214.6666666667, ans=0.125 2023-10-13 20:01:09,425 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.18 vs. limit=6.0 2023-10-13 20:01:14,182 INFO [train.py:1031] (0/4) Epoch 24, batch 4000, loss[loss=0.1923, simple_loss=0.2608, pruned_loss=0.06187, over 12396.00 frames. ], tot_loss[loss=0.1875, simple_loss=0.2789, pruned_loss=0.04803, over 28349113.94 frames. 
], batch size: 440, lr: 1.44e-03, grad_scale: 32.0 2023-10-13 20:01:18,100 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1484308.0, ans=0.0 2023-10-13 20:01:48,130 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1484401.3333333333, ans=0.125 2023-10-13 20:01:48,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1484401.3333333333, ans=0.1 2023-10-13 20:01:48,556 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.16 vs. limit=15.0 2023-10-13 20:01:51,738 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1484448.0, ans=0.1 2023-10-13 20:01:55,564 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1484448.0, ans=0.125 2023-10-13 20:01:59,665 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1484448.0, ans=0.125 2023-10-13 20:02:20,250 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1484541.3333333333, ans=0.125 2023-10-13 20:02:51,725 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.578e+02 1.845e+02 2.070e+02 2.232e+02 2.882e+02, threshold=4.140e+02, percent-clipped=0.0 2023-10-13 20:02:53,008 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1484634.6666666667, ans=0.125 2023-10-13 20:02:59,559 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1484681.3333333333, ans=0.125 2023-10-13 20:03:02,440 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1484681.3333333333, ans=0.0 2023-10-13 20:03:02,526 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1484681.3333333333, ans=0.0 2023-10-13 20:03:06,668 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1484681.3333333333, ans=0.2 2023-10-13 20:04:18,912 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1484961.3333333333, ans=0.1 2023-10-13 20:05:00,755 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.421e+02 1.829e+02 1.989e+02 2.181e+02 3.460e+02, threshold=3.977e+02, percent-clipped=0.0 2023-10-13 20:05:07,238 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1485101.3333333333, ans=0.125 2023-10-13 20:05:11,095 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1485148.0, ans=0.125 2023-10-13 20:05:27,328 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1485194.6666666667, ans=0.125 2023-10-13 20:05:36,023 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1485194.6666666667, ans=0.125 2023-10-13 
20:06:58,304 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1485474.6666666667, ans=0.125 2023-10-13 20:06:58,306 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1485474.6666666667, ans=0.1 2023-10-13 20:07:17,042 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 20:07:19,210 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.84 vs. limit=15.0 2023-10-13 20:07:26,405 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.536e+02 1.799e+02 1.940e+02 2.127e+02 2.910e+02, threshold=3.880e+02, percent-clipped=0.0 2023-10-13 20:07:41,401 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.86 vs. limit=10.0 2023-10-13 20:07:51,999 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1485661.3333333333, ans=0.125 2023-10-13 20:08:00,287 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1485708.0, ans=0.2 2023-10-13 20:08:22,089 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1485801.3333333333, ans=0.125 2023-10-13 20:08:23,291 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.76 vs. limit=10.0 2023-10-13 20:08:27,812 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1485801.3333333333, ans=0.0 2023-10-13 20:08:33,403 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1485848.0, ans=0.125 2023-10-13 20:09:22,260 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1486034.6666666667, ans=0.125 2023-10-13 20:09:25,675 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 1.838e+02 1.984e+02 2.202e+02 2.845e+02, threshold=3.969e+02, percent-clipped=0.0 2023-10-13 20:09:37,990 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1486081.3333333333, ans=0.2 2023-10-13 20:10:11,356 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1486221.3333333333, ans=0.1 2023-10-13 20:10:48,602 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.64 vs. 
limit=15.0 2023-10-13 20:10:59,920 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1486361.3333333333, ans=0.2 2023-10-13 20:11:03,050 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1486408.0, ans=0.2 2023-10-13 20:11:08,088 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1486408.0, ans=0.09899494936611666 2023-10-13 20:11:32,247 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1486501.3333333333, ans=0.125 2023-10-13 20:11:32,881 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.27 vs. limit=22.5 2023-10-13 20:11:34,219 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.506e+02 1.797e+02 1.973e+02 2.149e+02 3.326e+02, threshold=3.946e+02, percent-clipped=0.0 2023-10-13 20:12:07,115 INFO [train.py:1031] (0/4) Epoch 24, batch 4500, loss[loss=0.1819, simple_loss=0.2727, pruned_loss=0.04549, over 16653.00 frames. ], tot_loss[loss=0.1875, simple_loss=0.2793, pruned_loss=0.04788, over 29327687.70 frames. ], batch size: 241, lr: 1.44e-03, grad_scale: 16.0 2023-10-13 20:12:24,588 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1486688.0, ans=0.09899494936611666 2023-10-13 20:12:47,923 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1486781.3333333333, ans=0.125 2023-10-13 20:12:54,641 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1486828.0, ans=0.125 2023-10-13 20:12:57,167 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1486828.0, ans=0.125 2023-10-13 20:13:06,843 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1486874.6666666667, ans=0.0 2023-10-13 20:13:21,101 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1486921.3333333333, ans=0.09899494936611666 2023-10-13 20:13:36,504 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.484e+02 1.748e+02 1.897e+02 2.072e+02 3.009e+02, threshold=3.793e+02, percent-clipped=0.0 2023-10-13 20:13:50,203 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1487014.6666666667, ans=0.2 2023-10-13 20:14:07,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1487108.0, ans=0.1 2023-10-13 20:14:14,499 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1487108.0, ans=0.0 2023-10-13 20:14:40,817 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1487248.0, ans=0.2 2023-10-13 20:15:11,549 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1487388.0, ans=0.05 2023-10-13 20:15:21,962 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1487388.0, ans=0.0 2023-10-13 20:15:23,842 INFO [scaling.py:199] (0/4) 
ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1487388.0, ans=0.125 2023-10-13 20:15:32,618 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.515e+02 1.814e+02 1.983e+02 2.147e+02 2.843e+02, threshold=3.966e+02, percent-clipped=0.0 2023-10-13 20:15:41,028 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1487481.3333333333, ans=0.0 2023-10-13 20:15:55,515 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1487528.0, ans=0.125 2023-10-13 20:15:58,972 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1487528.0, ans=0.125 2023-10-13 20:15:59,318 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.55 vs. limit=6.0 2023-10-13 20:16:11,084 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 20:16:13,407 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1487574.6666666667, ans=0.0 2023-10-13 20:16:35,843 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.30 vs. limit=15.0 2023-10-13 20:16:39,441 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1487668.0, ans=0.2 2023-10-13 20:16:51,783 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1487714.6666666667, ans=0.125 2023-10-13 20:16:55,886 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1487714.6666666667, ans=0.0 2023-10-13 20:17:03,379 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1487761.3333333333, ans=0.125 2023-10-13 20:17:12,983 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.59 vs. limit=15.0 2023-10-13 20:17:16,799 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1487808.0, ans=0.125 2023-10-13 20:17:25,064 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1487854.6666666667, ans=0.125 2023-10-13 20:17:39,582 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.609e+02 1.817e+02 1.989e+02 2.174e+02 3.562e+02, threshold=3.977e+02, percent-clipped=0.0 2023-10-13 20:17:43,335 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1487901.3333333333, ans=0.0 2023-10-13 20:17:48,903 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1487948.0, ans=0.015 2023-10-13 20:17:53,204 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.72 vs. 
limit=15.0 2023-10-13 20:18:02,451 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=7.98 vs. limit=22.5 2023-10-13 20:18:11,412 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1488041.3333333333, ans=0.2 2023-10-13 20:18:19,533 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1488088.0, ans=0.1 2023-10-13 20:18:21,616 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.45 vs. limit=15.0 2023-10-13 20:18:30,032 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1488088.0, ans=0.125 2023-10-13 20:18:33,384 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.39 vs. limit=22.5 2023-10-13 20:18:40,224 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1488134.6666666667, ans=0.1 2023-10-13 20:18:51,561 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1488181.3333333333, ans=0.0 2023-10-13 20:18:52,817 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1488181.3333333333, ans=0.125 2023-10-13 20:18:57,269 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1488181.3333333333, ans=0.125 2023-10-13 20:19:22,063 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1488274.6666666667, ans=0.025 2023-10-13 20:19:41,880 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 1.802e+02 1.954e+02 2.159e+02 2.924e+02, threshold=3.908e+02, percent-clipped=0.0 2023-10-13 20:20:06,222 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1488461.3333333333, ans=0.2 2023-10-13 20:20:28,066 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1488554.6666666667, ans=0.04949747468305833 2023-10-13 20:20:38,210 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1488601.3333333333, ans=0.125 2023-10-13 20:20:43,492 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1488601.3333333333, ans=0.0 2023-10-13 20:21:05,515 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1488694.6666666667, ans=0.125 2023-10-13 20:21:09,500 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1488694.6666666667, ans=0.125 2023-10-13 20:21:14,834 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1488694.6666666667, ans=0.125 2023-10-13 20:21:21,485 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, 
batch_count=1488741.3333333333, ans=0.0 2023-10-13 20:21:33,576 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1488788.0, ans=0.125 2023-10-13 20:21:39,457 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.93 vs. limit=12.0 2023-10-13 20:21:50,277 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.535e+02 1.829e+02 1.969e+02 2.151e+02 3.037e+02, threshold=3.937e+02, percent-clipped=0.0 2023-10-13 20:21:50,550 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1488834.6666666667, ans=0.2 2023-10-13 20:21:59,803 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1488881.3333333333, ans=0.2 2023-10-13 20:21:59,905 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1488881.3333333333, ans=0.125 2023-10-13 20:22:00,934 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1488881.3333333333, ans=0.125 2023-10-13 20:22:10,697 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1488928.0, ans=0.2 2023-10-13 20:22:12,855 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1488928.0, ans=0.0 2023-10-13 20:22:22,487 INFO [train.py:1031] (0/4) Epoch 24, batch 5000, loss[loss=0.2161, simple_loss=0.2989, pruned_loss=0.06671, over 16597.00 frames. ], tot_loss[loss=0.1877, simple_loss=0.2792, pruned_loss=0.04807, over 30099345.48 frames. ], batch size: 56, lr: 1.44e-03, grad_scale: 16.0 2023-10-13 20:22:43,054 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1489021.3333333333, ans=0.0 2023-10-13 20:22:51,101 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1489068.0, ans=0.125 2023-10-13 20:23:12,002 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.83 vs. limit=15.0 2023-10-13 20:23:26,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1489208.0, ans=0.0 2023-10-13 20:23:33,234 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1489254.6666666667, ans=0.0 2023-10-13 20:23:53,438 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.819e+02 1.959e+02 2.203e+02 3.725e+02, threshold=3.918e+02, percent-clipped=0.0 2023-10-13 20:23:55,503 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1489301.3333333333, ans=0.0 2023-10-13 20:24:27,018 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1489441.3333333333, ans=0.125 2023-10-13 20:24:27,077 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 20:24:45,859 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.00 vs. 
limit=15.0 2023-10-13 20:25:14,358 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1489581.3333333333, ans=0.1 2023-10-13 20:25:16,597 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.36 vs. limit=6.0 2023-10-13 20:25:18,860 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1489628.0, ans=0.125 2023-10-13 20:25:41,997 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1489721.3333333333, ans=0.015 2023-10-13 20:25:46,591 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1489721.3333333333, ans=0.125 2023-10-13 20:26:00,109 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1489768.0, ans=0.5 2023-10-13 20:26:00,969 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.835e+02 1.945e+02 2.232e+02 2.832e+02, threshold=3.889e+02, percent-clipped=0.0 2023-10-13 20:26:10,277 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.93 vs. limit=22.5 2023-10-13 20:26:22,105 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.57 vs. limit=15.0 2023-10-13 20:26:27,083 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1489861.3333333333, ans=0.125 2023-10-13 20:26:45,206 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1489954.6666666667, ans=0.1 2023-10-13 20:26:50,189 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.83 vs. limit=15.0 2023-10-13 20:27:05,113 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.05 vs. limit=15.0 2023-10-13 20:27:18,884 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1490048.0, ans=0.125 2023-10-13 20:27:36,864 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.22 vs. limit=22.5 2023-10-13 20:27:52,904 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.23 vs. 
limit=6.0 2023-10-13 20:28:20,275 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.536e+02 1.900e+02 2.127e+02 2.360e+02 3.191e+02, threshold=4.254e+02, percent-clipped=0.0 2023-10-13 20:28:51,289 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1490374.6666666667, ans=0.07 2023-10-13 20:28:52,835 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1490374.6666666667, ans=0.1 2023-10-13 20:28:58,793 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1490374.6666666667, ans=0.125 2023-10-13 20:29:01,511 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1490421.3333333333, ans=0.125 2023-10-13 20:29:15,953 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1490468.0, ans=0.125 2023-10-13 20:29:48,791 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1490561.3333333333, ans=0.2 2023-10-13 20:29:52,482 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.86 vs. limit=22.5 2023-10-13 20:30:08,802 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1490654.6666666667, ans=0.2 2023-10-13 20:30:24,074 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1490701.3333333333, ans=0.0 2023-10-13 20:30:26,607 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1490701.3333333333, ans=0.125 2023-10-13 20:30:35,645 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.458e+02 1.745e+02 1.939e+02 2.233e+02 2.996e+02, threshold=3.878e+02, percent-clipped=0.0 2023-10-13 20:30:49,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1490794.6666666667, ans=0.125 2023-10-13 20:31:00,366 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1490794.6666666667, ans=0.125 2023-10-13 20:31:07,890 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1490841.3333333333, ans=0.125 2023-10-13 20:31:23,141 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1490888.0, ans=0.125 2023-10-13 20:32:15,891 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1491074.6666666667, ans=0.125 2023-10-13 20:32:19,020 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1491074.6666666667, ans=0.1 2023-10-13 20:32:26,957 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1491121.3333333333, ans=0.125 2023-10-13 20:32:39,350 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1491168.0, ans=0.125 2023-10-13 20:32:44,061 INFO 
[optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.572e+02 1.763e+02 1.916e+02 2.100e+02 3.166e+02, threshold=3.832e+02, percent-clipped=0.0 2023-10-13 20:32:44,259 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1491168.0, ans=0.0 2023-10-13 20:33:08,937 INFO [train.py:1031] (0/4) Epoch 24, batch 5500, loss[loss=0.1781, simple_loss=0.2694, pruned_loss=0.04344, over 16902.00 frames. ], tot_loss[loss=0.1875, simple_loss=0.2791, pruned_loss=0.04797, over 30729131.18 frames. ], batch size: 72, lr: 1.44e-03, grad_scale: 8.0 2023-10-13 20:33:24,910 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1491354.6666666667, ans=0.0 2023-10-13 20:33:41,951 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1491401.3333333333, ans=0.0 2023-10-13 20:33:56,455 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1491494.6666666667, ans=0.0 2023-10-13 20:34:00,502 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.49 vs. limit=22.5 2023-10-13 20:34:13,977 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1491541.3333333333, ans=0.1 2023-10-13 20:34:33,953 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1491634.6666666667, ans=0.125 2023-10-13 20:34:34,948 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1491634.6666666667, ans=0.125 2023-10-13 20:34:39,641 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.63 vs. 
limit=15.0 2023-10-13 20:34:41,149 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 1.786e+02 1.976e+02 2.269e+02 3.044e+02, threshold=3.951e+02, percent-clipped=0.0 2023-10-13 20:34:42,667 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1491681.3333333333, ans=0.0 2023-10-13 20:34:52,401 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1491681.3333333333, ans=0.125 2023-10-13 20:34:59,941 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1491728.0, ans=0.0 2023-10-13 20:35:21,798 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1491821.3333333333, ans=0.1 2023-10-13 20:35:24,059 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1491821.3333333333, ans=0.125 2023-10-13 20:35:28,966 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1491868.0, ans=0.2 2023-10-13 20:35:31,852 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1491868.0, ans=0.125 2023-10-13 20:35:34,335 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1491868.0, ans=0.125 2023-10-13 20:35:42,279 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.71 vs. limit=6.0 2023-10-13 20:35:55,414 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.18 vs. limit=6.0 2023-10-13 20:36:07,064 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.70 vs. limit=22.5 2023-10-13 20:36:35,897 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.488e+02 1.765e+02 1.937e+02 2.133e+02 2.789e+02, threshold=3.874e+02, percent-clipped=0.0 2023-10-13 20:36:41,502 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1492148.0, ans=0.125 2023-10-13 20:36:51,009 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.05 vs. limit=10.0 2023-10-13 20:36:55,635 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.whiten.whitening_limit, batch_count=1492194.6666666667, ans=12.0 2023-10-13 20:37:02,608 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1492241.3333333333, ans=0.125 2023-10-13 20:37:29,244 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1492334.6666666667, ans=0.0 2023-10-13 20:37:57,625 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.46 vs. 
limit=15.0 2023-10-13 20:38:31,439 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1492568.0, ans=0.1 2023-10-13 20:38:41,647 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. limit=6.0 2023-10-13 20:38:41,909 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.530e+02 1.815e+02 1.960e+02 2.161e+02 2.790e+02, threshold=3.920e+02, percent-clipped=0.0 2023-10-13 20:38:47,228 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.15 vs. limit=15.0 2023-10-13 20:38:58,268 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1492661.3333333333, ans=0.125 2023-10-13 20:39:11,307 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.56 vs. limit=22.5 2023-10-13 20:39:14,668 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1492708.0, ans=0.0 2023-10-13 20:39:29,948 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1492754.6666666667, ans=0.125 2023-10-13 20:39:50,543 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1492848.0, ans=0.1 2023-10-13 20:40:05,259 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1492894.6666666667, ans=0.0 2023-10-13 20:40:27,996 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1492988.0, ans=0.2 2023-10-13 20:40:41,837 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.40 vs. 
limit=15.0 2023-10-13 20:40:51,686 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.843e+02 1.993e+02 2.201e+02 2.970e+02, threshold=3.986e+02, percent-clipped=0.0 2023-10-13 20:41:09,273 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1493128.0, ans=0.1 2023-10-13 20:41:31,199 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1493174.6666666667, ans=0.0 2023-10-13 20:41:37,938 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-13 20:41:54,886 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1493268.0, ans=0.125 2023-10-13 20:42:09,998 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-320000.pt 2023-10-13 20:42:40,623 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1493408.0, ans=0.125 2023-10-13 20:43:06,434 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1493501.3333333333, ans=0.0 2023-10-13 20:43:15,806 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.363e+02 1.859e+02 2.041e+02 2.194e+02 3.007e+02, threshold=4.081e+02, percent-clipped=0.0 2023-10-13 20:43:19,652 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1493548.0, ans=0.2 2023-10-13 20:43:28,815 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=1493594.6666666667, ans=0.1 2023-10-13 20:43:38,996 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1493594.6666666667, ans=0.125 2023-10-13 20:43:42,324 INFO [train.py:1031] (0/4) Epoch 24, batch 6000, loss[loss=0.1835, simple_loss=0.2759, pruned_loss=0.04555, over 15454.00 frames. ], tot_loss[loss=0.1876, simple_loss=0.2792, pruned_loss=0.048, over 31189817.69 frames. ], batch size: 35, lr: 1.44e-03, grad_scale: 16.0 2023-10-13 20:43:55,697 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.28 vs. limit=22.5 2023-10-13 20:44:00,052 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-13 20:44:02,791 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.17 vs. limit=22.5 2023-10-13 20:44:07,864 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.09 vs. 
limit=15.0 2023-10-13 20:44:45,147 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1493828.0, ans=0.125 2023-10-13 20:44:56,324 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1493874.6666666667, ans=0.1 2023-10-13 20:45:12,464 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1493968.0, ans=0.1 2023-10-13 20:45:23,624 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.551e+02 1.806e+02 2.001e+02 2.189e+02 3.130e+02, threshold=4.001e+02, percent-clipped=0.0 2023-10-13 20:45:28,433 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.76 vs. limit=6.0 2023-10-13 20:46:00,025 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1494108.0, ans=0.0 2023-10-13 20:46:16,851 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.53 vs. limit=15.0 2023-10-13 20:46:21,332 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.05 vs. limit=12.0 2023-10-13 20:46:43,304 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1494294.6666666667, ans=0.125 2023-10-13 20:46:52,391 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1494341.3333333333, ans=0.125 2023-10-13 20:46:55,965 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1494341.3333333333, ans=0.0 2023-10-13 20:47:02,985 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.55 vs. limit=12.0 2023-10-13 20:47:05,423 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.67 vs. limit=15.0 2023-10-13 20:47:07,800 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1494388.0, ans=0.2 2023-10-13 20:47:29,689 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.571e+02 1.863e+02 2.002e+02 2.323e+02 3.238e+02, threshold=4.004e+02, percent-clipped=0.0 2023-10-13 20:47:33,474 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1494481.3333333333, ans=0.125 2023-10-13 20:47:37,833 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.59 vs. limit=15.0 2023-10-13 20:48:01,667 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.58 vs. 
limit=22.5 2023-10-13 20:48:04,896 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1494574.6666666667, ans=0.125 2023-10-13 20:48:14,252 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1494621.3333333333, ans=0.125 2023-10-13 20:48:14,385 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten.whitening_limit, batch_count=1494621.3333333333, ans=22.5 2023-10-13 20:48:24,121 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1494668.0, ans=0.0 2023-10-13 20:48:28,464 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 20:48:29,825 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.47 vs. limit=15.0 2023-10-13 20:49:13,069 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1494854.6666666667, ans=0.1 2023-10-13 20:49:39,329 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.440e+02 1.830e+02 2.035e+02 2.330e+02 3.349e+02, threshold=4.070e+02, percent-clipped=0.0 2023-10-13 20:50:09,473 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1495041.3333333333, ans=0.1 2023-10-13 20:50:09,849 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.93 vs. limit=15.0 2023-10-13 20:50:10,859 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.32 vs. limit=15.0 2023-10-13 20:50:36,179 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1495134.6666666667, ans=0.125 2023-10-13 20:50:36,199 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 20:50:48,605 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.74 vs. limit=22.5 2023-10-13 20:50:54,329 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.03 vs. 
limit=22.5 2023-10-13 20:51:01,022 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1495228.0, ans=0.1 2023-10-13 20:51:05,399 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1495228.0, ans=0.125 2023-10-13 20:51:05,549 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 20:51:12,479 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1495274.6666666667, ans=0.125 2023-10-13 20:51:24,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1495274.6666666667, ans=0.09899494936611666 2023-10-13 20:51:26,669 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.49 vs. limit=22.5 2023-10-13 20:51:29,763 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.76 vs. limit=15.0 2023-10-13 20:51:37,992 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.82 vs. limit=12.0 2023-10-13 20:51:43,091 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.92 vs. limit=15.0 2023-10-13 20:51:47,121 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.42 vs. limit=15.0 2023-10-13 20:51:52,774 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.535e+02 1.810e+02 2.047e+02 2.268e+02 5.508e+02, threshold=4.094e+02, percent-clipped=1.0 2023-10-13 20:51:56,951 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1495414.6666666667, ans=0.0 2023-10-13 20:52:34,729 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1495554.6666666667, ans=0.125 2023-10-13 20:52:58,810 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.21 vs. 
limit=22.5 2023-10-13 20:53:07,844 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1495694.6666666667, ans=0.1 2023-10-13 20:53:28,993 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1495788.0, ans=0.1 2023-10-13 20:53:34,883 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1495788.0, ans=0.015 2023-10-13 20:53:55,841 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.499e+02 1.804e+02 1.971e+02 2.270e+02 3.189e+02, threshold=3.942e+02, percent-clipped=0.0 2023-10-13 20:54:06,838 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1495928.0, ans=0.125 2023-10-13 20:54:11,111 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1495928.0, ans=0.04949747468305833 2023-10-13 20:54:11,354 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.84 vs. limit=15.0 2023-10-13 20:54:19,769 INFO [train.py:1031] (0/4) Epoch 24, batch 6500, loss[loss=0.2043, simple_loss=0.2898, pruned_loss=0.05941, over 15951.00 frames. ], tot_loss[loss=0.188, simple_loss=0.2796, pruned_loss=0.04814, over 31525644.52 frames. ], batch size: 296, lr: 1.43e-03, grad_scale: 16.0 2023-10-13 20:54:21,893 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.81 vs. limit=15.0 2023-10-13 20:54:38,787 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.31 vs. limit=15.0 2023-10-13 20:54:41,956 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1496021.3333333333, ans=0.0 2023-10-13 20:55:04,624 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.65 vs. limit=6.0 2023-10-13 20:55:08,139 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1496114.6666666667, ans=0.125 2023-10-13 20:55:18,348 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1496114.6666666667, ans=0.125 2023-10-13 20:55:19,747 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=5.47 vs. limit=15.0 2023-10-13 20:55:25,801 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1496161.3333333333, ans=0.0 2023-10-13 20:55:27,284 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.05 vs. limit=15.0 2023-10-13 20:55:33,824 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.83 vs. 
limit=15.0 2023-10-13 20:56:01,157 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1496301.3333333333, ans=0.0 2023-10-13 20:56:13,909 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.563e+02 1.824e+02 2.010e+02 2.224e+02 3.840e+02, threshold=4.020e+02, percent-clipped=0.0 2023-10-13 20:56:19,587 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.58 vs. limit=6.0 2023-10-13 20:56:24,704 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1496394.6666666667, ans=0.0 2023-10-13 20:56:55,554 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1496488.0, ans=0.0 2023-10-13 20:57:22,074 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1496581.3333333333, ans=0.1 2023-10-13 20:57:26,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1496628.0, ans=0.125 2023-10-13 20:57:28,648 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1496628.0, ans=0.2 2023-10-13 20:57:39,950 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.67 vs. limit=6.0 2023-10-13 20:57:39,958 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten.whitening_limit, batch_count=1496674.6666666667, ans=22.5 2023-10-13 20:57:50,611 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1496674.6666666667, ans=0.125 2023-10-13 20:57:58,348 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1496721.3333333333, ans=0.125 2023-10-13 20:58:16,449 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.567e+02 1.827e+02 1.985e+02 2.156e+02 2.860e+02, threshold=3.969e+02, percent-clipped=0.0 2023-10-13 20:58:20,483 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1496814.6666666667, ans=0.125 2023-10-13 20:58:30,287 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1496861.3333333333, ans=0.2 2023-10-13 20:58:35,177 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1496861.3333333333, ans=10.0 2023-10-13 20:58:39,105 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1496861.3333333333, ans=0.0 2023-10-13 20:58:41,883 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1496908.0, ans=0.0 2023-10-13 20:58:49,920 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 21:00:02,453 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.15 vs. 
limit=10.0 2023-10-13 21:00:08,874 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1497234.6666666667, ans=0.125 2023-10-13 21:00:19,974 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.534e+02 1.772e+02 1.893e+02 2.084e+02 2.902e+02, threshold=3.785e+02, percent-clipped=0.0 2023-10-13 21:00:55,355 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1497374.6666666667, ans=0.0 2023-10-13 21:01:19,365 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1497421.3333333333, ans=0.1 2023-10-13 21:01:29,053 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1497468.0, ans=0.125 2023-10-13 21:01:34,490 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1497468.0, ans=0.2 2023-10-13 21:01:49,421 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.12 vs. limit=22.5 2023-10-13 21:01:52,891 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1497561.3333333333, ans=0.125 2023-10-13 21:02:40,581 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1497701.3333333333, ans=0.125 2023-10-13 21:02:49,019 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 1.686e+02 1.833e+02 2.073e+02 2.986e+02, threshold=3.666e+02, percent-clipped=0.0 2023-10-13 21:02:49,289 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1497748.0, ans=0.0 2023-10-13 21:03:01,256 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.83 vs. limit=15.0 2023-10-13 21:03:05,739 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.39 vs. limit=6.0 2023-10-13 21:03:13,939 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.94 vs. limit=15.0 2023-10-13 21:03:21,657 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1497888.0, ans=0.0 2023-10-13 21:03:48,558 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.69 vs. 
limit=15.0 2023-10-13 21:04:02,078 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1497981.3333333333, ans=0.07 2023-10-13 21:04:13,573 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1498028.0, ans=0.2 2023-10-13 21:04:28,247 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1498074.6666666667, ans=0.125 2023-10-13 21:04:31,528 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1498074.6666666667, ans=0.0 2023-10-13 21:05:00,880 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.851e+02 2.104e+02 2.321e+02 3.870e+02, threshold=4.208e+02, percent-clipped=1.0 2023-10-13 21:05:04,143 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=1498214.6666666667, ans=0.02 2023-10-13 21:05:22,729 INFO [train.py:1031] (0/4) Epoch 24, batch 7000, loss[loss=0.2237, simple_loss=0.3036, pruned_loss=0.07195, over 16020.00 frames. ], tot_loss[loss=0.1878, simple_loss=0.2798, pruned_loss=0.0479, over 31795211.70 frames. ], batch size: 296, lr: 1.43e-03, grad_scale: 16.0 2023-10-13 21:05:48,035 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1498354.6666666667, ans=0.0 2023-10-13 21:06:12,108 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.27 vs. limit=15.0 2023-10-13 21:07:07,797 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1498634.6666666667, ans=0.2 2023-10-13 21:07:07,980 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.20 vs. limit=15.0 2023-10-13 21:07:10,733 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1498634.6666666667, ans=0.2 2023-10-13 21:07:12,862 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.64 vs. 
limit=15.0 2023-10-13 21:07:14,086 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.490e+02 1.780e+02 1.929e+02 2.087e+02 3.734e+02, threshold=3.858e+02, percent-clipped=0.0 2023-10-13 21:07:21,633 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1498681.3333333333, ans=0.125 2023-10-13 21:07:32,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1498728.0, ans=0.125 2023-10-13 21:07:41,739 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1498774.6666666667, ans=0.0 2023-10-13 21:07:48,930 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1498821.3333333333, ans=0.125 2023-10-13 21:08:21,762 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2.whitening_limit, batch_count=1498914.6666666667, ans=15.0 2023-10-13 21:08:42,078 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1499008.0, ans=0.125 2023-10-13 21:09:14,888 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.799e+02 1.931e+02 2.121e+02 2.811e+02, threshold=3.863e+02, percent-clipped=0.0 2023-10-13 21:09:49,011 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1499241.3333333333, ans=0.125 2023-10-13 21:09:54,409 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1499241.3333333333, ans=0.2 2023-10-13 21:10:15,854 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1499288.0, ans=0.0 2023-10-13 21:10:25,283 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1499334.6666666667, ans=0.0 2023-10-13 21:10:27,285 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1499334.6666666667, ans=0.2 2023-10-13 21:10:32,782 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1499381.3333333333, ans=0.125 2023-10-13 21:10:51,307 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1499428.0, ans=0.1 2023-10-13 21:11:40,532 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1499568.0, ans=0.125 2023-10-13 21:11:40,754 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.93 vs. 
limit=15.0 2023-10-13 21:11:48,400 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.538e+02 1.761e+02 1.974e+02 2.191e+02 4.595e+02, threshold=3.948e+02, percent-clipped=1.0 2023-10-13 21:11:55,613 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1499614.6666666667, ans=0.2 2023-10-13 21:12:12,792 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1499708.0, ans=0.2 2023-10-13 21:12:19,420 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1499708.0, ans=0.2 2023-10-13 21:12:30,087 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.25 vs. limit=22.5 2023-10-13 21:12:58,796 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1499848.0, ans=0.125 2023-10-13 21:13:13,780 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 21:13:13,792 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1499894.6666666667, ans=0.125 2023-10-13 21:13:14,876 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1499894.6666666667, ans=0.2 2023-10-13 21:13:16,959 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1499894.6666666667, ans=0.0 2023-10-13 21:13:38,283 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.87 vs. limit=22.5 2023-10-13 21:13:40,099 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1499988.0, ans=0.0 2023-10-13 21:13:49,488 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1499988.0, ans=0.0 2023-10-13 21:13:59,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1500034.6666666667, ans=0.1 2023-10-13 21:14:06,826 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 1.748e+02 1.903e+02 2.094e+02 3.174e+02, threshold=3.806e+02, percent-clipped=0.0 2023-10-13 21:14:22,724 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1500128.0, ans=0.1 2023-10-13 21:14:32,701 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1500174.6666666667, ans=0.125 2023-10-13 21:14:43,297 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.90 vs. limit=6.0 2023-10-13 21:14:53,993 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1500268.0, ans=0.1 2023-10-13 21:15:00,963 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.09 vs. 
limit=22.5 2023-10-13 21:15:03,809 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1500314.6666666667, ans=0.1 2023-10-13 21:15:10,432 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.10 vs. limit=22.5 2023-10-13 21:15:16,439 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1500361.3333333333, ans=0.2 2023-10-13 21:15:18,765 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1500361.3333333333, ans=0.1 2023-10-13 21:15:24,335 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.82 vs. limit=12.0 2023-10-13 21:15:30,602 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.58 vs. limit=15.0 2023-10-13 21:15:35,081 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_na.min_abs, batch_count=1500408.0, ans=0.02 2023-10-13 21:15:38,041 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1500408.0, ans=0.0 2023-10-13 21:15:51,037 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1500501.3333333333, ans=0.0 2023-10-13 21:16:08,591 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.494e+02 1.833e+02 2.032e+02 2.250e+02 3.239e+02, threshold=4.063e+02, percent-clipped=0.0 2023-10-13 21:16:09,915 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1500548.0, ans=0.04949747468305833 2023-10-13 21:16:18,184 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.86 vs. limit=15.0 2023-10-13 21:16:27,286 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1500594.6666666667, ans=0.0 2023-10-13 21:16:31,129 INFO [train.py:1031] (0/4) Epoch 24, batch 7500, loss[loss=0.1888, simple_loss=0.2803, pruned_loss=0.04867, over 16850.00 frames. ], tot_loss[loss=0.1878, simple_loss=0.2798, pruned_loss=0.04791, over 32016654.16 frames. ], batch size: 175, lr: 1.43e-03, grad_scale: 16.0 2023-10-13 21:16:39,259 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1500641.3333333333, ans=0.2 2023-10-13 21:16:46,441 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1500688.0, ans=0.1 2023-10-13 21:16:46,580 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.67 vs. 
limit=15.0 2023-10-13 21:16:52,531 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1500688.0, ans=0.1 2023-10-13 21:16:53,694 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1500688.0, ans=0.125 2023-10-13 21:17:06,705 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1500734.6666666667, ans=0.125 2023-10-13 21:18:22,649 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1501014.6666666667, ans=0.1 2023-10-13 21:18:25,095 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.23 vs. limit=15.0 2023-10-13 21:18:25,360 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.541e+02 1.856e+02 2.084e+02 2.355e+02 3.282e+02, threshold=4.168e+02, percent-clipped=0.0 2023-10-13 21:18:53,338 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1501108.0, ans=0.125 2023-10-13 21:19:02,174 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.93 vs. limit=15.0 2023-10-13 21:19:02,970 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1501154.6666666667, ans=0.125 2023-10-13 21:19:06,666 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.48 vs. limit=22.5 2023-10-13 21:19:28,045 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1501201.3333333333, ans=0.2 2023-10-13 21:19:36,809 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1501248.0, ans=0.0 2023-10-13 21:19:39,187 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1501248.0, ans=0.2 2023-10-13 21:19:43,322 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=11.83 vs. limit=22.5 2023-10-13 21:20:40,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1501434.6666666667, ans=0.1 2023-10-13 21:20:44,098 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1501434.6666666667, ans=0.1 2023-10-13 21:20:48,688 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.421e+02 1.872e+02 2.068e+02 2.239e+02 3.340e+02, threshold=4.136e+02, percent-clipped=0.0 2023-10-13 21:21:03,102 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1501528.0, ans=0.0 2023-10-13 21:21:08,431 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.08 vs. 
limit=12.0 2023-10-13 21:21:24,945 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1501621.3333333333, ans=0.125 2023-10-13 21:21:34,593 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1501621.3333333333, ans=0.2 2023-10-13 21:21:46,617 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1501668.0, ans=0.125 2023-10-13 21:22:20,161 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1501761.3333333333, ans=0.125 2023-10-13 21:22:30,479 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1501761.3333333333, ans=0.2 2023-10-13 21:22:32,638 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1501808.0, ans=0.125 2023-10-13 21:22:47,126 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.38 vs. limit=15.0 2023-10-13 21:23:08,675 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1501901.3333333333, ans=0.2 2023-10-13 21:23:20,853 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.458e+02 1.771e+02 1.944e+02 2.197e+02 2.926e+02, threshold=3.888e+02, percent-clipped=0.0 2023-10-13 21:23:27,879 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1501948.0, ans=0.0 2023-10-13 21:23:51,389 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=4.77 vs. limit=15.0 2023-10-13 21:24:35,028 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1502181.3333333333, ans=0.0 2023-10-13 21:25:16,341 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1502274.6666666667, ans=0.2 2023-10-13 21:25:17,547 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.98 vs. limit=15.0 2023-10-13 21:25:18,295 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1502274.6666666667, ans=0.0 2023-10-13 21:25:40,911 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1502368.0, ans=0.1 2023-10-13 21:25:56,833 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.527e+02 1.819e+02 1.925e+02 2.103e+02 2.878e+02, threshold=3.851e+02, percent-clipped=0.0 2023-10-13 21:26:09,219 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.15 vs. 
limit=15.0 2023-10-13 21:26:28,383 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1502508.0, ans=0.09899494936611666 2023-10-13 21:26:34,044 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1502554.6666666667, ans=0.2 2023-10-13 21:26:42,693 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1502554.6666666667, ans=0.0 2023-10-13 21:26:52,200 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1502601.3333333333, ans=0.1 2023-10-13 21:26:54,509 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1502601.3333333333, ans=0.05 2023-10-13 21:27:08,848 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.65 vs. limit=12.0 2023-10-13 21:27:09,837 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1502648.0, ans=0.2 2023-10-13 21:27:11,738 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1502648.0, ans=0.1 2023-10-13 21:27:33,542 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1502741.3333333333, ans=0.2 2023-10-13 21:27:40,742 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1502788.0, ans=0.125 2023-10-13 21:27:48,757 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1502788.0, ans=0.2 2023-10-13 21:27:53,502 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1502834.6666666667, ans=0.0 2023-10-13 21:28:09,461 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1502881.3333333333, ans=0.0 2023-10-13 21:28:10,223 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.707e+02 1.814e+02 1.947e+02 2.778e+02, threshold=3.629e+02, percent-clipped=0.0 2023-10-13 21:28:26,157 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1502928.0, ans=0.125 2023-10-13 21:28:26,173 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1502928.0, ans=0.125 2023-10-13 21:28:31,878 INFO [train.py:1031] (0/4) Epoch 24, batch 8000, loss[loss=0.19, simple_loss=0.2764, pruned_loss=0.05183, over 16654.00 frames. ], tot_loss[loss=0.1869, simple_loss=0.2792, pruned_loss=0.04729, over 32223331.77 frames. 
], batch size: 56, lr: 1.43e-03, grad_scale: 32.0 2023-10-13 21:28:50,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1503021.3333333333, ans=0.07 2023-10-13 21:28:54,913 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1503021.3333333333, ans=0.125 2023-10-13 21:29:16,634 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1503114.6666666667, ans=0.125 2023-10-13 21:29:23,956 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1503161.3333333333, ans=0.1 2023-10-13 21:29:27,477 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1503161.3333333333, ans=0.0 2023-10-13 21:29:32,239 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.80 vs. limit=12.0 2023-10-13 21:29:40,676 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.97 vs. limit=22.5 2023-10-13 21:29:45,095 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-13 21:29:56,461 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1503254.6666666667, ans=0.125 2023-10-13 21:29:58,478 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1503301.3333333333, ans=0.125 2023-10-13 21:30:00,360 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1503301.3333333333, ans=0.125 2023-10-13 21:30:01,198 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1503301.3333333333, ans=0.0 2023-10-13 21:30:04,072 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1503301.3333333333, ans=0.125 2023-10-13 21:30:13,549 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.512e+02 1.866e+02 2.082e+02 2.483e+02 3.426e+02, threshold=4.164e+02, percent-clipped=0.0 2023-10-13 21:30:18,954 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1503348.0, ans=0.0 2023-10-13 21:30:53,540 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1503488.0, ans=0.2 2023-10-13 21:31:03,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1503534.6666666667, ans=0.1 2023-10-13 21:31:05,946 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1503534.6666666667, ans=0.125 2023-10-13 21:31:41,440 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=13.12 vs. 
limit=15.0 2023-10-13 21:31:43,090 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1503674.6666666667, ans=0.0 2023-10-13 21:31:48,131 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1503674.6666666667, ans=0.0 2023-10-13 21:31:55,819 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1503721.3333333333, ans=0.125 2023-10-13 21:32:06,992 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1503721.3333333333, ans=0.0 2023-10-13 21:32:34,239 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.55 vs. limit=15.0 2023-10-13 21:32:34,444 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.760e+02 1.949e+02 2.138e+02 2.963e+02, threshold=3.899e+02, percent-clipped=0.0 2023-10-13 21:32:38,788 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.19 vs. limit=15.0 2023-10-13 21:32:53,710 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1503861.3333333333, ans=0.125 2023-10-13 21:32:57,007 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1503861.3333333333, ans=0.2 2023-10-13 21:32:59,439 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1503861.3333333333, ans=0.2 2023-10-13 21:33:17,422 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.05 vs. limit=15.0 2023-10-13 21:33:20,667 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1503954.6666666667, ans=0.125 2023-10-13 21:33:38,768 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.22 vs. limit=15.0 2023-10-13 21:33:41,018 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.18 vs. 
limit=6.0 2023-10-13 21:33:41,990 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1504001.3333333333, ans=0.125 2023-10-13 21:34:10,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1504141.3333333333, ans=0.1 2023-10-13 21:34:16,783 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_ff2.min_abs, batch_count=1504141.3333333333, ans=0.1 2023-10-13 21:34:19,856 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1504141.3333333333, ans=0.09899494936611666 2023-10-13 21:34:30,993 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1504188.0, ans=0.125 2023-10-13 21:34:32,001 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-13 21:34:43,279 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1504281.3333333333, ans=0.1 2023-10-13 21:34:47,464 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.468e+02 1.801e+02 1.915e+02 2.100e+02 3.063e+02, threshold=3.830e+02, percent-clipped=0.0 2023-10-13 21:34:57,874 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1504328.0, ans=0.2 2023-10-13 21:34:59,249 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1504328.0, ans=0.125 2023-10-13 21:34:59,255 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1504328.0, ans=0.125 2023-10-13 21:35:06,448 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1504328.0, ans=0.0 2023-10-13 21:35:07,909 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1504328.0, ans=0.125 2023-10-13 21:35:30,005 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.49 vs. limit=22.5 2023-10-13 21:35:43,754 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1504468.0, ans=0.125 2023-10-13 21:36:41,373 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1504654.6666666667, ans=0.2 2023-10-13 21:37:05,872 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.764e+02 1.883e+02 2.123e+02 2.958e+02, threshold=3.766e+02, percent-clipped=0.0 2023-10-13 21:37:06,907 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1504748.0, ans=0.125 2023-10-13 21:37:10,775 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1504748.0, ans=0.125 2023-10-13 21:37:12,986 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.81 vs. 
limit=22.5 2023-10-13 21:37:14,644 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1504794.6666666667, ans=0.0 2023-10-13 21:37:23,915 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1504794.6666666667, ans=0.1 2023-10-13 21:37:32,838 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.27 vs. limit=15.0 2023-10-13 21:38:11,528 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1504981.3333333333, ans=0.1 2023-10-13 21:38:19,063 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1504981.3333333333, ans=0.125 2023-10-13 21:38:21,180 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.33 vs. limit=15.0 2023-10-13 21:38:26,772 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.49 vs. limit=10.0 2023-10-13 21:38:28,376 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1505028.0, ans=0.125 2023-10-13 21:38:42,231 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.91 vs. limit=22.5 2023-10-13 21:39:18,664 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1505168.0, ans=0.125 2023-10-13 21:39:20,694 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1505214.6666666667, ans=0.0 2023-10-13 21:39:27,199 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.520e+02 1.843e+02 1.978e+02 2.134e+02 3.003e+02, threshold=3.957e+02, percent-clipped=0.0 2023-10-13 21:39:30,343 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1505214.6666666667, ans=0.125 2023-10-13 21:39:40,051 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1505261.3333333333, ans=0.2 2023-10-13 21:39:50,168 INFO [train.py:1031] (0/4) Epoch 24, batch 8500, loss[loss=0.184, simple_loss=0.2861, pruned_loss=0.04095, over 16922.00 frames. ], tot_loss[loss=0.1868, simple_loss=0.2794, pruned_loss=0.04716, over 32358171.72 frames. 
], batch size: 93, lr: 1.43e-03, grad_scale: 16.0 2023-10-13 21:40:27,555 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=1505401.3333333333, ans=0.5 2023-10-13 21:40:53,326 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1505541.3333333333, ans=0.0 2023-10-13 21:41:20,079 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1505634.6666666667, ans=0.0 2023-10-13 21:41:21,339 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1505634.6666666667, ans=0.0 2023-10-13 21:41:38,317 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.549e+02 1.976e+02 2.106e+02 2.450e+02 3.139e+02, threshold=4.212e+02, percent-clipped=0.0 2023-10-13 21:41:45,533 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1505728.0, ans=0.125 2023-10-13 21:41:57,935 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.55 vs. limit=22.5 2023-10-13 21:42:05,300 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1505774.6666666667, ans=0.125 2023-10-13 21:42:18,656 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1505821.3333333333, ans=0.125 2023-10-13 21:42:42,926 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1505868.0, ans=0.1 2023-10-13 21:42:46,936 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.32 vs. limit=15.0 2023-10-13 21:42:49,019 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.75 vs. limit=15.0 2023-10-13 21:42:55,330 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 21:43:09,711 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1505961.3333333333, ans=0.2 2023-10-13 21:43:22,812 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.58 vs. limit=15.0 2023-10-13 21:43:26,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1506008.0, ans=0.125 2023-10-13 21:43:29,987 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1506054.6666666667, ans=0.5 2023-10-13 21:43:44,924 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1506101.3333333333, ans=0.125 2023-10-13 21:43:50,858 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=5.70 vs. 
limit=15.0 2023-10-13 21:44:07,314 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.735e+02 1.925e+02 2.341e+02 2.945e+02, threshold=3.850e+02, percent-clipped=0.0 2023-10-13 21:44:09,571 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1506148.0, ans=0.0 2023-10-13 21:44:28,272 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2023-10-13 21:44:30,065 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1506241.3333333333, ans=10.0 2023-10-13 21:44:34,018 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.16 vs. limit=22.5 2023-10-13 21:44:55,444 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.71 vs. limit=22.5 2023-10-13 21:45:30,955 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1506474.6666666667, ans=0.125 2023-10-13 21:45:40,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1506474.6666666667, ans=0.125 2023-10-13 21:45:55,104 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1506521.3333333333, ans=0.125 2023-10-13 21:45:56,733 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1506521.3333333333, ans=0.1 2023-10-13 21:46:03,809 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1506568.0, ans=0.125 2023-10-13 21:46:15,691 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1506614.6666666667, ans=0.09899494936611666 2023-10-13 21:46:24,302 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 1.755e+02 1.951e+02 2.208e+02 2.996e+02, threshold=3.902e+02, percent-clipped=0.0 2023-10-13 21:46:24,512 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1506614.6666666667, ans=10.0 2023-10-13 21:47:12,552 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1506801.3333333333, ans=0.0 2023-10-13 21:47:43,080 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.96 vs. 
limit=15.0 2023-10-13 21:47:46,996 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1506941.3333333333, ans=0.1 2023-10-13 21:47:47,038 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1506941.3333333333, ans=0.125 2023-10-13 21:47:58,608 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1506988.0, ans=0.125 2023-10-13 21:48:28,738 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1507081.3333333333, ans=0.125 2023-10-13 21:48:29,478 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.524e+02 1.777e+02 1.964e+02 2.237e+02 3.329e+02, threshold=3.929e+02, percent-clipped=0.0 2023-10-13 21:48:29,683 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1507081.3333333333, ans=0.125 2023-10-13 21:49:04,312 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.54 vs. limit=6.0 2023-10-13 21:49:08,290 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1507221.3333333333, ans=0.1 2023-10-13 21:49:10,456 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1507221.3333333333, ans=0.0 2023-10-13 21:49:11,444 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1507221.3333333333, ans=0.125 2023-10-13 21:49:26,541 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1507268.0, ans=0.0 2023-10-13 21:49:29,299 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=6.65 vs. limit=15.0 2023-10-13 21:49:47,054 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1507361.3333333333, ans=0.2 2023-10-13 21:49:56,423 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1507408.0, ans=0.2 2023-10-13 21:50:11,538 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1507454.6666666667, ans=0.05 2023-10-13 21:50:30,648 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1507548.0, ans=0.1 2023-10-13 21:50:37,413 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.539e+02 1.798e+02 1.992e+02 2.187e+02 2.759e+02, threshold=3.984e+02, percent-clipped=0.0 2023-10-13 21:50:54,676 INFO [train.py:1031] (0/4) Epoch 24, batch 9000, loss[loss=0.2006, simple_loss=0.3006, pruned_loss=0.05026, over 16823.00 frames. ], tot_loss[loss=0.1866, simple_loss=0.279, pruned_loss=0.04711, over 32475247.34 frames. 
], batch size: 188, lr: 1.43e-03, grad_scale: 16.0 2023-10-13 21:51:04,677 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1507641.3333333333, ans=0.0 2023-10-13 21:51:08,692 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1507688.0, ans=0.1 2023-10-13 21:51:29,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1507734.6666666667, ans=0.1 2023-10-13 21:51:38,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1507781.3333333333, ans=0.0 2023-10-13 21:51:50,312 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1507828.0, ans=0.05 2023-10-13 21:51:51,253 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1507828.0, ans=0.05 2023-10-13 21:51:55,124 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1507874.6666666667, ans=0.09899494936611666 2023-10-13 21:51:55,317 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.44 vs. limit=15.0 2023-10-13 21:52:07,837 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1507921.3333333333, ans=0.125 2023-10-13 21:52:19,406 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 21:52:22,362 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1507968.0, ans=0.0 2023-10-13 21:52:34,964 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.438e+02 1.777e+02 1.911e+02 2.138e+02 4.741e+02, threshold=3.821e+02, percent-clipped=1.0 2023-10-13 21:52:39,628 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1508061.3333333333, ans=0.125 2023-10-13 21:53:22,758 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1508201.3333333333, ans=0.0 2023-10-13 21:53:32,602 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1508248.0, ans=0.0 2023-10-13 21:53:45,901 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1508294.6666666667, ans=0.0 2023-10-13 21:53:45,966 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1508294.6666666667, ans=0.125 2023-10-13 21:53:48,305 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.26 vs. limit=15.0 2023-10-13 21:54:38,376 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=14.05 vs. 
limit=15.0 2023-10-13 21:54:40,135 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.525e+02 1.887e+02 2.094e+02 2.329e+02 3.341e+02, threshold=4.189e+02, percent-clipped=0.0 2023-10-13 21:54:46,161 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1508528.0, ans=0.0 2023-10-13 21:54:48,163 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1508528.0, ans=0.125 2023-10-13 21:54:55,744 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1508574.6666666667, ans=0.125 2023-10-13 21:54:58,904 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1508574.6666666667, ans=0.125 2023-10-13 21:55:39,265 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.07 vs. limit=15.0 2023-10-13 21:55:51,448 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1508761.3333333333, ans=0.2 2023-10-13 21:55:51,608 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1508761.3333333333, ans=0.125 2023-10-13 21:55:51,778 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.02 vs. limit=22.5 2023-10-13 21:56:02,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1508808.0, ans=0.0 2023-10-13 21:56:08,899 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.26 vs. 
limit=6.0 2023-10-13 21:56:10,824 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1508854.6666666667, ans=0.0 2023-10-13 21:56:21,200 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1508901.3333333333, ans=0.1 2023-10-13 21:56:21,252 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1508901.3333333333, ans=0.0 2023-10-13 21:56:33,668 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1508948.0, ans=0.125 2023-10-13 21:56:35,396 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.488e+02 1.847e+02 2.002e+02 2.224e+02 2.855e+02, threshold=4.004e+02, percent-clipped=0.0 2023-10-13 21:56:42,806 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 21:56:43,692 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1508994.6666666667, ans=0.125 2023-10-13 21:56:48,636 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1509041.3333333333, ans=0.0 2023-10-13 21:56:52,105 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1509041.3333333333, ans=0.0 2023-10-13 21:56:57,692 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.59 vs. limit=10.0 2023-10-13 21:57:16,326 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1509134.6666666667, ans=0.0 2023-10-13 21:57:18,794 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1509134.6666666667, ans=0.125 2023-10-13 21:57:21,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1509134.6666666667, ans=0.1 2023-10-13 21:57:23,495 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.18 vs. limit=22.5 2023-10-13 21:57:32,770 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1509181.3333333333, ans=0.0 2023-10-13 21:57:39,755 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1509228.0, ans=0.125 2023-10-13 21:57:53,108 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.17 vs. 
limit=12.0 2023-10-13 21:58:10,387 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1509321.3333333333, ans=0.2 2023-10-13 21:58:10,416 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1509321.3333333333, ans=0.125 2023-10-13 21:58:15,655 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1509321.3333333333, ans=0.125 2023-10-13 21:58:22,608 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1509368.0, ans=0.1 2023-10-13 21:58:28,277 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1509368.0, ans=0.1 2023-10-13 21:58:42,270 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=1509414.6666666667, ans=0.025 2023-10-13 21:58:43,910 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.476e+02 1.798e+02 1.973e+02 2.115e+02 3.364e+02, threshold=3.946e+02, percent-clipped=0.0 2023-10-13 21:58:58,457 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.28 vs. limit=10.0 2023-10-13 21:59:02,309 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1509508.0, ans=0.125 2023-10-13 21:59:31,472 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1509601.3333333333, ans=0.1 2023-10-13 21:59:58,353 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1509741.3333333333, ans=0.0 2023-10-13 22:00:00,796 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1509741.3333333333, ans=0.0 2023-10-13 22:00:02,415 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1509741.3333333333, ans=0.05 2023-10-13 22:00:26,839 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1509834.6666666667, ans=0.2 2023-10-13 22:00:46,324 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.835e+02 2.050e+02 2.297e+02 3.409e+02, threshold=4.100e+02, percent-clipped=0.0 2023-10-13 22:00:52,173 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-13 22:00:56,442 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1509928.0, ans=0.1 2023-10-13 22:00:56,560 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1509928.0, ans=0.125 2023-10-13 22:01:01,174 INFO [train.py:1031] (0/4) Epoch 24, batch 9500, loss[loss=0.1794, simple_loss=0.2785, pruned_loss=0.0402, over 16626.00 frames. ], tot_loss[loss=0.1873, simple_loss=0.2798, pruned_loss=0.0474, over 32574643.35 frames. 
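Each optim.py clipping record prints five quantiles (min, 25%, 50%, 75%, max) of recent per-step gradient norms together with a clipping threshold; in every record here the threshold is exactly Clipping_scale times the median (for example 2.0 * 1.925e+02 = 3.850e+02), and percent-clipped is the share of steps whose norm exceeded it. A sketch of that bookkeeping under those assumptions (clipping_stats is illustrative, not the optimizer's actual code):

import torch

def clipping_stats(recent_norms: torch.Tensor, clipping_scale: float = 2.0):
    """recent_norms: 1-D tensor of gradient norms from the last logging window."""
    q = torch.quantile(recent_norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * q[2]                     # scale times the median
    percent_clipped = 100.0 * (recent_norms > threshold).float().mean()
    return q, threshold, percent_clipped

norms = 190.0 + 30.0 * torch.randn(500).abs()             # synthetic window of norms
quartiles, threshold, pct = clipping_stats(norms)
print(quartiles.tolist(), float(threshold), float(pct))   # same shape as the log line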
], batch size: 241, lr: 1.43e-03, grad_scale: 16.0 2023-10-13 22:01:02,599 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1509974.6666666667, ans=0.125 2023-10-13 22:01:16,012 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1510021.3333333333, ans=0.125 2023-10-13 22:01:18,127 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.58 vs. limit=22.5 2023-10-13 22:01:46,398 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1510114.6666666667, ans=0.125 2023-10-13 22:01:50,876 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1510161.3333333333, ans=0.125 2023-10-13 22:01:52,509 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1510161.3333333333, ans=0.125 2023-10-13 22:01:53,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1510161.3333333333, ans=0.125 2023-10-13 22:01:56,629 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.35 vs. limit=15.0 2023-10-13 22:02:12,536 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.92 vs. limit=22.5 2023-10-13 22:02:13,888 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1510254.6666666667, ans=0.125 2023-10-13 22:02:23,448 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.50 vs. limit=15.0 2023-10-13 22:02:48,519 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1510348.0, ans=0.125 2023-10-13 22:02:48,686 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.83 vs. 
limit=12.0 2023-10-13 22:02:51,481 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.527e+02 1.807e+02 2.011e+02 2.342e+02 3.306e+02, threshold=4.023e+02, percent-clipped=0.0 2023-10-13 22:03:31,038 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1510488.0, ans=0.125 2023-10-13 22:04:08,547 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.142e-02 2023-10-13 22:04:22,752 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1510674.6666666667, ans=0.125 2023-10-13 22:04:26,075 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1510674.6666666667, ans=0.125 2023-10-13 22:05:02,784 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1510814.6666666667, ans=0.2 2023-10-13 22:05:05,955 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1510814.6666666667, ans=0.125 2023-10-13 22:05:11,314 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.735e+02 1.864e+02 2.171e+02 3.387e+02, threshold=3.727e+02, percent-clipped=0.0 2023-10-13 22:05:12,574 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 22:05:25,519 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1510861.3333333333, ans=0.0 2023-10-13 22:05:25,593 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1510861.3333333333, ans=0.2 2023-10-13 22:05:36,313 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.71 vs. 
limit=10.0 2023-10-13 22:05:49,946 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1510954.6666666667, ans=0.2 2023-10-13 22:06:11,887 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1511048.0, ans=0.0 2023-10-13 22:06:14,976 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1511048.0, ans=0.0 2023-10-13 22:06:19,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1511094.6666666667, ans=0.2 2023-10-13 22:06:41,030 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1511141.3333333333, ans=0.09899494936611666 2023-10-13 22:06:41,872 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1511141.3333333333, ans=0.0 2023-10-13 22:06:42,816 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1511188.0, ans=0.2 2023-10-13 22:07:04,004 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1511234.6666666667, ans=0.2 2023-10-13 22:07:17,381 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.597e+02 1.834e+02 2.020e+02 2.337e+02 3.582e+02, threshold=4.040e+02, percent-clipped=0.0 2023-10-13 22:07:41,757 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1511374.6666666667, ans=0.125 2023-10-13 22:07:52,542 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1511421.3333333333, ans=0.2 2023-10-13 22:08:10,716 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1511514.6666666667, ans=0.125 2023-10-13 22:08:11,869 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=9.69 vs. limit=22.5 2023-10-13 22:08:13,709 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1511514.6666666667, ans=0.0 2023-10-13 22:08:37,157 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1511608.0, ans=0.125 2023-10-13 22:08:48,040 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.88 vs. 
limit=12.0 2023-10-13 22:09:08,999 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1511748.0, ans=0.125 2023-10-13 22:09:17,692 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.720e+02 1.869e+02 2.083e+02 3.448e+02, threshold=3.738e+02, percent-clipped=0.0 2023-10-13 22:09:22,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1511794.6666666667, ans=0.0 2023-10-13 22:09:29,220 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1511794.6666666667, ans=0.125 2023-10-13 22:09:57,577 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1511934.6666666667, ans=0.125 2023-10-13 22:09:58,744 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1511934.6666666667, ans=0.125 2023-10-13 22:10:12,574 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1511981.3333333333, ans=0.125 2023-10-13 22:10:42,547 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.03 vs. limit=6.0 2023-10-13 22:10:48,994 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1512121.3333333333, ans=0.125 2023-10-13 22:10:51,073 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.97 vs. limit=22.5 2023-10-13 22:10:51,741 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1512121.3333333333, ans=0.125 2023-10-13 22:10:58,629 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1512168.0, ans=0.2 2023-10-13 22:11:10,557 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=5.48 vs. limit=12.0 2023-10-13 22:11:13,789 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.777e+02 1.989e+02 2.271e+02 3.617e+02, threshold=3.978e+02, percent-clipped=0.0 2023-10-13 22:11:15,088 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1512214.6666666667, ans=0.1 2023-10-13 22:11:18,993 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.57 vs. limit=12.0 2023-10-13 22:11:27,580 INFO [train.py:1031] (0/4) Epoch 24, batch 10000, loss[loss=0.1705, simple_loss=0.2622, pruned_loss=0.03943, over 16455.00 frames. ], tot_loss[loss=0.1866, simple_loss=0.279, pruned_loss=0.04708, over 32646497.71 frames. 
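The scaling.py ScheduledFloat records name scalar hyperparameters (dropout probabilities, skip rates, balancer probabilities) whose current value ans is a function of batch_count; by batch_count around 1.5e6 most of the skip rates above have ramped down to their final value, which is why they print ans=0.0. An illustrative piecewise-linear stand-in, with the breakpoints (0, 0.2) -> (4000, 0) assumed purely for the example and not taken from this recipe:

class ScheduledFloatSketch:
    """Piecewise-linear value as a function of batch count (illustrative only)."""
    def __init__(self, *points, name="unnamed"):
        # points: (batch_count, value) pairs with distinct, increasing batch counts
        self.points = sorted(points)
        self.name = name

    def value(self, batch_count: float) -> float:
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        if batch_count >= pts[-1][0]:
            return pts[-1][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)

skip_rate = ScheduledFloatSketch((0.0, 0.2), (4000.0, 0.0), name="conv_skip_rate")
print(skip_rate.value(1.5e6))   # long after the ramp: 0.0, like the ans=0.0 records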
], batch size: 50, lr: 1.43e-03, grad_scale: 32.0 2023-10-13 22:11:27,882 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1512308.0, ans=0.0 2023-10-13 22:11:51,283 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1512401.3333333333, ans=0.1 2023-10-13 22:11:53,950 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 22:12:27,577 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1512541.3333333333, ans=0.2 2023-10-13 22:13:11,240 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1512681.3333333333, ans=0.125 2023-10-13 22:13:18,592 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.847e+02 1.993e+02 2.292e+02 3.385e+02, threshold=3.986e+02, percent-clipped=0.0 2023-10-13 22:13:21,541 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1512728.0, ans=0.0 2023-10-13 22:13:33,509 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1512728.0, ans=0.0 2023-10-13 22:13:35,761 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 22:13:43,163 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1512774.6666666667, ans=0.125 2023-10-13 22:14:05,488 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1512821.3333333333, ans=0.0 2023-10-13 22:14:34,015 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1512961.3333333333, ans=0.0 2023-10-13 22:14:35,354 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.76 vs. limit=15.0 2023-10-13 22:15:03,921 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1513054.6666666667, ans=0.125 2023-10-13 22:15:08,202 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 22:15:23,862 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1513148.0, ans=0.125 2023-10-13 22:15:26,021 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.844e+02 2.032e+02 2.209e+02 2.810e+02, threshold=4.065e+02, percent-clipped=0.0 2023-10-13 22:15:35,542 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.55 vs. 
limit=15.0 2023-10-13 22:16:01,678 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1513288.0, ans=0.1 2023-10-13 22:16:17,038 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1513334.6666666667, ans=0.125 2023-10-13 22:16:24,177 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1513381.3333333333, ans=0.125 2023-10-13 22:16:34,985 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1513428.0, ans=0.1 2023-10-13 22:16:42,394 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1513428.0, ans=0.125 2023-10-13 22:16:50,778 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1513474.6666666667, ans=0.125 2023-10-13 22:16:50,993 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.20 vs. limit=6.0 2023-10-13 22:17:31,795 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.436e+02 1.773e+02 1.902e+02 2.049e+02 2.632e+02, threshold=3.803e+02, percent-clipped=0.0 2023-10-13 22:17:52,886 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1513708.0, ans=0.1 2023-10-13 22:18:23,989 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1513848.0, ans=0.125 2023-10-13 22:18:26,733 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1513848.0, ans=0.0 2023-10-13 22:19:03,901 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=10.04 vs. 
limit=12.0 2023-10-13 22:19:04,559 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 22:19:10,977 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1513988.0, ans=0.0 2023-10-13 22:19:31,917 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1514081.3333333333, ans=0.2 2023-10-13 22:19:35,582 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.484e+02 1.795e+02 1.975e+02 2.243e+02 3.397e+02, threshold=3.949e+02, percent-clipped=0.0 2023-10-13 22:19:53,514 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 22:19:55,628 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1514174.6666666667, ans=0.125 2023-10-13 22:19:56,384 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1514174.6666666667, ans=0.1 2023-10-13 22:20:09,540 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1514221.3333333333, ans=0.2 2023-10-13 22:20:17,839 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1514268.0, ans=0.1 2023-10-13 22:20:21,899 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1514268.0, ans=0.0 2023-10-13 22:20:47,273 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.12 vs. limit=15.0 2023-10-13 22:20:47,982 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1514361.3333333333, ans=0.1 2023-10-13 22:21:17,304 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.89 vs. limit=15.0 2023-10-13 22:21:23,347 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1514501.3333333333, ans=0.1 2023-10-13 22:21:29,752 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 22:21:35,847 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1514548.0, ans=0.2 2023-10-13 22:21:40,639 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.753e+02 1.869e+02 2.082e+02 2.888e+02, threshold=3.739e+02, percent-clipped=0.0 2023-10-13 22:21:41,891 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1514548.0, ans=0.125 2023-10-13 22:21:41,994 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1514548.0, ans=0.0 2023-10-13 22:21:57,553 INFO [train.py:1031] (0/4) Epoch 24, batch 10500, loss[loss=0.1954, simple_loss=0.2942, pruned_loss=0.04834, over 16936.00 frames. ], tot_loss[loss=0.1872, simple_loss=0.2796, pruned_loss=0.04736, over 32664977.50 frames. 
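The Whitening records compare a whiteness statistic of a module's output covariance (metric) against a scheduled limit, and the module only applies its correction when the metric exceeds the limit, which is why most records read "metric=M vs. limit=L" with M below L. One plausible statistic, assuming metric = num_channels * tr(C^2) / tr(C)^2, which equals 1.0 for an isotropic covariance and grows as variance concentrates in a few directions (the formula is an assumption for illustration, not lifted from scaling.py):

import torch

def whitening_metric(x: torch.Tensor) -> torch.Tensor:
    """x: (num_frames, num_channels) activations for one whitening group."""
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.t() @ x) / x.shape[0]                        # channel covariance C
    num_channels = x.shape[1]
    return num_channels * (cov @ cov).diagonal().sum() / cov.diagonal().sum() ** 2

x = torch.randn(1000, 256)                                # near-white activations
print(float(whitening_metric(x)))                         # close to 1.0, far below limit=22.5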
], batch size: 165, lr: 1.43e-03, grad_scale: 32.0 2023-10-13 22:21:59,836 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1514641.3333333333, ans=0.0 2023-10-13 22:22:02,962 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 22:22:15,028 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1514688.0, ans=0.0 2023-10-13 22:22:17,923 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.30 vs. limit=15.0 2023-10-13 22:22:17,981 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.34 vs. limit=15.0 2023-10-13 22:22:51,165 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1514828.0, ans=0.0 2023-10-13 22:22:54,578 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1514828.0, ans=0.05 2023-10-13 22:23:07,414 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1514874.6666666667, ans=0.07 2023-10-13 22:23:33,096 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1514968.0, ans=0.125 2023-10-13 22:23:40,384 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.09 vs. limit=15.0 2023-10-13 22:23:46,715 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1515014.6666666667, ans=0.1 2023-10-13 22:23:53,868 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.596e+02 1.828e+02 1.976e+02 2.160e+02 3.338e+02, threshold=3.953e+02, percent-clipped=0.0 2023-10-13 22:24:29,471 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys.whitening_limit, batch_count=1515154.6666666667, ans=6.0 2023-10-13 22:24:34,954 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 22:24:35,400 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.13 vs. limit=6.0 2023-10-13 22:25:16,787 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. limit=6.0 2023-10-13 22:25:22,247 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=1515341.3333333333, ans=15.0 2023-10-13 22:25:41,738 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.87 vs. limit=15.0 2023-10-13 22:25:42,610 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.15 vs. 
limit=15.0 2023-10-13 22:26:00,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1515481.3333333333, ans=0.125 2023-10-13 22:26:05,804 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.842e+02 1.948e+02 2.162e+02 2.869e+02, threshold=3.895e+02, percent-clipped=0.0 2023-10-13 22:26:24,530 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1515574.6666666667, ans=0.125 2023-10-13 22:26:42,026 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1515621.3333333333, ans=0.1 2023-10-13 22:26:48,182 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1515621.3333333333, ans=0.125 2023-10-13 22:27:03,273 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1515714.6666666667, ans=0.2 2023-10-13 22:27:09,307 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1515714.6666666667, ans=0.2 2023-10-13 22:27:09,478 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=5.54 vs. limit=15.0 2023-10-13 22:27:28,006 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1515808.0, ans=0.2 2023-10-13 22:27:30,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1515808.0, ans=0.125 2023-10-13 22:27:38,116 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1515808.0, ans=0.125 2023-10-13 22:27:42,423 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1515854.6666666667, ans=0.1 2023-10-13 22:28:04,134 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 22:28:04,176 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1515901.3333333333, ans=10.0 2023-10-13 22:28:04,358 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.41 vs. limit=15.0 2023-10-13 22:28:16,284 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.546e+02 1.901e+02 2.142e+02 2.481e+02 3.748e+02, threshold=4.284e+02, percent-clipped=0.0 2023-10-13 22:28:22,136 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1515994.6666666667, ans=0.1 2023-10-13 22:28:27,227 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.62 vs. 
limit=15.0 2023-10-13 22:28:38,273 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1516041.3333333333, ans=0.0 2023-10-13 22:28:54,330 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1516134.6666666667, ans=0.125 2023-10-13 22:28:56,426 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1516134.6666666667, ans=0.1 2023-10-13 22:29:02,251 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.71 vs. limit=15.0 2023-10-13 22:29:06,738 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.62 vs. limit=15.0 2023-10-13 22:29:52,719 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1516321.3333333333, ans=0.125 2023-10-13 22:29:57,213 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1516368.0, ans=0.0 2023-10-13 22:30:15,855 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.751e+02 1.974e+02 2.144e+02 2.984e+02, threshold=3.947e+02, percent-clipped=0.0 2023-10-13 22:30:30,914 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1516508.0, ans=0.2 2023-10-13 22:30:52,855 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1516601.3333333333, ans=0.125 2023-10-13 22:31:00,574 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.21 vs. limit=15.0 2023-10-13 22:31:05,223 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1516648.0, ans=0.0 2023-10-13 22:31:20,937 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.48 vs. limit=15.0 2023-10-13 22:31:37,872 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.87 vs. limit=15.0 2023-10-13 22:32:05,567 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1516881.3333333333, ans=0.0 2023-10-13 22:32:09,932 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.774e+02 1.959e+02 2.223e+02 3.325e+02, threshold=3.919e+02, percent-clipped=0.0 2023-10-13 22:32:18,078 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1516928.0, ans=0.0 2023-10-13 22:32:18,900 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1516928.0, ans=0.0 2023-10-13 22:32:24,128 INFO [train.py:1031] (0/4) Epoch 24, batch 11000, loss[loss=0.2399, simple_loss=0.3141, pruned_loss=0.08288, over 15718.00 frames. ], tot_loss[loss=0.1872, simple_loss=0.2796, pruned_loss=0.0474, over 32698571.70 frames. 
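The balancer entries (min_positive, max_abs, min_abs, prob) describe per-channel activation constraints: for example, a channel should be positive at least min_positive of the time and keep statistics such as mean |x| below max_abs, with prob giving how often the constraint is checked. A sketch of the measurement side only, under those assumed semantics (the corrective gradient a real balancer injects during backprop is omitted):

import torch

def balancer_violations(x, min_positive=0.05, max_abs=10.0):
    """x: (num_frames, num_channels). Returns boolean masks of offending channels."""
    frac_positive = (x > 0).float().mean(dim=0)           # share of positive values
    mean_abs = x.abs().mean(dim=0)                        # per-channel mean magnitude
    return frac_positive < min_positive, mean_abs > max_abs

x = torch.randn(500, 384)
too_negative, too_large = balancer_violations(x)
print(int(too_negative.sum()), int(too_large.sum()))      # expect 0, 0 for N(0, 1) input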
], batch size: 350, lr: 1.42e-03, grad_scale: 32.0 2023-10-13 22:32:36,958 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.51 vs. limit=6.0 2023-10-13 22:32:37,736 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1517021.3333333333, ans=0.125 2023-10-13 22:32:41,814 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.30 vs. limit=15.0 2023-10-13 22:32:58,685 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=4.14 vs. limit=12.0 2023-10-13 22:33:29,204 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.11 vs. limit=15.0 2023-10-13 22:33:31,231 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1517208.0, ans=0.125 2023-10-13 22:33:31,778 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.39 vs. limit=15.0 2023-10-13 22:34:09,672 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.566e+02 1.852e+02 1.978e+02 2.339e+02 3.502e+02, threshold=3.957e+02, percent-clipped=0.0 2023-10-13 22:34:15,711 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.95 vs. limit=15.0 2023-10-13 22:34:24,269 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1517441.3333333333, ans=0.125 2023-10-13 22:34:38,912 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1517488.0, ans=0.07 2023-10-13 22:34:51,088 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.08 vs. 
limit=15.0 2023-10-13 22:34:54,689 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1517534.6666666667, ans=0.125 2023-10-13 22:34:56,698 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1517534.6666666667, ans=0.0 2023-10-13 22:34:59,234 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1517534.6666666667, ans=0.1 2023-10-13 22:35:14,845 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1517581.3333333333, ans=0.125 2023-10-13 22:36:18,563 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.447e+02 1.716e+02 1.878e+02 2.081e+02 2.828e+02, threshold=3.757e+02, percent-clipped=0.0 2023-10-13 22:36:27,659 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1517861.3333333333, ans=0.0 2023-10-13 22:36:28,646 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=1517861.3333333333, ans=0.1 2023-10-13 22:36:36,107 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1517908.0, ans=0.1 2023-10-13 22:36:56,959 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=1518001.3333333333, ans=0.05 2023-10-13 22:36:58,862 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.48 vs. limit=22.5 2023-10-13 22:37:18,521 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.06 vs. limit=15.0 2023-10-13 22:37:22,502 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1518094.6666666667, ans=0.125 2023-10-13 22:37:38,708 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.62 vs. limit=15.0 2023-10-13 22:37:40,147 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.20 vs. 
limit=15.0 2023-10-13 22:37:47,857 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1518188.0, ans=0.125 2023-10-13 22:37:47,950 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1518188.0, ans=0.09899494936611666 2023-10-13 22:37:58,930 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_na.min_abs, batch_count=1518234.6666666667, ans=0.02 2023-10-13 22:38:24,768 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.486e+02 1.816e+02 1.944e+02 2.162e+02 2.848e+02, threshold=3.887e+02, percent-clipped=0.0 2023-10-13 22:38:29,714 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1518328.0, ans=0.125 2023-10-13 22:38:37,362 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1518328.0, ans=0.0 2023-10-13 22:38:37,450 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1518328.0, ans=0.125 2023-10-13 22:38:43,991 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.92 vs. limit=15.0 2023-10-13 22:38:47,552 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1518374.6666666667, ans=0.1 2023-10-13 22:38:59,553 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1518421.3333333333, ans=0.0 2023-10-13 22:39:01,549 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1518421.3333333333, ans=0.0 2023-10-13 22:39:08,856 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=4.11 vs. limit=12.0 2023-10-13 22:39:48,906 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1518608.0, ans=0.07 2023-10-13 22:39:55,428 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.66 vs. limit=15.0 2023-10-13 22:40:21,549 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.45 vs. limit=15.0 2023-10-13 22:40:37,536 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.805e+02 1.963e+02 2.115e+02 3.367e+02, threshold=3.925e+02, percent-clipped=0.0 2023-10-13 22:40:51,476 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1518794.6666666667, ans=0.0 2023-10-13 22:40:59,997 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1518841.3333333333, ans=0.125 2023-10-13 22:41:17,302 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.81 vs. 
limit=10.0 2023-10-13 22:41:40,875 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1518981.3333333333, ans=0.125 2023-10-13 22:41:47,783 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1519028.0, ans=0.0 2023-10-13 22:41:49,193 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1519028.0, ans=0.125 2023-10-13 22:41:58,625 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1519028.0, ans=0.2 2023-10-13 22:41:59,591 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1519028.0, ans=0.0 2023-10-13 22:42:20,872 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.63 vs. limit=15.0 2023-10-13 22:42:26,189 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1519121.3333333333, ans=0.5 2023-10-13 22:42:28,507 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1519121.3333333333, ans=0.025 2023-10-13 22:42:44,659 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.56 vs. limit=15.0 2023-10-13 22:42:48,551 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1519214.6666666667, ans=0.2 2023-10-13 22:42:57,098 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.694e+02 2.072e+02 2.262e+02 2.562e+02 3.409e+02, threshold=4.525e+02, percent-clipped=0.0 2023-10-13 22:43:01,513 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1519261.3333333333, ans=0.0 2023-10-13 22:43:11,329 INFO [train.py:1031] (0/4) Epoch 24, batch 11500, loss[loss=0.2146, simple_loss=0.3003, pruned_loss=0.06442, over 16068.00 frames. ], tot_loss[loss=0.1871, simple_loss=0.2792, pruned_loss=0.04745, over 32706017.92 frames. ], batch size: 296, lr: 1.42e-03, grad_scale: 16.0 2023-10-13 22:43:25,139 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.67 vs. limit=15.0 2023-10-13 22:43:33,897 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1519401.3333333333, ans=0.125 2023-10-13 22:44:10,229 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.44 vs. 
limit=15.0 2023-10-13 22:44:22,309 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1519541.3333333333, ans=0.125 2023-10-13 22:44:39,778 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1519634.6666666667, ans=0.125 2023-10-13 22:45:03,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1519681.3333333333, ans=0.1 2023-10-13 22:45:03,843 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.711e+02 1.857e+02 2.023e+02 2.645e+02, threshold=3.715e+02, percent-clipped=0.0 2023-10-13 22:45:08,019 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.97 vs. limit=10.0 2023-10-13 22:45:10,932 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1519728.0, ans=0.0 2023-10-13 22:45:16,329 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1519728.0, ans=0.125 2023-10-13 22:45:23,205 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 22:45:28,772 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1519774.6666666667, ans=0.05 2023-10-13 22:45:53,890 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1519868.0, ans=0.125 2023-10-13 22:46:02,776 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1519914.6666666667, ans=0.0 2023-10-13 22:46:09,010 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1519961.3333333333, ans=0.125 2023-10-13 22:46:26,354 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1520008.0, ans=0.125 2023-10-13 22:46:46,337 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1520101.3333333333, ans=0.125 2023-10-13 22:46:51,485 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.65 vs. 
limit=22.5 2023-10-13 22:47:02,879 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1520148.0, ans=0.1 2023-10-13 22:47:03,951 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 22:47:06,216 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.836e+02 2.015e+02 2.307e+02 3.757e+02, threshold=4.031e+02, percent-clipped=1.0 2023-10-13 22:47:10,695 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=1520194.6666666667, ans=0.025 2023-10-13 22:47:15,687 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1520194.6666666667, ans=0.125 2023-10-13 22:47:43,093 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1520334.6666666667, ans=0.125 2023-10-13 22:47:45,047 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1520334.6666666667, ans=0.0 2023-10-13 22:47:53,512 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1520334.6666666667, ans=0.05 2023-10-13 22:48:05,340 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.30 vs. limit=15.0 2023-10-13 22:48:18,482 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.68 vs. limit=22.5 2023-10-13 22:48:36,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1520521.3333333333, ans=0.125 2023-10-13 22:48:37,897 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.92 vs. 
limit=15.0 2023-10-13 22:49:13,667 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1520614.6666666667, ans=0.125 2023-10-13 22:49:16,225 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.804e+02 1.912e+02 2.236e+02 2.756e+02, threshold=3.825e+02, percent-clipped=0.0 2023-10-13 22:49:17,874 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1520661.3333333333, ans=0.1 2023-10-13 22:49:56,736 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 22:50:00,715 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1520801.3333333333, ans=0.0 2023-10-13 22:50:24,254 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1520894.6666666667, ans=0.0 2023-10-13 22:50:50,429 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1520988.0, ans=0.2 2023-10-13 22:50:51,581 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1520988.0, ans=0.125 2023-10-13 22:50:58,306 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=3.44 vs. limit=15.0 2023-10-13 22:51:01,216 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1521034.6666666667, ans=0.2 2023-10-13 22:51:15,365 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1521081.3333333333, ans=0.125 2023-10-13 22:51:20,933 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1521081.3333333333, ans=0.125 2023-10-13 22:51:23,642 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.566e+02 1.773e+02 1.904e+02 2.073e+02 2.644e+02, threshold=3.809e+02, percent-clipped=0.0 2023-10-13 22:51:32,500 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.13 vs. limit=22.5 2023-10-13 22:51:35,105 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1521128.0, ans=0.125 2023-10-13 22:51:51,638 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.73 vs. 
limit=15.0 2023-10-13 22:52:53,823 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1521408.0, ans=0.0 2023-10-13 22:52:53,925 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1521408.0, ans=0.125 2023-10-13 22:53:01,636 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1521454.6666666667, ans=0.1 2023-10-13 22:53:01,689 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 22:53:12,277 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1521501.3333333333, ans=0.125 2023-10-13 22:53:25,601 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=12.17 vs. limit=15.0 2023-10-13 22:53:27,754 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1521548.0, ans=0.125 2023-10-13 22:53:29,791 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1521548.0, ans=0.125 2023-10-13 22:53:32,623 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.796e+02 1.987e+02 2.373e+02 3.644e+02, threshold=3.974e+02, percent-clipped=0.0 2023-10-13 22:53:35,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1521594.6666666667, ans=0.125 2023-10-13 22:53:36,295 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1521594.6666666667, ans=0.1 2023-10-13 22:53:38,312 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1521594.6666666667, ans=0.2 2023-10-13 22:53:46,448 INFO [train.py:1031] (0/4) Epoch 24, batch 12000, loss[loss=0.1827, simple_loss=0.2472, pruned_loss=0.05913, over 12419.00 frames. ], tot_loss[loss=0.1869, simple_loss=0.2793, pruned_loss=0.04719, over 32743750.41 frames. 
], batch size: 440, lr: 1.42e-03, grad_scale: 32.0 2023-10-13 22:53:48,581 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1521641.3333333333, ans=0.0 2023-10-13 22:53:51,675 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1521641.3333333333, ans=0.125 2023-10-13 22:54:09,927 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1521734.6666666667, ans=0.0 2023-10-13 22:54:11,877 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1521734.6666666667, ans=0.125 2023-10-13 22:54:25,972 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1521781.3333333333, ans=0.1 2023-10-13 22:54:40,201 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1521828.0, ans=0.1 2023-10-13 22:54:41,325 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1521828.0, ans=0.0 2023-10-13 22:54:43,066 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1521828.0, ans=0.125 2023-10-13 22:54:49,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1521874.6666666667, ans=0.05 2023-10-13 22:54:52,270 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.10 vs. limit=15.0 2023-10-13 22:54:55,382 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1521874.6666666667, ans=0.0 2023-10-13 22:55:15,395 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.91 vs. limit=15.0 2023-10-13 22:55:24,236 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1522014.6666666667, ans=0.0 2023-10-13 22:55:26,468 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1522014.6666666667, ans=0.0 2023-10-13 22:55:31,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1522014.6666666667, ans=0.125 2023-10-13 22:55:32,505 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.790e+02 2.061e+02 2.349e+02 3.231e+02, threshold=4.123e+02, percent-clipped=0.0 2023-10-13 22:55:55,920 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1522108.0, ans=0.1 2023-10-13 22:56:06,336 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.31 vs. limit=15.0 2023-10-13 22:56:59,356 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1522388.0, ans=0.0 2023-10-13 22:57:06,897 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.65 vs. 
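The per-batch loss and the running tot_loss reported in the train.py lines are both frame-weighted averages; the fractional cumulative frame counts (e.g. "over 32743750.41 frames") suggest that older batches are geometrically down-weighted rather than simply summed. A small sketch under that assumption (icefall keeps similar statistics in a MetricsTracker; the decay factor here is illustrative):

```python
# Hedged sketch of the frame-weighted running average behind
# "loss[..., over N frames], tot_loss[..., over M frames]".
class RunningLoss:
    def __init__(self, decay: float = 0.999):
        self.decay = decay    # geometric forgetting of older batches (assumed)
        self.loss_sum = 0.0   # decayed sum of loss * frames
        self.frames = 0.0     # decayed sum of frames -> fractional, as logged

    def update(self, loss: float, num_frames: int):
        self.loss_sum = self.decay * self.loss_sum + loss * num_frames
        self.frames = self.decay * self.frames + num_frames

    @property
    def tot_loss(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)
```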
limit=22.5 2023-10-13 22:57:12,199 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1522434.6666666667, ans=0.125 2023-10-13 22:57:14,288 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1522434.6666666667, ans=0.125 2023-10-13 22:57:14,325 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1522434.6666666667, ans=0.0 2023-10-13 22:57:16,408 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1522434.6666666667, ans=0.2 2023-10-13 22:57:20,855 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.57 vs. limit=15.0 2023-10-13 22:57:30,400 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1522481.3333333333, ans=0.04949747468305833 2023-10-13 22:57:32,699 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.496e+02 1.840e+02 2.002e+02 2.278e+02 2.944e+02, threshold=4.004e+02, percent-clipped=0.0 2023-10-13 22:57:36,268 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1522528.0, ans=0.0 2023-10-13 22:57:45,797 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1522574.6666666667, ans=0.125 2023-10-13 22:57:48,760 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1522574.6666666667, ans=0.125 2023-10-13 22:58:06,250 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1522621.3333333333, ans=0.125 2023-10-13 22:58:09,488 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1522621.3333333333, ans=0.0 2023-10-13 22:58:09,500 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1522621.3333333333, ans=0.1 2023-10-13 22:58:27,461 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1522714.6666666667, ans=0.0 2023-10-13 22:58:29,335 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.01 vs. limit=15.0 2023-10-13 22:59:16,785 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.25 vs. 
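The balancer fields recurring in these entries (min_positive, max_positive, min_abs, max_abs, prob) parameterize constraints on per-channel activation statistics: roughly, the fraction of positive activations and the mean absolute value are nudged back into a box whenever they drift outside it, applied with probability prob per batch. A sketch of the statistics being constrained, with the gradient-manipulation machinery of the real balancers in scaling.py omitted:

```python
# Hedged sketch: compute which channels violate a balancer's box.  The
# thresholds mirror values seen in the log (min_positive=0.05,
# max_positive=0.95, min_abs=0.5, max_abs=10.0) but are illustrative.
import torch

def balancer_violations(x: torch.Tensor, min_positive=0.05, max_positive=0.95,
                        min_abs=0.5, max_abs=10.0):
    # x: (num_frames, num_channels)
    frac_pos = (x > 0).float().mean(dim=0)   # per-channel fraction positive
    mean_abs = x.abs().mean(dim=0)           # per-channel mean magnitude
    return {
        "too_few_positive": frac_pos < min_positive,
        "too_many_positive": frac_pos > max_positive,
        "too_small": mean_abs < min_abs,
        "too_large": mean_abs > max_abs,
    }
```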
limit=6.0 2023-10-13 22:59:22,931 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1522901.3333333333, ans=0.025 2023-10-13 22:59:23,810 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1522901.3333333333, ans=0.125 2023-10-13 22:59:35,423 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1522948.0, ans=0.125 2023-10-13 22:59:36,076 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.554e+02 1.827e+02 1.975e+02 2.172e+02 3.021e+02, threshold=3.950e+02, percent-clipped=0.0 2023-10-13 22:59:39,986 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 22:59:48,289 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.72 vs. limit=22.5 2023-10-13 22:59:51,019 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.81 vs. limit=15.0 2023-10-13 22:59:56,264 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=1523041.3333333333, ans=0.5 2023-10-13 22:59:57,157 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1523041.3333333333, ans=0.95 2023-10-13 23:00:05,926 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1523088.0, ans=0.125 2023-10-13 23:00:32,348 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1523181.3333333333, ans=0.1 2023-10-13 23:01:12,445 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.65 vs. limit=6.0 2023-10-13 23:01:25,979 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.46 vs. limit=22.5 2023-10-13 23:01:39,496 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.799e+02 1.936e+02 2.110e+02 2.856e+02, threshold=3.873e+02, percent-clipped=0.0 2023-10-13 23:01:48,724 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1523461.3333333333, ans=0.125 2023-10-13 23:02:08,727 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1523554.6666666667, ans=0.0 2023-10-13 23:02:12,358 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.61 vs. limit=15.0 2023-10-13 23:02:34,793 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.14 vs. limit=22.5 2023-10-13 23:02:40,803 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.22 vs. 
limit=12.0 2023-10-13 23:02:46,972 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.70 vs. limit=6.0 2023-10-13 23:02:54,619 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1523741.3333333333, ans=0.125 2023-10-13 23:03:00,270 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.49 vs. limit=22.5 2023-10-13 23:03:03,350 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1523788.0, ans=0.125 2023-10-13 23:03:22,087 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1523834.6666666667, ans=0.2 2023-10-13 23:03:38,698 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.516e+02 1.859e+02 2.002e+02 2.187e+02 2.889e+02, threshold=4.004e+02, percent-clipped=0.0 2023-10-13 23:03:41,640 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=1523928.0, ans=0.05 2023-10-13 23:03:45,855 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.89 vs. limit=15.0 2023-10-13 23:03:46,618 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1523928.0, ans=0.125 2023-10-13 23:03:52,071 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1523974.6666666667, ans=0.0 2023-10-13 23:03:52,670 INFO [train.py:1031] (0/4) Epoch 24, batch 12500, loss[loss=0.2476, simple_loss=0.3102, pruned_loss=0.09249, over 15650.00 frames. ], tot_loss[loss=0.1868, simple_loss=0.2789, pruned_loss=0.04732, over 32737616.16 frames. ], batch size: 350, lr: 1.42e-03, grad_scale: 8.0 2023-10-13 23:03:53,393 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.48 vs. limit=22.5 2023-10-13 23:03:55,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1523974.6666666667, ans=0.0 2023-10-13 23:04:00,451 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1523974.6666666667, ans=0.1 2023-10-13 23:04:27,073 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1524114.6666666667, ans=0.125 2023-10-13 23:04:29,181 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 23:05:06,151 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.57 vs. limit=15.0 2023-10-13 23:05:17,683 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1524301.3333333333, ans=0.125 2023-10-13 23:05:20,261 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.93 vs. 
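The Whitening entries compare a statistic of an activation's channel covariance against a limit, with a corrective penalty applied only when metric exceeds limit. One natural whiteness statistic, assumed here purely for illustration (the exact definition in scaling.py's Whiten module may differ), is the ratio of the mean squared covariance eigenvalue to the squared mean eigenvalue, which equals 1.0 for perfectly white features and grows as the spectrum spreads:

```python
# Hedged sketch of a "metric vs. limit" whiteness statistic.  The grouping
# mirrors the num_groups/num_channels fields in the log (e.g. num_groups=8
# for whiten_keys); aggregating by the worst group is an assumption.
import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    # x: (num_frames, num_channels), num_channels divisible by num_groups
    n, c = x.shape
    x = x.reshape(n, num_groups, c // num_groups)
    metrics = []
    for g in range(num_groups):
        xg = x[:, g, :]
        xg = xg - xg.mean(dim=0, keepdim=True)
        cov = (xg.T @ xg) / n                 # channel covariance
        d = cov.shape[0]
        # mean(eig^2) / mean(eig)^2 == d * trace(cov @ cov) / trace(cov)^2
        num = torch.trace(cov @ cov) / d
        den = (torch.trace(cov) / d) ** 2
        metrics.append((num / den).item())
    return max(metrics)
```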
limit=15.0 2023-10-13 23:05:22,873 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1524301.3333333333, ans=0.125 2023-10-13 23:05:38,144 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=1524348.0, ans=10.0 2023-10-13 23:05:40,137 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.486e+02 1.805e+02 1.914e+02 2.061e+02 2.800e+02, threshold=3.829e+02, percent-clipped=0.0 2023-10-13 23:05:40,546 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1524394.6666666667, ans=0.125 2023-10-13 23:05:46,134 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1524394.6666666667, ans=0.125 2023-10-13 23:06:00,828 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1524441.3333333333, ans=0.0 2023-10-13 23:06:03,797 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1524488.0, ans=0.0 2023-10-13 23:06:20,439 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1524534.6666666667, ans=0.125 2023-10-13 23:06:25,435 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1524534.6666666667, ans=0.07 2023-10-13 23:06:48,436 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1524628.0, ans=0.125 2023-10-13 23:07:01,492 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1524674.6666666667, ans=0.1 2023-10-13 23:07:33,718 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1524814.6666666667, ans=0.125 2023-10-13 23:07:36,140 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1524861.3333333333, ans=0.1 2023-10-13 23:07:36,965 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.863e+02 2.040e+02 2.295e+02 3.466e+02, threshold=4.080e+02, percent-clipped=0.0 2023-10-13 23:07:54,225 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.27 vs. limit=10.0 2023-10-13 23:08:03,969 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.59 vs. limit=15.0 2023-10-13 23:08:22,612 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.34 vs. 
limit=15.0 2023-10-13 23:08:29,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1525048.0, ans=0.1 2023-10-13 23:09:08,420 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1525188.0, ans=0.125 2023-10-13 23:09:10,773 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.31 vs. limit=15.0 2023-10-13 23:09:11,875 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1525188.0, ans=0.035 2023-10-13 23:09:13,141 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1525234.6666666667, ans=0.07 2023-10-13 23:09:15,938 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1525234.6666666667, ans=0.1 2023-10-13 23:09:39,093 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1525328.0, ans=0.125 2023-10-13 23:09:39,659 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.523e+02 1.851e+02 1.996e+02 2.248e+02 4.535e+02, threshold=3.993e+02, percent-clipped=1.0 2023-10-13 23:10:45,038 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.96 vs. limit=10.0 2023-10-13 23:11:01,746 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1525608.0, ans=0.0 2023-10-13 23:11:02,920 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 23:11:14,284 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1525654.6666666667, ans=0.0 2023-10-13 23:11:14,525 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.62 vs. limit=15.0 2023-10-13 23:11:15,363 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1525654.6666666667, ans=0.125 2023-10-13 23:11:23,284 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1525701.3333333333, ans=0.1 2023-10-13 23:11:47,691 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1525794.6666666667, ans=0.2 2023-10-13 23:11:48,377 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.799e+02 1.946e+02 2.116e+02 3.019e+02, threshold=3.892e+02, percent-clipped=0.0 2023-10-13 23:11:53,144 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 23:11:53,603 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.56 vs. 
limit=12.0 2023-10-13 23:12:38,263 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff3.min_abs, batch_count=1525981.3333333333, ans=0.2 2023-10-13 23:12:43,933 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1525981.3333333333, ans=0.125 2023-10-13 23:13:27,101 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1526168.0, ans=0.125 2023-10-13 23:13:46,397 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1526214.6666666667, ans=0.125 2023-10-13 23:13:48,491 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.568e+02 1.829e+02 2.058e+02 2.276e+02 3.037e+02, threshold=4.115e+02, percent-clipped=0.0 2023-10-13 23:13:56,249 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1526261.3333333333, ans=0.125 2023-10-13 23:13:57,804 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1526261.3333333333, ans=0.1 2023-10-13 23:13:59,262 INFO [train.py:1031] (0/4) Epoch 24, batch 13000, loss[loss=0.1799, simple_loss=0.2733, pruned_loss=0.04328, over 16545.00 frames. ], tot_loss[loss=0.1871, simple_loss=0.2796, pruned_loss=0.04733, over 32768803.40 frames. ], batch size: 241, lr: 1.42e-03, grad_scale: 16.0 2023-10-13 23:14:02,802 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1526308.0, ans=0.125 2023-10-13 23:14:31,298 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1526401.3333333333, ans=0.0 2023-10-13 23:14:33,767 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1526401.3333333333, ans=0.0 2023-10-13 23:14:41,975 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1526448.0, ans=0.125 2023-10-13 23:14:56,871 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1526494.6666666667, ans=0.125 2023-10-13 23:15:18,583 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.94 vs. 
limit=15.0 2023-10-13 23:15:24,653 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1526588.0, ans=0.0 2023-10-13 23:15:28,223 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 23:15:34,545 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1526588.0, ans=0.125 2023-10-13 23:15:49,194 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1526681.3333333333, ans=0.0 2023-10-13 23:16:03,333 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.589e+02 1.815e+02 2.009e+02 2.199e+02 3.102e+02, threshold=4.018e+02, percent-clipped=0.0 2023-10-13 23:16:03,607 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1526728.0, ans=0.125 2023-10-13 23:16:11,582 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1526728.0, ans=0.125 2023-10-13 23:16:33,768 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=7.54 vs. limit=22.5 2023-10-13 23:16:35,565 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1526821.3333333333, ans=0.125 2023-10-13 23:16:42,493 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.84 vs. limit=15.0 2023-10-13 23:17:17,483 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1527008.0, ans=0.0 2023-10-13 23:17:38,718 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1527101.3333333333, ans=0.2 2023-10-13 23:17:43,593 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1527101.3333333333, ans=0.2 2023-10-13 23:17:45,567 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1527101.3333333333, ans=0.0 2023-10-13 23:17:47,876 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1527101.3333333333, ans=0.125 2023-10-13 23:17:53,338 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=1527148.0, ans=10.0 2023-10-13 23:17:57,688 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.55 vs. 
limit=15.0 2023-10-13 23:18:09,857 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.769e+02 1.979e+02 2.273e+02 3.106e+02, threshold=3.958e+02, percent-clipped=0.0 2023-10-13 23:18:10,146 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1527194.6666666667, ans=0.0 2023-10-13 23:18:25,921 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1527241.3333333333, ans=0.125 2023-10-13 23:18:37,876 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.02 vs. limit=15.0 2023-10-13 23:19:04,106 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.05 vs. limit=15.0 2023-10-13 23:19:06,197 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.44 vs. limit=15.0 2023-10-13 23:19:26,663 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.56 vs. limit=22.5 2023-10-13 23:19:39,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1527521.3333333333, ans=0.125 2023-10-13 23:19:48,723 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1527568.0, ans=0.2 2023-10-13 23:19:50,645 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1527568.0, ans=0.125 2023-10-13 23:20:01,974 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=1527614.6666666667, ans=0.05 2023-10-13 23:20:09,479 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.473e+02 1.830e+02 1.986e+02 2.279e+02 2.838e+02, threshold=3.972e+02, percent-clipped=0.0 2023-10-13 23:20:14,182 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1527661.3333333333, ans=0.125 2023-10-13 23:20:26,327 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1527708.0, ans=0.1 2023-10-13 23:20:27,253 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1527708.0, ans=0.125 2023-10-13 23:21:02,009 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1527848.0, ans=0.0 2023-10-13 23:21:08,701 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1527894.6666666667, ans=0.125 2023-10-13 23:21:11,523 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1527894.6666666667, ans=0.0 2023-10-13 23:21:20,917 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1527941.3333333333, ans=0.1 2023-10-13 23:21:22,202 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, 
metric=8.13 vs. limit=15.0 2023-10-13 23:21:33,856 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1527988.0, ans=0.5 2023-10-13 23:21:58,875 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1528081.3333333333, ans=0.125 2023-10-13 23:22:08,522 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.796e+02 1.955e+02 2.122e+02 6.659e+02, threshold=3.910e+02, percent-clipped=1.0 2023-10-13 23:22:18,308 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1528174.6666666667, ans=0.125 2023-10-13 23:22:20,287 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1528174.6666666667, ans=0.1 2023-10-13 23:22:20,578 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=6.90 vs. limit=15.0 2023-10-13 23:22:35,166 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1528221.3333333333, ans=0.125 2023-10-13 23:22:39,447 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1528221.3333333333, ans=0.2 2023-10-13 23:22:43,146 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1528268.0, ans=0.125 2023-10-13 23:22:47,107 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.55 vs. limit=10.0 2023-10-13 23:22:48,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1528268.0, ans=0.0 2023-10-13 23:22:55,042 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1528314.6666666667, ans=0.2 2023-10-13 23:23:23,113 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1528408.0, ans=0.1 2023-10-13 23:23:26,168 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1528408.0, ans=0.1 2023-10-13 23:23:27,141 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1528408.0, ans=0.125 2023-10-13 23:23:41,144 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1528501.3333333333, ans=0.0 2023-10-13 23:24:00,673 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1528548.0, ans=0.0 2023-10-13 23:24:01,710 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1528548.0, ans=0.125 2023-10-13 23:24:02,034 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.11 vs. 
limit=22.5 2023-10-13 23:24:05,138 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.557e+02 1.735e+02 1.892e+02 2.089e+02 3.838e+02, threshold=3.784e+02, percent-clipped=0.0 2023-10-13 23:24:15,557 INFO [train.py:1031] (0/4) Epoch 24, batch 13500, loss[loss=0.1757, simple_loss=0.2736, pruned_loss=0.0389, over 16926.00 frames. ], tot_loss[loss=0.1868, simple_loss=0.2791, pruned_loss=0.04728, over 32773739.79 frames. ], batch size: 104, lr: 1.42e-03, grad_scale: 16.0 2023-10-13 23:24:34,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1528688.0, ans=0.1 2023-10-13 23:25:47,931 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.57 vs. limit=15.0 2023-10-13 23:25:49,709 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1529014.6666666667, ans=0.125 2023-10-13 23:25:50,859 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1529014.6666666667, ans=0.0 2023-10-13 23:26:03,727 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.566e+02 1.809e+02 1.977e+02 2.163e+02 3.315e+02, threshold=3.955e+02, percent-clipped=0.0 2023-10-13 23:26:05,955 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1529061.3333333333, ans=0.0 2023-10-13 23:26:24,196 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.05 vs. limit=15.0 2023-10-13 23:26:46,071 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1529248.0, ans=0.125 2023-10-13 23:27:12,307 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/epoch-24.pt 2023-10-13 23:27:42,143 INFO [train.py:1031] (0/4) Epoch 25, batch 0, loss[loss=0.1583, simple_loss=0.2542, pruned_loss=0.03124, over 16933.00 frames. ], tot_loss[loss=0.1583, simple_loss=0.2542, pruned_loss=0.03124, over 16933.00 frames. ], batch size: 77, lr: 1.39e-03, grad_scale: 32.0 2023-10-13 23:27:42,145 INFO [train.py:1054] (0/4) Computing validation loss 2023-10-13 23:27:51,632 INFO [train.py:1063] (0/4) Epoch 25, validation: loss=0.2131, simple_loss=0.2998, pruned_loss=0.06319, over 1020973.00 frames. 2023-10-13 23:27:51,632 INFO [train.py:1064] (0/4) Maximum memory allocated so far is 17165MB 2023-10-13 23:28:28,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1529504.6666666667, ans=0.0 2023-10-13 23:28:30,193 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.57 vs. limit=15.0 2023-10-13 23:28:36,187 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.770e+02 1.984e+02 2.294e+02 3.341e+02, threshold=3.968e+02, percent-clipped=0.0 2023-10-13 23:28:57,532 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.72 vs. 
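The "Maximum memory allocated so far is 17165MB" line printed with each validation pass comes from PyTorch's CUDA peak-memory counter, converted to MB:

```python
# torch.cuda.max_memory_allocated returns the peak in bytes since process
# start (or since the last reset_peak_memory_stats); the log reports MB.
import torch

def peak_mb(device="cuda:0") -> int:
    return torch.cuda.max_memory_allocated(device) // (1024 * 1024)
```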
limit=15.0 2023-10-13 23:29:25,581 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1529691.3333333333, ans=0.125 2023-10-13 23:29:39,360 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1529738.0, ans=0.125 2023-10-13 23:29:46,005 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.84 vs. limit=10.0 2023-10-13 23:30:04,948 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.03 vs. limit=22.5 2023-10-13 23:30:23,968 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.77 vs. limit=12.0 2023-10-13 23:30:24,675 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1529971.3333333333, ans=0.04949747468305833 2023-10-13 23:30:33,625 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.508e+02 1.761e+02 1.852e+02 2.014e+02 2.887e+02, threshold=3.703e+02, percent-clipped=0.0 2023-10-13 23:30:58,158 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=4.132e-02 2023-10-13 23:31:18,281 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.33 vs. limit=15.0 2023-10-13 23:31:18,781 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1530158.0, ans=0.1 2023-10-13 23:31:27,117 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1530204.6666666667, ans=0.2 2023-10-13 23:31:39,793 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1530251.3333333333, ans=0.1 2023-10-13 23:32:21,773 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1530438.0, ans=0.125 2023-10-13 23:32:23,824 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1530438.0, ans=0.2 2023-10-13 23:32:24,676 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1530438.0, ans=0.2 2023-10-13 23:32:25,136 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.591e+02 1.826e+02 1.997e+02 2.261e+02 3.100e+02, threshold=3.993e+02, percent-clipped=0.0 2023-10-13 23:32:49,674 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1530531.3333333333, ans=0.125 2023-10-13 23:32:52,556 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1530531.3333333333, ans=0.2 2023-10-13 23:33:01,101 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1530578.0, ans=0.1 2023-10-13 23:33:11,257 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.47 vs. 
limit=15.0 2023-10-13 23:33:11,797 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=1530624.6666666667, ans=0.5 2023-10-13 23:33:18,429 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-328000.pt 2023-10-13 23:33:28,744 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.62 vs. limit=6.0 2023-10-13 23:33:39,229 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1530718.0, ans=0.2 2023-10-13 23:33:41,362 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1530718.0, ans=0.0 2023-10-13 23:33:41,399 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1530718.0, ans=0.125 2023-10-13 23:33:57,773 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1530811.3333333333, ans=0.0 2023-10-13 23:33:59,504 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.78 vs. limit=15.0 2023-10-13 23:34:25,846 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1530904.6666666667, ans=0.1 2023-10-13 23:34:26,195 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.65 vs. limit=10.0 2023-10-13 23:34:26,844 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1530904.6666666667, ans=0.125 2023-10-13 23:34:27,445 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.817e+02 1.969e+02 2.186e+02 3.001e+02, threshold=3.938e+02, percent-clipped=0.0 2023-10-13 23:34:37,998 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1530951.3333333333, ans=0.035 2023-10-13 23:34:45,769 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1530998.0, ans=0.125 2023-10-13 23:35:06,445 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1531091.3333333333, ans=0.125 2023-10-13 23:35:51,437 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1531278.0, ans=0.2 2023-10-13 23:35:51,497 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1531278.0, ans=0.125 2023-10-13 23:36:11,720 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1531324.6666666667, ans=0.125 2023-10-13 23:36:18,968 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1531371.3333333333, ans=0.125 2023-10-13 23:36:19,092 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1531371.3333333333, ans=0.1 2023-10-13 23:36:21,133 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.587e+02 1.855e+02 
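Two checkpoint flavours appear in this span: epoch-24.pt, written at the epoch boundary above, and checkpoint-328000.pt, keyed by the global training-batch index (328000 is divisible by common save_every_n periods such as 8000). A hedged sketch of the naming logic, with the path handling and the period as illustrative assumptions rather than icefall's actual checkpoint.py:

```python
# Hedged sketch of the two checkpoint names seen in the log.
from pathlib import Path
import torch

def maybe_save(model, exp_dir: Path, batch_idx_train: int,
               save_every_n: int = 8000, end_of_epoch=None):
    if end_of_epoch is not None:
        # e.g. zipformer/exp_XL_bpe/epoch-24.pt
        torch.save(model.state_dict(), exp_dir / f"epoch-{end_of_epoch}.pt")
    elif batch_idx_train % save_every_n == 0:
        # e.g. zipformer/exp_XL_bpe/checkpoint-328000.pt
        torch.save(model.state_dict(),
                   exp_dir / f"checkpoint-{batch_idx_train}.pt")
```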
2.081e+02 2.292e+02 3.776e+02, threshold=4.162e+02, percent-clipped=0.0 2023-10-13 23:36:24,368 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1531418.0, ans=0.125 2023-10-13 23:36:28,146 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1531418.0, ans=0.2 2023-10-13 23:36:44,828 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.83 vs. limit=15.0 2023-10-13 23:36:49,238 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1531464.6666666667, ans=0.125 2023-10-13 23:36:52,913 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=5.10 vs. limit=15.0 2023-10-13 23:37:05,803 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=9.84 vs. limit=22.5 2023-10-13 23:37:16,660 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.38 vs. limit=15.0 2023-10-13 23:37:30,505 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1531651.3333333333, ans=0.1 2023-10-13 23:37:36,993 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.74 vs. limit=15.0 2023-10-13 23:37:39,768 INFO [train.py:1031] (0/4) Epoch 25, batch 500, loss[loss=0.1898, simple_loss=0.2832, pruned_loss=0.04823, over 16909.00 frames. ], tot_loss[loss=0.188, simple_loss=0.2799, pruned_loss=0.04805, over 7293891.17 frames. ], batch size: 130, lr: 1.39e-03, grad_scale: 16.0 2023-10-13 23:37:46,517 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1531698.0, ans=0.2 2023-10-13 23:37:46,531 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1531698.0, ans=0.0 2023-10-13 23:37:53,818 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 23:37:57,063 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1531744.6666666667, ans=0.125 2023-10-13 23:38:07,397 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1531791.3333333333, ans=0.0 2023-10-13 23:38:18,361 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1531838.0, ans=0.125 2023-10-13 23:38:21,799 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.537e+02 1.814e+02 1.952e+02 2.226e+02 3.293e+02, threshold=3.904e+02, percent-clipped=0.0 2023-10-13 23:38:47,668 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.00 vs. 
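The learning rate drops from lr: 1.42e-03 during epoch 24 to lr: 1.39e-03 at the start of epoch 25, consistent with an Eden-style schedule that decays smoothly in both batch count and epoch. A sketch of that functional form, with base_lr, lr_batches and lr_epochs as illustrative values rather than ones read from this run:

```python
# Hedged sketch of an Eden-style LR schedule: two quartic-root decay
# factors, one in batches and one in epochs, applied to a base LR.
def eden_lr(base_lr: float, batch: int, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 1.0) -> float:
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor
```

With a schedule of this shape, the batch factor is nearly flat a million-plus batches into training, so the small epoch-boundary step (1.42e-03 to 1.39e-03) is dominated by the epoch term, which matches what the log shows.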
limit=15.0 2023-10-13 23:39:02,575 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1531978.0, ans=0.125 2023-10-13 23:39:55,029 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1532211.3333333333, ans=0.1 2023-10-13 23:40:23,044 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.878e+02 2.021e+02 2.199e+02 3.014e+02, threshold=4.042e+02, percent-clipped=0.0 2023-10-13 23:41:05,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1532491.3333333333, ans=0.1 2023-10-13 23:41:05,567 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.67 vs. limit=15.0 2023-10-13 23:41:11,081 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.80 vs. limit=15.0 2023-10-13 23:41:20,752 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1532538.0, ans=0.0 2023-10-13 23:41:23,598 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1532538.0, ans=0.2 2023-10-13 23:41:23,870 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.77 vs. limit=15.0 2023-10-13 23:41:41,110 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.36 vs. limit=15.0 2023-10-13 23:41:49,286 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1532678.0, ans=0.125 2023-10-13 23:41:53,302 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1532678.0, ans=0.1 2023-10-13 23:41:55,162 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1532678.0, ans=0.0 2023-10-13 23:41:55,829 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 23:41:57,332 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.81 vs. limit=15.0 2023-10-13 23:42:06,061 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.44 vs. limit=22.5 2023-10-13 23:42:22,728 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.625e+02 1.865e+02 2.046e+02 2.235e+02 2.852e+02, threshold=4.091e+02, percent-clipped=0.0 2023-10-13 23:42:39,185 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.35 vs. 
limit=15.0 2023-10-13 23:42:43,639 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1532864.6666666667, ans=0.125 2023-10-13 23:42:48,608 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1532911.3333333333, ans=0.1 2023-10-13 23:43:06,139 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1532958.0, ans=0.125 2023-10-13 23:43:40,015 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1533098.0, ans=0.125 2023-10-13 23:43:40,906 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1533098.0, ans=0.125 2023-10-13 23:43:53,644 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 23:44:21,120 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1533238.0, ans=0.125 2023-10-13 23:44:21,197 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1533238.0, ans=0.0 2023-10-13 23:44:26,367 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.438e+02 1.769e+02 1.917e+02 2.098e+02 2.654e+02, threshold=3.833e+02, percent-clipped=0.0 2023-10-13 23:44:35,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1533284.6666666667, ans=0.0 2023-10-13 23:44:39,830 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1533331.3333333333, ans=0.125 2023-10-13 23:44:58,640 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1533378.0, ans=0.125 2023-10-13 23:45:09,303 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1533424.6666666667, ans=0.125 2023-10-13 23:45:09,404 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1533424.6666666667, ans=0.0 2023-10-13 23:45:09,435 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1533424.6666666667, ans=0.125 2023-10-13 23:45:12,818 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=5.11 vs. 
limit=12.0 2023-10-13 23:45:16,029 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1533471.3333333333, ans=0.125 2023-10-13 23:45:29,815 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1533518.0, ans=0.07 2023-10-13 23:45:33,844 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.max_abs, batch_count=1533518.0, ans=10.0 2023-10-13 23:45:52,948 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1533611.3333333333, ans=0.125 2023-10-13 23:45:57,129 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1533611.3333333333, ans=0.125 2023-10-13 23:46:26,710 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.779e+02 1.966e+02 2.153e+02 2.792e+02, threshold=3.931e+02, percent-clipped=0.0 2023-10-13 23:46:34,523 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1533751.3333333333, ans=0.125 2023-10-13 23:46:43,203 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1533798.0, ans=0.0 2023-10-13 23:46:52,927 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1533844.6666666667, ans=0.125 2023-10-13 23:46:54,059 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1533844.6666666667, ans=0.125 2023-10-13 23:47:17,415 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1533938.0, ans=0.125 2023-10-13 23:47:17,492 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1533938.0, ans=0.125 2023-10-13 23:47:21,072 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.79 vs. limit=10.0 2023-10-13 23:47:25,958 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1533984.6666666667, ans=0.1 2023-10-13 23:47:29,403 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1533984.6666666667, ans=0.1 2023-10-13 23:47:38,746 INFO [train.py:1031] (0/4) Epoch 25, batch 1000, loss[loss=0.1811, simple_loss=0.2767, pruned_loss=0.04277, over 15932.00 frames. ], tot_loss[loss=0.1886, simple_loss=0.2808, pruned_loss=0.0482, over 12938027.37 frames. ], batch size: 43, lr: 1.39e-03, grad_scale: 16.0 2023-10-13 23:47:40,796 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1534031.3333333333, ans=0.125 2023-10-13 23:47:51,295 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.00 vs. 
limit=15.0 2023-10-13 23:47:51,976 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1534078.0, ans=0.0 2023-10-13 23:48:06,483 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1534124.6666666667, ans=0.125 2023-10-13 23:48:11,251 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.90 vs. limit=15.0 2023-10-13 23:48:11,331 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-10-13 23:48:21,121 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.711e+02 1.879e+02 2.079e+02 2.784e+02, threshold=3.757e+02, percent-clipped=0.0 2023-10-13 23:48:34,128 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.83 vs. limit=10.0 2023-10-13 23:48:40,840 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1534264.6666666667, ans=0.0 2023-10-13 23:48:40,859 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1534264.6666666667, ans=0.0 2023-10-13 23:48:54,159 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.92 vs. limit=12.0 2023-10-13 23:48:56,871 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1534358.0, ans=0.2 2023-10-13 23:49:11,772 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.39 vs. limit=15.0 2023-10-13 23:49:20,546 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1534451.3333333333, ans=0.1 2023-10-13 23:49:55,962 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1534591.3333333333, ans=0.0 2023-10-13 23:50:11,892 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.48 vs. 
limit=15.0 2023-10-13 23:50:19,105 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.540e+02 1.776e+02 1.889e+02 2.086e+02 2.979e+02, threshold=3.778e+02, percent-clipped=0.0 2023-10-13 23:50:42,540 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1534778.0, ans=0.125 2023-10-13 23:50:47,423 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1534778.0, ans=0.0 2023-10-13 23:51:33,906 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-13 23:51:43,999 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1534964.6666666667, ans=0.125 2023-10-13 23:52:19,699 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1535104.6666666667, ans=0.125 2023-10-13 23:52:23,391 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1535104.6666666667, ans=0.0 2023-10-13 23:52:27,785 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.396e+02 1.775e+02 2.001e+02 2.469e+02 4.532e+02, threshold=4.002e+02, percent-clipped=2.0 2023-10-13 23:52:42,719 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-13 23:52:45,387 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 23:52:54,484 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1535244.6666666667, ans=0.125 2023-10-13 23:52:56,499 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1535244.6666666667, ans=0.0 2023-10-13 23:53:07,512 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1535291.3333333333, ans=0.0 2023-10-13 23:53:14,287 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1535338.0, ans=0.125 2023-10-13 23:53:14,463 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=6.46 vs. 
limit=15.0 2023-10-13 23:53:28,295 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1535384.6666666667, ans=0.1 2023-10-13 23:54:15,170 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1535571.3333333333, ans=0.1 2023-10-13 23:54:15,990 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1535571.3333333333, ans=0.1 2023-10-13 23:54:23,967 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.789e+02 1.944e+02 2.132e+02 3.188e+02, threshold=3.888e+02, percent-clipped=0.0 2023-10-13 23:54:24,362 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1535618.0, ans=0.0 2023-10-13 23:54:26,187 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-13 23:54:50,499 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.85 vs. limit=15.0 2023-10-13 23:54:51,951 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1535711.3333333333, ans=0.125 2023-10-13 23:55:04,767 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1535758.0, ans=0.0 2023-10-13 23:55:10,835 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.89 vs. limit=15.0 2023-10-13 23:55:14,812 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.76 vs. limit=10.0 2023-10-13 23:55:18,214 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1535804.6666666667, ans=0.125 2023-10-13 23:55:23,280 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1535851.3333333333, ans=0.0 2023-10-13 23:55:37,751 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-13 23:55:40,985 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=6.87 vs. 
limit=15.0 2023-10-13 23:55:46,078 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1535944.6666666667, ans=0.1 2023-10-13 23:55:56,922 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1535944.6666666667, ans=0.09899494936611666 2023-10-13 23:56:03,005 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1535991.3333333333, ans=0.0 2023-10-13 23:56:06,747 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1535991.3333333333, ans=0.0 2023-10-13 23:56:06,812 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1535991.3333333333, ans=0.125 2023-10-13 23:56:21,403 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1536038.0, ans=0.125 2023-10-13 23:56:22,118 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.445e+02 1.763e+02 1.946e+02 2.213e+02 3.108e+02, threshold=3.892e+02, percent-clipped=0.0 2023-10-13 23:56:22,686 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1536084.6666666667, ans=0.125 2023-10-13 23:56:26,371 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.46 vs. limit=15.0 2023-10-13 23:56:32,511 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1536084.6666666667, ans=0.1 2023-10-13 23:56:56,010 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1536178.0, ans=15.0 2023-10-13 23:56:56,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1536178.0, ans=0.05 2023-10-13 23:57:24,866 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1536318.0, ans=0.0 2023-10-13 23:57:38,182 INFO [train.py:1031] (0/4) Epoch 25, batch 1500, loss[loss=0.194, simple_loss=0.2823, pruned_loss=0.05286, over 16620.00 frames. ], tot_loss[loss=0.187, simple_loss=0.2791, pruned_loss=0.04746, over 17328034.13 frames. ], batch size: 219, lr: 1.39e-03, grad_scale: 32.0 2023-10-13 23:57:41,216 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1536364.6666666667, ans=0.125 2023-10-13 23:57:51,056 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.19 vs. 
limit=22.5 2023-10-13 23:58:22,934 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.783e+02 1.904e+02 2.076e+02 2.792e+02, threshold=3.808e+02, percent-clipped=0.0 2023-10-13 23:58:28,276 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1536551.3333333333, ans=0.09899494936611666 2023-10-13 23:58:42,247 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1536598.0, ans=0.1 2023-10-13 23:58:54,346 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1536644.6666666667, ans=0.1 2023-10-13 23:59:16,993 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1536738.0, ans=0.125 2023-10-13 23:59:23,526 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1536738.0, ans=0.2 2023-10-13 23:59:26,743 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1536784.6666666667, ans=0.125 2023-10-13 23:59:35,320 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1536784.6666666667, ans=0.125 2023-10-13 23:59:35,335 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1536784.6666666667, ans=0.125 2023-10-13 23:59:36,242 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1536831.3333333333, ans=0.0 2023-10-13 23:59:39,512 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten.whitening_limit, batch_count=1536831.3333333333, ans=15.0 2023-10-14 00:00:12,336 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1536971.3333333333, ans=0.1 2023-10-14 00:00:17,498 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.15 vs. limit=10.0 2023-10-14 00:00:19,863 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1536971.3333333333, ans=0.125 2023-10-14 00:00:25,042 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.499e+02 1.773e+02 1.945e+02 2.171e+02 3.592e+02, threshold=3.890e+02, percent-clipped=0.0 2023-10-14 00:00:45,768 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1537064.6666666667, ans=0.2 2023-10-14 00:00:57,565 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1537111.3333333333, ans=0.125 2023-10-14 00:01:23,427 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.12 vs. 
limit=10.0 2023-10-14 00:01:25,148 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1537204.6666666667, ans=0.125 2023-10-14 00:01:28,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1537251.3333333333, ans=0.05 2023-10-14 00:01:28,644 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1537251.3333333333, ans=0.2 2023-10-14 00:01:33,465 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1537251.3333333333, ans=0.125 2023-10-14 00:01:46,665 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1537298.0, ans=0.125 2023-10-14 00:01:50,396 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.74 vs. limit=15.0 2023-10-14 00:02:07,535 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1537391.3333333333, ans=0.1 2023-10-14 00:02:07,586 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1537391.3333333333, ans=0.04949747468305833 2023-10-14 00:02:09,531 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1537391.3333333333, ans=0.125 2023-10-14 00:02:12,052 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1537391.3333333333, ans=0.1 2023-10-14 00:02:18,528 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1537438.0, ans=0.125 2023-10-14 00:02:23,078 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1537438.0, ans=0.0 2023-10-14 00:02:26,909 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1537484.6666666667, ans=0.0 2023-10-14 00:02:27,488 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.517e+02 1.895e+02 2.057e+02 2.323e+02 3.446e+02, threshold=4.114e+02, percent-clipped=0.0 2023-10-14 00:02:40,831 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1537531.3333333333, ans=0.0 2023-10-14 00:03:44,169 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 00:03:52,669 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1537811.3333333333, ans=0.125 2023-10-14 00:04:01,990 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1537858.0, ans=0.1 2023-10-14 00:04:14,233 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1537904.6666666667, ans=0.035 2023-10-14 00:04:14,535 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.83 vs. 
limit=22.5 2023-10-14 00:04:20,853 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1537904.6666666667, ans=0.1 2023-10-14 00:04:27,299 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.484e+02 1.743e+02 1.876e+02 2.012e+02 2.826e+02, threshold=3.753e+02, percent-clipped=0.0 2023-10-14 00:04:33,134 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1537951.3333333333, ans=0.125 2023-10-14 00:04:49,620 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.39 vs. limit=15.0 2023-10-14 00:04:59,613 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1538044.6666666667, ans=0.0 2023-10-14 00:06:13,598 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1538371.3333333333, ans=0.1 2023-10-14 00:06:28,810 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1538418.0, ans=0.2 2023-10-14 00:06:29,717 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.575e+02 1.812e+02 1.977e+02 2.245e+02 3.927e+02, threshold=3.953e+02, percent-clipped=1.0 2023-10-14 00:06:41,476 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1538418.0, ans=0.2 2023-10-14 00:06:45,986 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1538464.6666666667, ans=0.125 2023-10-14 00:06:53,227 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.75 vs. limit=15.0 2023-10-14 00:07:04,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1538511.3333333333, ans=0.125 2023-10-14 00:07:04,315 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1538511.3333333333, ans=0.0 2023-10-14 00:07:06,722 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1538511.3333333333, ans=0.1 2023-10-14 00:07:10,334 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1538558.0, ans=0.0 2023-10-14 00:07:48,629 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.97 vs. limit=22.5 2023-10-14 00:07:49,259 INFO [train.py:1031] (0/4) Epoch 25, batch 2000, loss[loss=0.1813, simple_loss=0.265, pruned_loss=0.04882, over 15278.00 frames. ], tot_loss[loss=0.1868, simple_loss=0.2793, pruned_loss=0.04716, over 20759977.62 frames. ], batch size: 35, lr: 1.38e-03, grad_scale: 32.0 2023-10-14 00:07:56,256 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.06 vs. 
limit=6.0 2023-10-14 00:08:01,459 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1538744.6666666667, ans=0.2 2023-10-14 00:08:25,784 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1538791.3333333333, ans=0.1 2023-10-14 00:08:37,681 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.529e-03 2023-10-14 00:08:49,323 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.578e+02 1.798e+02 1.934e+02 2.110e+02 2.952e+02, threshold=3.867e+02, percent-clipped=0.0 2023-10-14 00:09:15,450 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1538978.0, ans=0.0 2023-10-14 00:09:16,315 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1538978.0, ans=0.2 2023-10-14 00:09:21,416 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1538978.0, ans=0.125 2023-10-14 00:09:31,924 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1539024.6666666667, ans=0.125 2023-10-14 00:09:47,605 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1539071.3333333333, ans=0.0 2023-10-14 00:10:59,088 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1539258.0, ans=0.125 2023-10-14 00:11:01,592 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1539258.0, ans=0.0 2023-10-14 00:11:03,341 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=5.04 vs. limit=15.0 2023-10-14 00:11:20,797 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.757e+02 2.000e+02 2.181e+02 3.208e+02, threshold=4.001e+02, percent-clipped=0.0 2023-10-14 00:11:22,772 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 00:11:28,967 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.37 vs. limit=22.5 2023-10-14 00:11:50,313 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.53 vs. limit=22.5 2023-10-14 00:11:52,846 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1539444.6666666667, ans=0.0 2023-10-14 00:12:04,370 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1539491.3333333333, ans=0.0 2023-10-14 00:12:07,306 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1539491.3333333333, ans=0.2 2023-10-14 00:12:07,601 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.57 vs. 
limit=10.0 2023-10-14 00:12:12,785 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1539538.0, ans=0.1 2023-10-14 00:12:17,858 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.15 vs. limit=22.5 2023-10-14 00:12:23,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1539584.6666666667, ans=0.0 2023-10-14 00:12:33,731 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1539631.3333333333, ans=0.125 2023-10-14 00:12:50,249 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.84 vs. limit=10.0 2023-10-14 00:12:52,018 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1539678.0, ans=0.09899494936611666 2023-10-14 00:12:55,268 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.73 vs. limit=6.0 2023-10-14 00:13:08,115 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1539771.3333333333, ans=0.2 2023-10-14 00:13:17,966 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1539818.0, ans=0.2 2023-10-14 00:13:20,778 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.532e+02 1.797e+02 1.953e+02 2.198e+02 2.698e+02, threshold=3.906e+02, percent-clipped=0.0 2023-10-14 00:13:27,396 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.21 vs. 
limit=10.0 2023-10-14 00:13:29,860 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1539864.6666666667, ans=0.035 2023-10-14 00:13:40,247 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1539864.6666666667, ans=0.1 2023-10-14 00:13:51,897 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1539958.0, ans=0.125 2023-10-14 00:14:49,241 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1540144.6666666667, ans=0.1 2023-10-14 00:15:18,770 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.598e+02 1.863e+02 2.021e+02 2.260e+02 2.919e+02, threshold=4.043e+02, percent-clipped=0.0 2023-10-14 00:15:24,111 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1540284.6666666667, ans=0.0 2023-10-14 00:15:24,242 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1540284.6666666667, ans=0.125 2023-10-14 00:15:26,968 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1540331.3333333333, ans=0.125 2023-10-14 00:15:31,177 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1540331.3333333333, ans=0.1 2023-10-14 00:15:33,353 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1540331.3333333333, ans=0.125 2023-10-14 00:15:34,308 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1540331.3333333333, ans=0.125 2023-10-14 00:15:38,975 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1540331.3333333333, ans=0.0 2023-10-14 00:15:53,699 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1540424.6666666667, ans=0.2 2023-10-14 00:15:55,581 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1540424.6666666667, ans=0.2 2023-10-14 00:15:57,376 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1540424.6666666667, ans=0.125 2023-10-14 00:15:58,484 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1540424.6666666667, ans=0.0 2023-10-14 00:16:00,771 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1540424.6666666667, ans=0.125 2023-10-14 00:16:00,888 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1540424.6666666667, ans=0.2 2023-10-14 00:16:04,037 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.50 vs. 
limit=22.5 2023-10-14 00:16:21,461 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1540518.0, ans=0.125 2023-10-14 00:16:23,574 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1540518.0, ans=0.125 2023-10-14 00:16:27,688 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1540564.6666666667, ans=0.125 2023-10-14 00:16:47,004 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.44 vs. limit=15.0 2023-10-14 00:16:49,014 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1540658.0, ans=0.0 2023-10-14 00:16:58,730 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1540658.0, ans=0.125 2023-10-14 00:17:07,678 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1540704.6666666667, ans=0.125 2023-10-14 00:17:14,282 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.615e+02 1.887e+02 2.023e+02 2.268e+02 3.456e+02, threshold=4.046e+02, percent-clipped=0.0 2023-10-14 00:17:45,939 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1540844.6666666667, ans=0.125 2023-10-14 00:18:09,139 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1540938.0, ans=0.0 2023-10-14 00:18:10,063 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1540984.6666666667, ans=0.1 2023-10-14 00:18:21,376 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1541031.3333333333, ans=0.125 2023-10-14 00:18:22,043 INFO [train.py:1031] (0/4) Epoch 25, batch 2500, loss[loss=0.187, simple_loss=0.2801, pruned_loss=0.04696, over 16608.00 frames. ], tot_loss[loss=0.1871, simple_loss=0.2795, pruned_loss=0.04739, over 23426732.79 frames. ], batch size: 241, lr: 1.38e-03, grad_scale: 32.0 2023-10-14 00:18:36,806 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.43 vs. limit=12.0 2023-10-14 00:18:40,392 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.61 vs. 
limit=6.0 2023-10-14 00:18:42,123 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1541078.0, ans=0.0 2023-10-14 00:18:57,652 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1541171.3333333333, ans=0.0 2023-10-14 00:19:00,464 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1541171.3333333333, ans=0.1 2023-10-14 00:19:04,861 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1541218.0, ans=0.0 2023-10-14 00:19:05,201 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.95 vs. limit=15.0 2023-10-14 00:19:09,207 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.587e+02 1.818e+02 1.984e+02 2.207e+02 3.452e+02, threshold=3.968e+02, percent-clipped=0.0 2023-10-14 00:19:37,575 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.85 vs. limit=15.0 2023-10-14 00:19:42,345 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0 2023-10-14 00:19:44,754 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1541358.0, ans=0.2 2023-10-14 00:19:48,346 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1541404.6666666667, ans=0.2 2023-10-14 00:20:00,179 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1541451.3333333333, ans=0.2 2023-10-14 00:20:04,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1541451.3333333333, ans=0.125 2023-10-14 00:20:08,996 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1541451.3333333333, ans=0.2 2023-10-14 00:20:14,540 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1541498.0, ans=0.125 2023-10-14 00:20:36,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1541591.3333333333, ans=0.07 2023-10-14 00:20:42,586 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.90 vs. limit=22.5 2023-10-14 00:21:01,773 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.577e+02 1.793e+02 1.951e+02 2.216e+02 3.118e+02, threshold=3.901e+02, percent-clipped=0.0 2023-10-14 00:21:22,696 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.14 vs. limit=15.0 2023-10-14 00:21:27,028 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1541778.0, ans=0.1 2023-10-14 00:21:31,230 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.55 vs. 
limit=10.0 2023-10-14 00:21:33,637 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1541824.6666666667, ans=0.125 2023-10-14 00:21:36,279 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1541824.6666666667, ans=0.1 2023-10-14 00:21:42,077 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1541824.6666666667, ans=0.2 2023-10-14 00:22:27,469 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1542011.3333333333, ans=0.125 2023-10-14 00:22:29,484 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1542011.3333333333, ans=0.125 2023-10-14 00:22:33,586 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1542011.3333333333, ans=0.125 2023-10-14 00:22:37,434 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1542058.0, ans=0.125 2023-10-14 00:22:43,692 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.25 vs. limit=10.0 2023-10-14 00:22:54,384 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1542104.6666666667, ans=0.125 2023-10-14 00:23:02,232 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.573e+02 1.831e+02 2.019e+02 2.257e+02 3.678e+02, threshold=4.037e+02, percent-clipped=0.0 2023-10-14 00:23:02,874 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1542151.3333333333, ans=0.1 2023-10-14 00:23:13,423 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1542198.0, ans=0.0 2023-10-14 00:24:12,672 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.47 vs. 
limit=12.0 2023-10-14 00:24:57,614 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1542524.6666666667, ans=0.0 2023-10-14 00:25:14,346 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=1542571.3333333333, ans=0.95 2023-10-14 00:25:20,756 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.365e+02 1.692e+02 1.896e+02 2.139e+02 3.227e+02, threshold=3.792e+02, percent-clipped=0.0 2023-10-14 00:25:24,363 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1542618.0, ans=0.125 2023-10-14 00:25:32,601 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1542664.6666666667, ans=0.0 2023-10-14 00:27:01,226 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1542991.3333333333, ans=0.09899494936611666 2023-10-14 00:27:12,060 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1543038.0, ans=0.125 2023-10-14 00:27:27,696 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.816e+02 2.044e+02 2.232e+02 3.404e+02, threshold=4.088e+02, percent-clipped=0.0 2023-10-14 00:27:27,984 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1543084.6666666667, ans=0.125 2023-10-14 00:27:35,035 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1543084.6666666667, ans=0.125 2023-10-14 00:27:54,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1543178.0, ans=0.0 2023-10-14 00:28:13,131 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1543271.3333333333, ans=0.125 2023-10-14 00:28:22,185 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1543318.0, ans=0.125 2023-10-14 00:28:33,810 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1543364.6666666667, ans=0.125 2023-10-14 00:28:34,432 INFO [train.py:1031] (0/4) Epoch 25, batch 3000, loss[loss=0.2006, simple_loss=0.2889, pruned_loss=0.05613, over 16967.00 frames. ], tot_loss[loss=0.1867, simple_loss=0.2788, pruned_loss=0.04729, over 25510174.50 frames. 
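The train.py progress records in this log report two losses: loss[...] is measured on the current batch, while tot_loss[...] is a running average weighted by the number of frames seen so far (note the frame count growing from roughly 12.9M to 29.3M across this stretch of the log). As a minimal sketch of how such a frame-weighted running average could be maintained (a hypothetical stand-in, not icefall's actual MetricsTracker):

    class RunningFrameWeightedLoss:
        """Hypothetical frame-weighted running loss (a sketch, not icefall's MetricsTracker)."""

        def __init__(self, decay: float = 0.999):
            # decay < 1.0 slowly forgets old batches; 1.0 would give a pure cumulative average.
            self.decay = decay
            self.weighted_loss_sum = 0.0
            self.frame_sum = 0.0

        def update(self, batch_loss: float, batch_frames: float) -> None:
            self.weighted_loss_sum = self.decay * self.weighted_loss_sum + batch_loss * batch_frames
            self.frame_sum = self.decay * self.frame_sum + batch_frames

        @property
        def value(self) -> float:
            return self.weighted_loss_sum / max(self.frame_sum, 1.0)

With a decay close to 1.0 the average moves slowly between the 500-batch progress lines, which is consistent with the smooth drift of tot_loss from 0.1886 at batch 1000 to 0.1867 at batch 3000 in this excerpt.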
], batch size: 130, lr: 1.38e-03, grad_scale: 32.0 2023-10-14 00:28:59,836 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1543458.0, ans=10.0 2023-10-14 00:29:05,286 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1543458.0, ans=0.125 2023-10-14 00:29:11,180 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1543504.6666666667, ans=0.2 2023-10-14 00:29:15,139 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1543504.6666666667, ans=0.125 2023-10-14 00:29:24,389 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.538e+02 1.780e+02 1.981e+02 2.163e+02 2.755e+02, threshold=3.962e+02, percent-clipped=0.0 2023-10-14 00:29:47,745 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1543644.6666666667, ans=0.07 2023-10-14 00:30:05,324 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1543691.3333333333, ans=0.125 2023-10-14 00:30:10,962 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1543738.0, ans=0.125 2023-10-14 00:30:20,089 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1543738.0, ans=0.0 2023-10-14 00:30:25,628 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1543784.6666666667, ans=0.125 2023-10-14 00:30:33,982 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1543784.6666666667, ans=0.125 2023-10-14 00:30:57,554 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.30 vs. limit=15.0 2023-10-14 00:31:01,941 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.38 vs. limit=15.0 2023-10-14 00:31:09,836 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.33 vs. limit=10.0 2023-10-14 00:31:16,748 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1543971.3333333333, ans=0.2 2023-10-14 00:31:19,819 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1543971.3333333333, ans=0.125 2023-10-14 00:31:32,418 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.796e+02 1.954e+02 2.120e+02 2.721e+02, threshold=3.909e+02, percent-clipped=0.0 2023-10-14 00:31:43,648 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.19 vs. 
limit=12.0 2023-10-14 00:31:46,570 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1544064.6666666667, ans=0.125 2023-10-14 00:31:56,629 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1544111.3333333333, ans=0.2 2023-10-14 00:32:16,242 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1544204.6666666667, ans=0.125 2023-10-14 00:32:25,883 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.72 vs. limit=12.0 2023-10-14 00:32:26,967 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1544251.3333333333, ans=0.1 2023-10-14 00:32:38,342 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.52 vs. limit=15.0 2023-10-14 00:32:45,497 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.07 vs. limit=22.5 2023-10-14 00:33:20,580 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.64 vs. limit=15.0 2023-10-14 00:33:35,832 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.508e+02 1.785e+02 1.924e+02 2.119e+02 2.818e+02, threshold=3.847e+02, percent-clipped=0.0 2023-10-14 00:33:51,291 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1544531.3333333333, ans=0.05 2023-10-14 00:34:08,414 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1544578.0, ans=0.2 2023-10-14 00:34:24,151 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1544624.6666666667, ans=0.125 2023-10-14 00:34:43,777 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1544718.0, ans=0.125 2023-10-14 00:34:46,082 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.54 vs. 
limit=15.0 2023-10-14 00:34:59,454 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1544764.6666666667, ans=0.0 2023-10-14 00:35:24,103 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1544904.6666666667, ans=0.125 2023-10-14 00:35:27,362 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1544904.6666666667, ans=0.125 2023-10-14 00:35:36,839 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1544951.3333333333, ans=0.125 2023-10-14 00:35:41,177 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.556e+02 1.813e+02 1.968e+02 2.203e+02 3.239e+02, threshold=3.936e+02, percent-clipped=0.0 2023-10-14 00:35:45,102 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1544951.3333333333, ans=0.1 2023-10-14 00:35:46,453 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=4.71 vs. limit=15.0 2023-10-14 00:35:51,746 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1544998.0, ans=0.125 2023-10-14 00:35:56,311 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1544998.0, ans=0.0 2023-10-14 00:36:04,186 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.96 vs. limit=15.0 2023-10-14 00:36:59,213 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1545231.3333333333, ans=0.125 2023-10-14 00:37:04,878 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1545278.0, ans=0.0 2023-10-14 00:37:43,098 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.580e+02 1.782e+02 1.887e+02 2.042e+02 2.713e+02, threshold=3.774e+02, percent-clipped=0.0 2023-10-14 00:37:46,154 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1545418.0, ans=0.125 2023-10-14 00:37:53,273 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1545464.6666666667, ans=0.125 2023-10-14 00:38:10,421 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1545511.3333333333, ans=0.2 2023-10-14 00:38:10,539 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.93 vs. limit=15.0 2023-10-14 00:38:12,354 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1545558.0, ans=0.125 2023-10-14 00:38:40,234 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1545651.3333333333, ans=0.1 2023-10-14 00:38:50,343 INFO [train.py:1031] (0/4) Epoch 25, batch 3500, loss[loss=0.1973, simple_loss=0.2881, pruned_loss=0.05329, over 16902.00 frames. 
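The bulk of the records here come from scaling.py:199 and show a ScheduledFloat: a regularization hyperparameter (a dropout probability, a skip rate, a balancer probability, a whitening limit) whose current value, printed as "ans", is looked up as a function of the global batch_count. Assuming a piecewise-linear schedule over batch_count (a plausible reading of these lines, not icefall's actual implementation), a minimal stand-in could look like this:

    import bisect

    class PiecewiseScheduledFloat:
        """Hypothetical sketch of a batch_count-scheduled float (not icefall's ScheduledFloat)."""

        def __init__(self, *points: tuple[float, float]):
            # points: (batch_count, value) breakpoints, e.g. (0.0, 0.2), (4000.0, 0.0)
            assert points, "need at least one (batch_count, value) breakpoint"
            self.points = sorted(points)
            self.xs = [x for x, _ in self.points]

        def __call__(self, batch_count: float) -> float:
            i = bisect.bisect_right(self.xs, batch_count)
            if i == 0:
                return self.points[0][1]       # before the first breakpoint
            if i == len(self.points):
                return self.points[-1][1]      # past the last breakpoint: hold the final value
            (x0, y0), (x1, y1) = self.points[i - 1], self.points[i]
            t = (batch_count - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)

    # A skip rate annealed from 0.2 to 0.0 early in training stays pinned at 0.0
    # by batch_count ~1.5e6, matching the many "ans=0.0" skip-rate records nearby.
    skip_rate = PiecewiseScheduledFloat((0.0, 0.2), (4000.0, 0.0))
    print(skip_rate(1_535_991.33))  # -> 0.0

Each named parameter carries its own breakpoints, which is why records at the same batch_count show different values: balancer probabilities sit at 0.125, dropout at 0.1, while the annealed skip rates have long since reached 0.0.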
], tot_loss[loss=0.1871, simple_loss=0.2789, pruned_loss=0.04762, over 27094176.90 frames. ], batch size: 77, lr: 1.38e-03, grad_scale: 16.0 2023-10-14 00:39:21,814 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1545791.3333333333, ans=0.125 2023-10-14 00:39:22,492 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1545791.3333333333, ans=0.125 2023-10-14 00:39:39,493 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1545884.6666666667, ans=0.0 2023-10-14 00:39:39,793 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.01 vs. limit=22.5 2023-10-14 00:39:41,399 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1545884.6666666667, ans=0.1 2023-10-14 00:39:42,070 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.551e+02 1.855e+02 2.066e+02 2.343e+02 2.955e+02, threshold=4.131e+02, percent-clipped=0.0 2023-10-14 00:39:59,822 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1545978.0, ans=0.125 2023-10-14 00:40:03,161 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 00:40:36,419 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1546071.3333333333, ans=0.0 2023-10-14 00:40:44,556 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1546071.3333333333, ans=0.2 2023-10-14 00:40:53,916 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.92 vs. limit=15.0 2023-10-14 00:41:11,035 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=1546164.6666666667, ans=0.05 2023-10-14 00:41:35,126 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1546258.0, ans=0.125 2023-10-14 00:41:50,591 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1546304.6666666667, ans=0.0 2023-10-14 00:41:50,941 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.25 vs. 
limit=15.0 2023-10-14 00:42:02,970 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.597e+02 1.887e+02 2.077e+02 2.363e+02 2.995e+02, threshold=4.153e+02, percent-clipped=0.0 2023-10-14 00:42:13,573 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1546398.0, ans=0.125 2023-10-14 00:42:15,319 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1546398.0, ans=0.0 2023-10-14 00:42:28,033 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1546444.6666666667, ans=0.0 2023-10-14 00:42:44,377 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1546491.3333333333, ans=0.0 2023-10-14 00:42:54,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1546538.0, ans=0.125 2023-10-14 00:43:04,357 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1546538.0, ans=0.125 2023-10-14 00:43:20,733 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1546584.6666666667, ans=0.125 2023-10-14 00:43:41,225 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1546678.0, ans=0.2 2023-10-14 00:43:52,846 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1546678.0, ans=0.125 2023-10-14 00:43:54,336 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1546678.0, ans=0.125 2023-10-14 00:44:38,733 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1546818.0, ans=0.0 2023-10-14 00:44:43,325 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.508e+02 1.715e+02 1.840e+02 2.077e+02 2.935e+02, threshold=3.681e+02, percent-clipped=0.0 2023-10-14 00:45:02,141 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1546864.6666666667, ans=0.0 2023-10-14 00:45:11,913 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1546911.3333333333, ans=0.125 2023-10-14 00:45:16,291 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1546911.3333333333, ans=0.125 2023-10-14 00:45:18,178 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.72 vs. limit=22.5 2023-10-14 00:45:32,251 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.78 vs. 
limit=15.0 2023-10-14 00:45:42,971 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1547004.6666666667, ans=0.0 2023-10-14 00:46:24,760 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1547144.6666666667, ans=0.125 2023-10-14 00:46:31,885 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.50 vs. limit=15.0 2023-10-14 00:47:17,250 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.539e+02 1.776e+02 1.879e+02 2.184e+02 2.734e+02, threshold=3.759e+02, percent-clipped=0.0 2023-10-14 00:47:28,132 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.55 vs. limit=12.0 2023-10-14 00:47:40,559 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1547331.3333333333, ans=0.125 2023-10-14 00:48:13,065 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1547471.3333333333, ans=0.0 2023-10-14 00:48:58,411 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1547564.6666666667, ans=0.1 2023-10-14 00:49:06,063 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1547611.3333333333, ans=0.125 2023-10-14 00:49:11,316 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1547611.3333333333, ans=0.125 2023-10-14 00:49:45,885 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.46 vs. limit=15.0 2023-10-14 00:49:47,322 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.771e+02 1.917e+02 2.194e+02 3.241e+02, threshold=3.833e+02, percent-clipped=0.0 2023-10-14 00:49:50,585 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.17 vs. limit=22.5 2023-10-14 00:50:23,727 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1547891.3333333333, ans=0.125 2023-10-14 00:50:28,280 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=11.89 vs. limit=22.5 2023-10-14 00:50:31,128 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1547891.3333333333, ans=0.1 2023-10-14 00:50:40,424 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1547938.0, ans=0.125 2023-10-14 00:51:02,445 INFO [train.py:1031] (0/4) Epoch 25, batch 4000, loss[loss=0.1758, simple_loss=0.2671, pruned_loss=0.04226, over 16532.00 frames. ], tot_loss[loss=0.187, simple_loss=0.2785, pruned_loss=0.04771, over 28353641.66 frames. ], batch size: 56, lr: 1.38e-03, grad_scale: 32.0 2023-10-14 00:51:03,210 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.98 vs. 
limit=12.0 2023-10-14 00:51:27,283 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 00:51:32,952 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.01 vs. limit=15.0 2023-10-14 00:51:47,892 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1548124.6666666667, ans=0.1 2023-10-14 00:52:08,986 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1548218.0, ans=0.07 2023-10-14 00:52:13,875 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.532e+02 1.861e+02 2.042e+02 2.213e+02 2.987e+02, threshold=4.083e+02, percent-clipped=0.0 2023-10-14 00:52:45,876 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1548358.0, ans=0.125 2023-10-14 00:52:46,484 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.51 vs. limit=15.0 2023-10-14 00:53:17,623 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. limit=6.0 2023-10-14 00:53:35,553 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1548498.0, ans=0.125 2023-10-14 00:53:56,248 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1548591.3333333333, ans=0.0 2023-10-14 00:54:00,705 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1548591.3333333333, ans=0.09899494936611666 2023-10-14 00:54:07,602 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1548591.3333333333, ans=0.0 2023-10-14 00:54:22,068 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1548638.0, ans=0.0 2023-10-14 00:54:42,631 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1548684.6666666667, ans=0.09899494936611666 2023-10-14 00:54:43,173 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.536e+02 1.801e+02 1.971e+02 2.166e+02 3.356e+02, threshold=3.943e+02, percent-clipped=0.0 2023-10-14 00:54:50,682 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1548731.3333333333, ans=0.1 2023-10-14 00:54:51,216 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.52 vs. 
limit=15.0 2023-10-14 00:54:56,179 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1548731.3333333333, ans=0.0 2023-10-14 00:55:24,425 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1548824.6666666667, ans=15.0 2023-10-14 00:55:35,048 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1548824.6666666667, ans=0.0 2023-10-14 00:55:47,959 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1548871.3333333333, ans=0.0 2023-10-14 00:55:48,009 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1548871.3333333333, ans=0.1 2023-10-14 00:56:18,397 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 00:56:18,565 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1548964.6666666667, ans=0.125 2023-10-14 00:56:35,791 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1548964.6666666667, ans=0.125 2023-10-14 00:56:47,133 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1549011.3333333333, ans=0.1 2023-10-14 00:57:09,598 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.93 vs. limit=22.5 2023-10-14 00:57:49,006 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.468e+02 1.816e+02 2.035e+02 2.630e+02 4.385e+02, threshold=4.070e+02, percent-clipped=1.0 2023-10-14 00:57:58,485 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1549198.0, ans=0.125 2023-10-14 00:57:58,523 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 00:58:37,168 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1549291.3333333333, ans=0.125 2023-10-14 00:59:21,533 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1549384.6666666667, ans=0.1 2023-10-14 00:59:51,926 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1549478.0, ans=0.0 2023-10-14 01:00:29,219 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1549571.3333333333, ans=0.0 2023-10-14 01:00:39,014 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1549618.0, ans=0.0 2023-10-14 01:00:47,359 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.779e+02 1.981e+02 2.147e+02 2.752e+02, threshold=3.963e+02, percent-clipped=0.0 2023-10-14 01:01:09,989 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 01:01:21,923 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1549711.3333333333, 
ans=0.04949747468305833
2023-10-14 01:01:33,001 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1549758.0, ans=0.1
2023-10-14 01:01:43,278 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1549804.6666666667, ans=0.1
2023-10-14 01:01:56,406 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1549851.3333333333, ans=0.125
2023-10-14 01:02:06,589 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1549851.3333333333, ans=0.0
2023-10-14 01:02:44,398 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1549944.6666666667, ans=0.1
2023-10-14 01:03:21,501 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1550038.0, ans=0.2
2023-10-14 01:03:29,699 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1550038.0, ans=0.025
2023-10-14 01:04:01,127 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.86 vs. limit=15.0
2023-10-14 01:04:01,389 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.579e+02 1.881e+02 2.079e+02 2.309e+02 3.185e+02, threshold=4.159e+02, percent-clipped=0.0
2023-10-14 01:04:03,588 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.00 vs. limit=6.0
2023-10-14 01:04:23,527 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1550178.0, ans=0.125
2023-10-14 01:04:25,595 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-14 01:04:34,105 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1550178.0, ans=0.125
2023-10-14 01:04:54,415 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1550271.3333333333, ans=0.125
2023-10-14 01:05:07,399 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1550271.3333333333, ans=0.0
2023-10-14 01:05:07,835 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.47 vs. limit=10.0
2023-10-14 01:05:29,558 INFO [train.py:1031] (0/4) Epoch 25, batch 4500, loss[loss=0.1696, simple_loss=0.2597, pruned_loss=0.03969, over 16187.00 frames. ], tot_loss[loss=0.1868, simple_loss=0.2787, pruned_loss=0.04743, over 29338107.89 frames. ], batch size: 50, lr: 1.38e-03, grad_scale: 16.0
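The [train.py:1031] summaries above track the pruned-transducer objective: for each logged batch, loss is the per-frame weighted sum of the simple (trivial-joiner) loss and the pruned RNN-T loss, and tot_loss is the same statistic aggregated over all frames in the current averaging window. With a simple-loss scale of 0.5 (this run's setting, in effect after warm-up), the printed numbers are consistent with loss = 0.5 * simple_loss + pruned_loss. A minimal sketch of that bookkeeping, not the verbatim train.py code:

    def combined_loss(simple_loss: float, pruned_loss: float,
                      simple_loss_scale: float = 0.5) -> float:
        # All three quantities are normalized per frame, which is why each
        # log entry also records how many frames the average is taken "over".
        return simple_loss_scale * simple_loss + pruned_loss

    # Batch 4500 above: 0.5 * 0.2597 + 0.03969 = 0.1695 ~= the printed 0.1696.
    assert abs(combined_loss(0.2597, 0.03969) - 0.1696) < 2e-4

During the warm-up batches the two terms are ramped rather than fixed; by epoch 25 the steady-state weights apply.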
2023-10-14 01:05:44,196 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1550411.3333333333, ans=0.2
2023-10-14 01:06:06,200 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1550458.0, ans=0.125
2023-10-14 01:06:10,483 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1550458.0, ans=0.125
2023-10-14 01:06:45,235 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1550551.3333333333, ans=0.125
2023-10-14 01:06:53,597 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.765e+02 1.889e+02 2.110e+02 3.343e+02, threshold=3.778e+02, percent-clipped=0.0
2023-10-14 01:06:57,359 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1550598.0, ans=0.07
2023-10-14 01:07:19,900 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1550644.6666666667, ans=0.125
2023-10-14 01:07:57,359 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1550738.0, ans=0.0
2023-10-14 01:07:59,710 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1550738.0, ans=0.2
2023-10-14 01:08:09,361 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.70 vs. limit=12.0
2023-10-14 01:08:25,859 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1550831.3333333333, ans=0.125
2023-10-14 01:08:28,928 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.57 vs. limit=15.0
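Each [scaling.py:979] Whitening entry reports a diagnostic from a Whiten module: the metric measures how far the channel covariance of an activation is from a multiple of the identity (1.0 means fully white), and a corrective gradient is applied only when the metric exceeds the module's limit, which is why the log prints the measured metric against it. A minimal sketch of such a metric, assuming this definition rather than quoting scaling.py:

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
        # x: (num_frames, num_channels). Returns 1.0 when the per-group
        # channel covariance is isotropic, larger when a few directions
        # dominate (e.g. metric=11.57 vs. limit=15.0 in the entry above).
        num_frames, num_channels = x.shape
        assert num_channels % num_groups == 0
        c = num_channels // num_groups
        x = x.reshape(num_frames * num_groups, c)
        x = x - x.mean(dim=0)
        cov = (x.t() @ x) / x.shape[0]
        return (cov * cov).sum() * c / (cov.diag().sum() ** 2 + 1e-20)

The whitening limits are themselves often scheduled values, as the ...whiten.whitening_limit ScheduledFloat entries elsewhere in this log show.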
2023-10-14 01:08:37,482 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1550831.3333333333, ans=0.2
2023-10-14 01:08:41,904 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1550831.3333333333, ans=0.2
2023-10-14 01:09:12,344 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1550924.6666666667, ans=10.0
2023-10-14 01:09:47,138 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1551018.0, ans=0.0
2023-10-14 01:10:00,936 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.630e+02 1.916e+02 2.048e+02 2.213e+02 3.103e+02, threshold=4.097e+02, percent-clipped=0.0
2023-10-14 01:10:02,701 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1551064.6666666667, ans=0.1
2023-10-14 01:10:34,430 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1551111.3333333333, ans=0.125
2023-10-14 01:11:27,619 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1551298.0, ans=0.2
2023-10-14 01:11:34,783 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0
2023-10-14 01:11:46,245 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-14 01:11:53,608 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.04 vs. limit=10.0
2023-10-14 01:12:11,249 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1551391.3333333333, ans=0.2
2023-10-14 01:12:11,294 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1551391.3333333333, ans=0.5
2023-10-14 01:12:28,362 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.19 vs. limit=15.0
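The [scaling.py:1069] WithLoss entries track an auxiliary penalty attached to the self_attn_weights tensors; loss-sum=0.000e+00 means the penalty contributed nothing over the last logging interval. A natural way to attach such a loss without changing the forward computation is an identity autograd Function that injects the penalty's gradient during backward; the sketch below assumes that mechanism, and attn_penalty is a hypothetical stand-in for the real term:

    import torch

    def attn_penalty(attn_weights: torch.Tensor) -> torch.Tensor:
        # Hypothetical example penalty: discourage attention mass from
        # collapsing onto a single key.
        return (attn_weights ** 2).sum(dim=-1).mean()

    class AttachLoss(torch.autograd.Function):
        # Identity in forward; backward adds the gradient of the penalty.
        @staticmethod
        def forward(ctx, x: torch.Tensor, scale: float) -> torch.Tensor:
            ctx.save_for_backward(x.detach())
            ctx.scale = scale
            return x

        @staticmethod
        def backward(ctx, x_grad: torch.Tensor):
            (x,) = ctx.saved_tensors
            with torch.enable_grad():
                x = x.clone().requires_grad_(True)
                loss = ctx.scale * attn_penalty(x)
                (penalty_grad,) = torch.autograd.grad(loss, x)
            return x_grad + penalty_grad, None

    # Usage: attn_weights = AttachLoss.apply(attn_weights, 1.0)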
2023-10-14 01:12:53,898 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.544e+02 1.813e+02 2.018e+02 2.178e+02 2.751e+02, threshold=4.036e+02, percent-clipped=0.0
2023-10-14 01:12:59,989 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-10-14 01:13:07,488 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1551531.3333333333, ans=0.0
2023-10-14 01:13:28,675 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1551578.0, ans=6.0
2023-10-14 01:13:39,752 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1551624.6666666667, ans=0.2
2023-10-14 01:14:22,287 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1551718.0, ans=0.0
2023-10-14 01:14:37,835 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1551764.6666666667, ans=0.125
2023-10-14 01:15:01,175 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1551858.0, ans=0.1
2023-10-14 01:15:18,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1551904.6666666667, ans=0.015
2023-10-14 01:15:24,845 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.97 vs. limit=15.0
2023-10-14 01:15:45,557 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1551951.3333333333, ans=0.1
2023-10-14 01:16:02,092 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.04 vs. limit=15.0
2023-10-14 01:16:06,280 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.775e+02 1.914e+02 2.145e+02 2.747e+02, threshold=3.827e+02, percent-clipped=0.0
2023-10-14 01:16:14,293 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1551998.0, ans=0.125
2023-10-14 01:16:27,557 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.73 vs. limit=15.0
2023-10-14 01:16:51,955 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.08 vs. limit=15.0
2023-10-14 01:16:56,766 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.46 vs. limit=12.0
2023-10-14 01:17:13,500 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.81 vs.
limit=15.0 2023-10-14 01:17:18,834 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1552138.0, ans=0.125 2023-10-14 01:17:34,696 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1552184.6666666667, ans=0.125 2023-10-14 01:17:56,016 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1552231.3333333333, ans=0.1 2023-10-14 01:18:07,213 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.10 vs. limit=15.0 2023-10-14 01:18:12,693 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1552278.0, ans=0.0 2023-10-14 01:18:12,695 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1552278.0, ans=0.125 2023-10-14 01:18:35,882 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1552324.6666666667, ans=0.0 2023-10-14 01:18:38,146 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1552324.6666666667, ans=0.125 2023-10-14 01:18:48,612 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1552371.3333333333, ans=0.1 2023-10-14 01:19:11,623 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.783e+02 1.972e+02 2.129e+02 2.952e+02, threshold=3.944e+02, percent-clipped=0.0 2023-10-14 01:19:19,813 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1552464.6666666667, ans=0.0 2023-10-14 01:19:44,019 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1552511.3333333333, ans=0.125 2023-10-14 01:20:29,621 INFO [train.py:1031] (0/4) Epoch 25, batch 5000, loss[loss=0.1827, simple_loss=0.252, pruned_loss=0.05667, over 12617.00 frames. ], tot_loss[loss=0.1869, simple_loss=0.2786, pruned_loss=0.04758, over 30138758.00 frames. ], batch size: 440, lr: 1.38e-03, grad_scale: 32.0 2023-10-14 01:20:30,095 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1552698.0, ans=0.125 2023-10-14 01:20:46,959 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.72 vs. 
limit=10.0 2023-10-14 01:20:47,925 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1552744.6666666667, ans=0.035 2023-10-14 01:20:57,164 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1552791.3333333333, ans=0.07 2023-10-14 01:20:57,227 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1552791.3333333333, ans=0.125 2023-10-14 01:21:02,046 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1552791.3333333333, ans=0.125 2023-10-14 01:21:31,739 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.455e+02 1.780e+02 1.937e+02 2.152e+02 3.729e+02, threshold=3.873e+02, percent-clipped=0.0 2023-10-14 01:21:33,881 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1552931.3333333333, ans=0.0 2023-10-14 01:21:33,935 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1552931.3333333333, ans=0.125 2023-10-14 01:21:36,163 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1552931.3333333333, ans=0.125 2023-10-14 01:21:47,156 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1552978.0, ans=0.1 2023-10-14 01:21:49,302 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1552978.0, ans=0.1 2023-10-14 01:21:54,336 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1552978.0, ans=0.125 2023-10-14 01:21:55,242 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1552978.0, ans=0.125 2023-10-14 01:22:01,966 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1553024.6666666667, ans=0.125 2023-10-14 01:22:10,339 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1553024.6666666667, ans=0.125 2023-10-14 01:22:11,206 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1553024.6666666667, ans=0.125 2023-10-14 01:22:21,050 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1553071.3333333333, ans=0.0 2023-10-14 01:22:36,930 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1553118.0, ans=0.125 2023-10-14 01:22:46,101 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.17 vs. 
limit=22.5 2023-10-14 01:22:48,388 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1553164.6666666667, ans=0.0 2023-10-14 01:22:53,732 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=1553164.6666666667, ans=0.2 2023-10-14 01:23:08,392 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.95 vs. limit=15.0 2023-10-14 01:23:17,439 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.59 vs. limit=10.0 2023-10-14 01:23:29,051 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1553304.6666666667, ans=0.2 2023-10-14 01:23:48,409 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.538e+02 1.815e+02 2.017e+02 2.279e+02 2.951e+02, threshold=4.034e+02, percent-clipped=0.0 2023-10-14 01:23:53,819 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.33 vs. limit=15.0 2023-10-14 01:23:59,488 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1553398.0, ans=0.0 2023-10-14 01:24:10,665 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1553444.6666666667, ans=0.125 2023-10-14 01:24:25,106 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.94 vs. limit=15.0 2023-10-14 01:24:34,355 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1553491.3333333333, ans=0.0 2023-10-14 01:25:28,629 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1553678.0, ans=0.2 2023-10-14 01:25:44,803 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1553678.0, ans=0.0 2023-10-14 01:25:44,948 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1553678.0, ans=0.0 2023-10-14 01:25:51,099 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1553724.6666666667, ans=0.1 2023-10-14 01:26:25,399 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.78 vs. limit=22.5 2023-10-14 01:26:35,101 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.838e+02 2.064e+02 2.440e+02 3.125e+02, threshold=4.128e+02, percent-clipped=0.0 2023-10-14 01:26:37,651 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1553864.6666666667, ans=0.1 2023-10-14 01:26:40,291 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.50 vs. 
limit=15.0 2023-10-14 01:27:15,317 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1553958.0, ans=0.1 2023-10-14 01:27:29,987 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1554004.6666666667, ans=0.125 2023-10-14 01:27:54,055 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1554051.3333333333, ans=0.125 2023-10-14 01:27:55,255 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1554051.3333333333, ans=0.125 2023-10-14 01:27:55,324 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1554051.3333333333, ans=0.2 2023-10-14 01:28:12,745 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1554098.0, ans=0.0 2023-10-14 01:28:15,915 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1554098.0, ans=0.0 2023-10-14 01:28:18,649 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1554144.6666666667, ans=0.125 2023-10-14 01:28:43,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1554191.3333333333, ans=0.125 2023-10-14 01:29:25,617 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.559e+02 1.754e+02 1.879e+02 2.100e+02 3.034e+02, threshold=3.757e+02, percent-clipped=0.0 2023-10-14 01:29:55,841 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1554378.0, ans=0.0 2023-10-14 01:30:34,442 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1554471.3333333333, ans=0.125 2023-10-14 01:30:34,933 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.96 vs. limit=15.0 2023-10-14 01:30:44,637 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.23 vs. limit=15.0 2023-10-14 01:30:44,709 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.83 vs. limit=6.0 2023-10-14 01:30:51,003 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1554518.0, ans=0.125 2023-10-14 01:31:11,286 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1554564.6666666667, ans=0.1 2023-10-14 01:31:48,337 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=4.78 vs. 
limit=15.0
2023-10-14 01:31:59,977 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1554751.3333333333, ans=0.0
2023-10-14 01:32:01,042 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1554751.3333333333, ans=0.0
2023-10-14 01:32:12,325 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.479e+02 1.741e+02 1.894e+02 2.090e+02 3.309e+02, threshold=3.787e+02, percent-clipped=0.0
2023-10-14 01:32:12,769 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1554751.3333333333, ans=0.1
2023-10-14 01:32:19,843 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1554798.0, ans=0.1
2023-10-14 01:32:33,576 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1554844.6666666667, ans=0.0
2023-10-14 01:32:45,995 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1554891.3333333333, ans=0.0
2023-10-14 01:32:47,529 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1554891.3333333333, ans=0.07
2023-10-14 01:33:01,502 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1554938.0, ans=0.5
2023-10-14 01:33:15,027 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.14 vs. limit=6.0
2023-10-14 01:33:22,387 INFO [train.py:1031] (0/4) Epoch 25, batch 5500, loss[loss=0.1967, simple_loss=0.2815, pruned_loss=0.05599, over 16049.00 frames. ], tot_loss[loss=0.1866, simple_loss=0.2784, pruned_loss=0.04742, over 30721510.57 frames. ], batch size: 296, lr: 1.38e-03, grad_scale: 32.0
2023-10-14 01:33:36,019 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1555078.0, ans=0.125
2023-10-14 01:33:36,046 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-14 01:33:40,989 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-14 01:33:41,393 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.46 vs. limit=15.0
2023-10-14 01:33:43,632 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1555078.0, ans=0.0
2023-10-14 01:34:08,472 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.05 vs. limit=15.0
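The steady stream of [scaling.py:199] entries records ScheduledFloat values: hyper-parameters (skip rates, balancer probabilities, dropout rates, whitening limits) that are functions of batch_count rather than constants, so the log prints the value currently in effect (ans=...). A minimal sketch of a piecewise-linear schedule in that spirit, with made-up breakpoints, not the scaling.py class itself:

    class PiecewiseLinearSchedule:
        def __init__(self, *points):
            # points: (batch_count, value) pairs, e.g. ((0.0, 0.2), (4000.0, 0.0))
            self.points = sorted(points)

        def value(self, batch_count: float) -> float:
            pts = self.points
            if batch_count <= pts[0][0]:
                return pts[0][1]
            for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
                if batch_count <= x1:
                    # Linear interpolation between the two breakpoints.
                    return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)
            return pts[-1][1]

    # A skip rate ramped from 0.2 down to 0.0 over the first 4000 batches
    # (hypothetical breakpoints) is long since at its final value by
    # batch_count ~ 1.55e6, consistent with the ans=0.0 entries above:
    skip_rate = PiecewiseLinearSchedule((0.0, 0.2), (4000.0, 0.0))
    assert skip_rate.value(1554891.0) == 0.0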
2023-10-14 01:34:17,237 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-14 01:34:19,297 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1555218.0, ans=0.2
2023-10-14 01:34:23,960 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1555218.0, ans=0.1
2023-10-14 01:34:24,557 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.569e+02 1.796e+02 1.966e+02 2.130e+02 3.043e+02, threshold=3.932e+02, percent-clipped=0.0
2023-10-14 01:34:39,510 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1555311.3333333333, ans=0.0
2023-10-14 01:35:53,594 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.04 vs. limit=15.0
2023-10-14 01:36:02,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1555591.3333333333, ans=0.125
2023-10-14 01:36:12,055 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1555591.3333333333, ans=0.125
2023-10-14 01:36:38,825 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1555684.6666666667, ans=0.125
2023-10-14 01:36:46,826 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.554e+02 1.823e+02 1.954e+02 2.145e+02 2.802e+02, threshold=3.908e+02, percent-clipped=0.0
2023-10-14 01:36:58,046 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1555731.3333333333, ans=0.125
2023-10-14 01:37:05,344 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1555778.0, ans=0.125
2023-10-14 01:37:07,404 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1555778.0, ans=0.0
2023-10-14 01:37:18,276 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1555824.6666666667, ans=0.1
2023-10-14 01:37:32,081 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.63 vs. limit=15.0
2023-10-14 01:37:33,583 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1555871.3333333333, ans=0.125
2023-10-14 01:37:49,959 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.02 vs. limit=10.0
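Each [optim.py:471] line summarizes the distribution of recent gradient norms: the five numbers after "grad-norm quartiles" are the minimum, 25th, 50th and 75th percentiles and the maximum, the printed threshold is consistent with Clipping_scale times the median (2.0 * 1.966e+02 = 3.932e+02 in the entry above), and percent-clipped reports how often gradients actually exceeded it. A small sketch of that bookkeeping under those assumptions, not the verbatim optim.py code:

    import torch

    def clipping_summary(recent_grad_norms, clipping_scale: float = 2.0):
        norms = torch.as_tensor(recent_grad_norms, dtype=torch.float32)
        # Five-point summary: min, 25%, median, 75%, max.
        quartiles = torch.quantile(
            norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        # Threshold scales with the median of recent norms, so clipping
        # adapts as gradient magnitudes drift during training.
        threshold = clipping_scale * quartiles[2]
        percent_clipped = 100.0 * (norms > threshold).float().mean()
        return quartiles, threshold, percent_clipped

percent-clipped=0.0 in the entries above says the threshold was never hit during this stretch, i.e. training is running well inside the clipping margin.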
2023-10-14 01:38:08,793 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1555918.0, ans=0.125
2023-10-14 01:38:28,600 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1556011.3333333333, ans=0.0
2023-10-14 01:38:44,136 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=1556058.0, ans=10.0
2023-10-14 01:38:56,494 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1556104.6666666667, ans=0.2
2023-10-14 01:38:56,502 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1556104.6666666667, ans=0.0
2023-10-14 01:39:02,921 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.41 vs. limit=15.0
2023-10-14 01:39:19,817 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.490e+02 1.866e+02 1.984e+02 2.200e+02 3.265e+02, threshold=3.968e+02, percent-clipped=0.0
2023-10-14 01:39:25,751 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1556198.0, ans=0.1
2023-10-14 01:40:21,102 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1556384.6666666667, ans=0.125
2023-10-14 01:40:49,009 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1556431.3333333333, ans=0.2
2023-10-14 01:41:06,623 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.73 vs. limit=12.0
2023-10-14 01:41:08,000 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.22 vs. limit=22.5
2023-10-14 01:41:09,263 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1556478.0, ans=0.125
2023-10-14 01:41:10,202 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1556478.0, ans=0.1
2023-10-14 01:41:14,424 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1556524.6666666667, ans=0.0
2023-10-14 01:41:20,154 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1556524.6666666667, ans=0.09899494936611666
2023-10-14 01:41:21,220 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1556524.6666666667, ans=0.125
2023-10-14 01:41:36,169 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.08 vs.
limit=15.0 2023-10-14 01:42:08,657 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1556618.0, ans=0.04949747468305833 2023-10-14 01:42:11,702 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.474e+02 1.813e+02 1.976e+02 2.211e+02 3.514e+02, threshold=3.951e+02, percent-clipped=0.0 2023-10-14 01:42:37,647 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1556711.3333333333, ans=0.07 2023-10-14 01:42:52,930 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1556758.0, ans=0.0 2023-10-14 01:43:00,344 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=6.91 vs. limit=15.0 2023-10-14 01:43:08,865 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.33 vs. limit=15.0 2023-10-14 01:43:20,835 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1556804.6666666667, ans=0.1 2023-10-14 01:43:38,294 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1556851.3333333333, ans=0.125 2023-10-14 01:43:43,405 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1556851.3333333333, ans=0.125 2023-10-14 01:44:09,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1556898.0, ans=0.125 2023-10-14 01:44:18,260 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.98 vs. limit=10.0 2023-10-14 01:44:19,287 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.93 vs. limit=15.0 2023-10-14 01:44:32,107 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1556944.6666666667, ans=0.2 2023-10-14 01:45:13,202 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1557038.0, ans=0.125 2023-10-14 01:45:23,118 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.49 vs. limit=15.0 2023-10-14 01:45:50,724 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.527e+02 1.770e+02 2.028e+02 2.296e+02 3.429e+02, threshold=4.056e+02, percent-clipped=0.0 2023-10-14 01:46:03,052 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.47 vs. limit=12.0 2023-10-14 01:46:47,309 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1557224.6666666667, ans=0.125 2023-10-14 01:47:19,600 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1557318.0, ans=0.2 2023-10-14 01:47:31,311 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.85 vs. 
limit=15.0 2023-10-14 01:47:31,534 INFO [train.py:1031] (0/4) Epoch 25, batch 6000, loss[loss=0.1894, simple_loss=0.2799, pruned_loss=0.04944, over 15766.00 frames. ], tot_loss[loss=0.1867, simple_loss=0.2784, pruned_loss=0.04748, over 31180632.17 frames. ], batch size: 36, lr: 1.38e-03, grad_scale: 32.0 2023-10-14 01:48:43,713 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 1.921e+02 2.127e+02 2.411e+02 3.205e+02, threshold=4.254e+02, percent-clipped=0.0 2023-10-14 01:48:50,947 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1557598.0, ans=15.0 2023-10-14 01:49:04,734 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1557691.3333333333, ans=0.1 2023-10-14 01:49:14,923 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-14 01:49:28,454 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.74 vs. limit=10.0 2023-10-14 01:49:30,170 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1557784.6666666667, ans=0.0 2023-10-14 01:49:37,916 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=1557784.6666666667, ans=0.05 2023-10-14 01:49:45,239 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.34 vs. limit=15.0 2023-10-14 01:50:06,472 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1557924.6666666667, ans=0.0 2023-10-14 01:50:19,415 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1557971.3333333333, ans=0.125 2023-10-14 01:50:33,213 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.09 vs. limit=10.0 2023-10-14 01:50:34,534 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1558018.0, ans=0.125 2023-10-14 01:50:37,260 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.460e+02 1.867e+02 2.061e+02 2.265e+02 3.155e+02, threshold=4.122e+02, percent-clipped=0.0 2023-10-14 01:50:41,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1558064.6666666667, ans=0.125 2023-10-14 01:50:44,837 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1558064.6666666667, ans=0.0 2023-10-14 01:51:03,199 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1558158.0, ans=0.2 2023-10-14 01:51:52,862 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1558344.6666666667, ans=0.125 2023-10-14 01:52:31,371 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.69 vs. 
limit=15.0 2023-10-14 01:52:33,597 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.547e+02 1.987e+02 2.187e+02 2.456e+02 3.411e+02, threshold=4.374e+02, percent-clipped=0.0 2023-10-14 01:53:05,527 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1558624.6666666667, ans=0.125 2023-10-14 01:53:20,074 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=12.0 2023-10-14 01:53:29,042 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1558718.0, ans=0.125 2023-10-14 01:53:46,272 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1558811.3333333333, ans=0.035 2023-10-14 01:53:46,435 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1558811.3333333333, ans=0.125 2023-10-14 01:53:53,079 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1558811.3333333333, ans=0.0 2023-10-14 01:53:53,355 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.84 vs. limit=15.0 2023-10-14 01:54:02,307 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_ff2.min_abs, batch_count=1558858.0, ans=0.1 2023-10-14 01:54:02,308 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1558858.0, ans=0.0 2023-10-14 01:54:03,123 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1558858.0, ans=0.1 2023-10-14 01:54:11,640 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1558858.0, ans=0.125 2023-10-14 01:54:35,497 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1558951.3333333333, ans=0.0 2023-10-14 01:54:38,095 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.568e+02 1.848e+02 2.039e+02 2.330e+02 3.399e+02, threshold=4.078e+02, percent-clipped=0.0 2023-10-14 01:54:45,370 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1558998.0, ans=0.0 2023-10-14 01:54:48,794 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1558998.0, ans=0.0 2023-10-14 01:54:50,768 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1559044.6666666667, ans=0.0 2023-10-14 01:55:04,169 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1559044.6666666667, ans=0.125 2023-10-14 01:55:10,980 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1559091.3333333333, ans=0.07 2023-10-14 01:55:12,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1559091.3333333333, ans=0.07 2023-10-14 01:55:37,943 INFO [scaling.py:979] (0/4) Whitening: 
name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.93 vs. limit=10.0 2023-10-14 01:56:03,996 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.87 vs. limit=12.0 2023-10-14 01:56:29,886 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.51 vs. limit=10.0 2023-10-14 01:56:50,176 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.64 vs. limit=15.0 2023-10-14 01:56:53,512 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.37 vs. limit=15.0 2023-10-14 01:57:08,938 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.441e+02 1.761e+02 1.975e+02 2.221e+02 3.023e+02, threshold=3.950e+02, percent-clipped=0.0 2023-10-14 01:57:20,829 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1559464.6666666667, ans=0.125 2023-10-14 01:57:36,169 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.57 vs. limit=12.0 2023-10-14 01:58:10,738 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1559651.3333333333, ans=0.0 2023-10-14 01:58:15,179 INFO [train.py:1031] (0/4) Epoch 25, batch 6500, loss[loss=0.2054, simple_loss=0.2983, pruned_loss=0.05629, over 16606.00 frames. ], tot_loss[loss=0.1873, simple_loss=0.279, pruned_loss=0.04773, over 31510042.14 frames. ], batch size: 241, lr: 1.38e-03, grad_scale: 32.0 2023-10-14 01:58:21,984 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.04 vs. limit=15.0 2023-10-14 01:58:24,769 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1559698.0, ans=0.1 2023-10-14 01:59:20,313 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1559884.6666666667, ans=0.1 2023-10-14 01:59:37,929 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.553e+02 1.812e+02 2.001e+02 2.241e+02 2.998e+02, threshold=4.002e+02, percent-clipped=0.0 2023-10-14 01:59:48,274 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.65 vs. limit=22.5 2023-10-14 01:59:51,867 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1559978.0, ans=0.125 2023-10-14 01:59:54,836 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.39 vs. limit=6.0 2023-10-14 02:00:05,268 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1559978.0, ans=0.125 2023-10-14 02:00:24,538 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.69 vs. 
limit=22.5 2023-10-14 02:00:31,294 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1560071.3333333333, ans=0.0 2023-10-14 02:00:44,616 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1560118.0, ans=0.125 2023-10-14 02:00:55,949 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1560164.6666666667, ans=0.0 2023-10-14 02:01:40,369 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1560304.6666666667, ans=0.0 2023-10-14 02:01:55,025 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.833e+02 2.027e+02 2.262e+02 3.241e+02, threshold=4.054e+02, percent-clipped=0.0 2023-10-14 02:02:00,056 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1560398.0, ans=0.0 2023-10-14 02:02:07,300 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1560444.6666666667, ans=0.0 2023-10-14 02:02:09,084 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1560444.6666666667, ans=0.0 2023-10-14 02:02:24,146 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.26 vs. limit=15.0 2023-10-14 02:02:59,224 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1560631.3333333333, ans=0.125 2023-10-14 02:03:04,280 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=1560631.3333333333, ans=0.1 2023-10-14 02:03:13,287 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1560678.0, ans=0.0 2023-10-14 02:03:16,709 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1560678.0, ans=0.1 2023-10-14 02:03:34,774 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1560724.6666666667, ans=0.125 2023-10-14 02:03:36,086 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.37 vs. limit=15.0 2023-10-14 02:03:58,001 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=6.68 vs. limit=15.0 2023-10-14 02:04:06,844 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.404e+02 1.701e+02 1.868e+02 2.039e+02 2.419e+02, threshold=3.736e+02, percent-clipped=0.0 2023-10-14 02:04:20,785 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1560911.3333333333, ans=0.05 2023-10-14 02:04:37,343 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.48 vs. 
limit=12.0 2023-10-14 02:04:40,714 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1560958.0, ans=0.0 2023-10-14 02:04:41,925 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1560958.0, ans=0.1 2023-10-14 02:04:45,309 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1560958.0, ans=0.1 2023-10-14 02:04:46,476 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1561004.6666666667, ans=0.1 2023-10-14 02:04:56,410 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1561004.6666666667, ans=0.125 2023-10-14 02:05:01,659 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1561051.3333333333, ans=0.09899494936611666 2023-10-14 02:05:09,435 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1561051.3333333333, ans=0.0 2023-10-14 02:05:26,894 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1561098.0, ans=0.1 2023-10-14 02:05:49,014 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1561144.6666666667, ans=0.0 2023-10-14 02:06:15,733 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1561238.0, ans=0.125 2023-10-14 02:06:19,752 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1561238.0, ans=0.025 2023-10-14 02:06:39,664 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1561331.3333333333, ans=0.125 2023-10-14 02:06:42,563 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1561331.3333333333, ans=0.125 2023-10-14 02:06:43,017 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.735e+02 1.865e+02 2.148e+02 3.287e+02, threshold=3.731e+02, percent-clipped=0.0 2023-10-14 02:07:01,483 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1561378.0, ans=0.2 2023-10-14 02:07:02,640 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1561378.0, ans=0.0 2023-10-14 02:07:10,133 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1561424.6666666667, ans=0.1 2023-10-14 02:07:29,962 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.83 vs. limit=12.0 2023-10-14 02:07:47,324 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.59 vs. 
limit=15.0 2023-10-14 02:07:59,763 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1561611.3333333333, ans=0.1 2023-10-14 02:08:07,474 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1561611.3333333333, ans=0.125 2023-10-14 02:08:17,286 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1561658.0, ans=0.125 2023-10-14 02:08:48,472 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1561751.3333333333, ans=0.125 2023-10-14 02:08:49,572 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1561798.0, ans=0.1 2023-10-14 02:08:52,644 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.353e+02 1.780e+02 1.948e+02 2.216e+02 3.561e+02, threshold=3.896e+02, percent-clipped=0.0 2023-10-14 02:09:15,322 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1561891.3333333333, ans=0.125 2023-10-14 02:09:27,970 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1561938.0, ans=0.125 2023-10-14 02:09:28,378 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.74 vs. limit=15.0 2023-10-14 02:09:29,994 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.96 vs. limit=10.0 2023-10-14 02:09:52,074 INFO [train.py:1031] (0/4) Epoch 25, batch 7000, loss[loss=0.1981, simple_loss=0.2914, pruned_loss=0.05247, over 16705.00 frames. ], tot_loss[loss=0.1873, simple_loss=0.2794, pruned_loss=0.04759, over 31806916.63 frames. ], batch size: 202, lr: 1.37e-03, grad_scale: 16.0 2023-10-14 02:10:07,784 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1562078.0, ans=0.0 2023-10-14 02:10:09,369 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1562078.0, ans=0.1 2023-10-14 02:10:10,929 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1562078.0, ans=0.125 2023-10-14 02:10:11,036 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1562078.0, ans=0.125 2023-10-14 02:10:28,717 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.54 vs. limit=6.0 2023-10-14 02:10:35,103 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.16 vs. 
limit=15.0 2023-10-14 02:11:01,281 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.399e+02 1.797e+02 1.946e+02 2.145e+02 2.799e+02, threshold=3.892e+02, percent-clipped=0.0 2023-10-14 02:11:01,878 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1562264.6666666667, ans=0.125 2023-10-14 02:12:29,295 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1562544.6666666667, ans=0.125 2023-10-14 02:12:34,029 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.31 vs. limit=6.0 2023-10-14 02:12:34,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1562591.3333333333, ans=0.125 2023-10-14 02:12:43,092 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1562591.3333333333, ans=0.125 2023-10-14 02:12:51,502 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1562638.0, ans=0.125 2023-10-14 02:13:04,055 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1562684.6666666667, ans=0.2 2023-10-14 02:13:04,242 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1562684.6666666667, ans=0.0 2023-10-14 02:13:07,265 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1562684.6666666667, ans=0.0 2023-10-14 02:13:14,292 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.78 vs. limit=22.5 2023-10-14 02:13:17,841 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.19 vs. limit=15.0 2023-10-14 02:13:22,002 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1562731.3333333333, ans=0.125 2023-10-14 02:13:23,181 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.875e+02 1.996e+02 2.168e+02 2.790e+02, threshold=3.991e+02, percent-clipped=0.0 2023-10-14 02:13:31,787 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.29 vs. limit=15.0 2023-10-14 02:14:00,415 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 02:14:27,147 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.32 vs. limit=15.0 2023-10-14 02:14:35,024 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1562964.6666666667, ans=0.1 2023-10-14 02:14:42,716 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.83 vs. 
limit=15.0 2023-10-14 02:14:46,215 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1563011.3333333333, ans=0.125 2023-10-14 02:15:06,011 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1563058.0, ans=0.1 2023-10-14 02:15:12,096 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1563058.0, ans=0.1 2023-10-14 02:15:33,553 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1563151.3333333333, ans=0.1 2023-10-14 02:15:51,867 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1563198.0, ans=0.125 2023-10-14 02:15:54,168 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.440e+02 1.797e+02 1.967e+02 2.100e+02 2.869e+02, threshold=3.933e+02, percent-clipped=0.0 2023-10-14 02:16:06,115 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1563244.6666666667, ans=0.0 2023-10-14 02:16:17,567 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1563244.6666666667, ans=0.1 2023-10-14 02:16:17,824 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.86 vs. limit=15.0 2023-10-14 02:16:18,416 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1563291.3333333333, ans=0.2 2023-10-14 02:16:22,526 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1563291.3333333333, ans=0.09899494936611666 2023-10-14 02:16:32,294 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1563338.0, ans=0.1 2023-10-14 02:16:34,517 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1563338.0, ans=0.2 2023-10-14 02:16:35,426 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1563338.0, ans=0.1 2023-10-14 02:17:02,707 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=1563431.3333333333, ans=0.025 2023-10-14 02:17:08,971 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1563431.3333333333, ans=0.0 2023-10-14 02:17:09,114 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1563431.3333333333, ans=0.2 2023-10-14 02:17:17,665 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1563478.0, ans=0.125 2023-10-14 02:17:18,537 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1563478.0, ans=0.2 2023-10-14 02:17:26,192 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1563478.0, ans=0.2 2023-10-14 02:17:48,137 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1563571.3333333333, ans=0.1 2023-10-14 02:17:48,207 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1563571.3333333333, ans=0.1 2023-10-14 02:17:59,405 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 02:18:05,586 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1563618.0, ans=0.1 2023-10-14 02:18:05,893 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.71 vs. limit=15.0 2023-10-14 02:18:09,595 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1563618.0, ans=0.0 2023-10-14 02:18:14,581 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.576e+02 1.906e+02 2.098e+02 2.525e+02 3.342e+02, threshold=4.197e+02, percent-clipped=0.0 2023-10-14 02:18:20,900 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1563664.6666666667, ans=0.125 2023-10-14 02:18:25,251 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1563711.3333333333, ans=0.125 2023-10-14 02:18:56,373 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1563804.6666666667, ans=0.125 2023-10-14 02:19:31,152 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1563944.6666666667, ans=0.0 2023-10-14 02:19:32,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1563944.6666666667, ans=0.0 2023-10-14 02:19:51,376 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1563991.3333333333, ans=0.125 2023-10-14 02:19:52,279 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1563991.3333333333, ans=0.125 2023-10-14 02:19:58,328 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=5.74 vs. limit=12.0 2023-10-14 02:20:07,170 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1564038.0, ans=0.2 2023-10-14 02:20:25,335 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.857e+02 1.973e+02 2.145e+02 3.193e+02, threshold=3.946e+02, percent-clipped=0.0 2023-10-14 02:20:37,616 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1564178.0, ans=0.125 2023-10-14 02:20:47,173 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1564224.6666666667, ans=0.1 2023-10-14 02:20:55,707 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.47 vs. 
limit=22.5 2023-10-14 02:20:58,898 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1564224.6666666667, ans=0.0 2023-10-14 02:21:23,407 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.93 vs. limit=15.0 2023-10-14 02:21:28,742 INFO [train.py:1031] (0/4) Epoch 25, batch 7500, loss[loss=0.1965, simple_loss=0.2939, pruned_loss=0.04953, over 16861.00 frames. ], tot_loss[loss=0.1874, simple_loss=0.2793, pruned_loss=0.04771, over 32020940.05 frames. ], batch size: 146, lr: 1.37e-03, grad_scale: 32.0 2023-10-14 02:21:55,445 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1564458.0, ans=0.125 2023-10-14 02:22:00,826 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1564458.0, ans=0.125 2023-10-14 02:22:16,086 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1564504.6666666667, ans=0.125 2023-10-14 02:22:39,984 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.803e+02 1.977e+02 2.133e+02 3.002e+02, threshold=3.954e+02, percent-clipped=0.0 2023-10-14 02:22:56,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1564644.6666666667, ans=0.1 2023-10-14 02:22:57,004 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1564644.6666666667, ans=0.0 2023-10-14 02:22:57,805 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1564644.6666666667, ans=0.035 2023-10-14 02:23:05,583 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1564691.3333333333, ans=0.125 2023-10-14 02:23:16,276 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1564691.3333333333, ans=0.0 2023-10-14 02:23:37,328 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1564784.6666666667, ans=0.015 2023-10-14 02:24:02,984 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1564878.0, ans=0.0 2023-10-14 02:24:06,275 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.77 vs. limit=10.0 2023-10-14 02:24:07,300 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.60 vs. limit=15.0 2023-10-14 02:24:11,181 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.21 vs. limit=15.0 2023-10-14 02:24:12,532 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.41 vs. 
limit=15.0 2023-10-14 02:24:14,850 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1564924.6666666667, ans=0.125 2023-10-14 02:24:20,337 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1564924.6666666667, ans=0.0 2023-10-14 02:24:48,029 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 02:24:49,557 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1564971.3333333333, ans=0.125 2023-10-14 02:24:59,824 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1565018.0, ans=0.125 2023-10-14 02:25:10,991 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1565064.6666666667, ans=0.1 2023-10-14 02:25:12,368 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.822e+02 1.958e+02 2.209e+02 3.104e+02, threshold=3.916e+02, percent-clipped=0.0 2023-10-14 02:25:23,965 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1565111.3333333333, ans=0.125 2023-10-14 02:25:32,451 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1565111.3333333333, ans=0.125 2023-10-14 02:25:44,270 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1565158.0, ans=0.1 2023-10-14 02:26:12,431 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1565251.3333333333, ans=0.0 2023-10-14 02:26:12,574 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.39 vs. limit=15.0 2023-10-14 02:26:14,362 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1565298.0, ans=0.125 2023-10-14 02:26:16,684 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1565298.0, ans=0.0 2023-10-14 02:26:25,475 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-10-14 02:26:31,708 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1565344.6666666667, ans=0.0 2023-10-14 02:26:33,856 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1565344.6666666667, ans=0.125 2023-10-14 02:26:36,530 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1565344.6666666667, ans=0.0 2023-10-14 02:26:39,186 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1565344.6666666667, ans=0.125 2023-10-14 02:26:47,231 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.44 vs. 
limit=5.0 2023-10-14 02:26:57,390 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1565438.0, ans=0.1 2023-10-14 02:27:24,939 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.538e+02 1.782e+02 1.917e+02 2.122e+02 2.951e+02, threshold=3.834e+02, percent-clipped=0.0 2023-10-14 02:27:25,384 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.68 vs. limit=15.0 2023-10-14 02:27:48,748 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=1565624.6666666667, ans=0.05 2023-10-14 02:27:53,284 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1565624.6666666667, ans=0.125 2023-10-14 02:28:13,166 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1565718.0, ans=0.05 2023-10-14 02:28:14,602 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1565718.0, ans=0.125 2023-10-14 02:28:27,367 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.56 vs. limit=15.0 2023-10-14 02:28:29,914 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1565764.6666666667, ans=0.125 2023-10-14 02:29:09,351 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1565904.6666666667, ans=0.07 2023-10-14 02:29:16,532 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1565904.6666666667, ans=0.0 2023-10-14 02:29:23,244 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1565951.3333333333, ans=0.0 2023-10-14 02:29:38,895 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.610e+02 1.856e+02 2.075e+02 2.261e+02 3.083e+02, threshold=4.150e+02, percent-clipped=0.0 2023-10-14 02:30:16,882 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.62 vs. limit=6.0 2023-10-14 02:30:39,357 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1566231.3333333333, ans=0.125 2023-10-14 02:30:47,902 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1566231.3333333333, ans=0.0 2023-10-14 02:31:03,065 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 02:31:07,986 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1566324.6666666667, ans=0.2 2023-10-14 02:31:12,511 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1566324.6666666667, ans=0.125 2023-10-14 02:31:19,757 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.33 vs. 
limit=12.0 2023-10-14 02:31:29,686 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.42 vs. limit=22.5 2023-10-14 02:31:49,517 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.496e+02 1.724e+02 1.889e+02 2.062e+02 3.073e+02, threshold=3.779e+02, percent-clipped=0.0 2023-10-14 02:32:01,576 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 02:32:07,084 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1566511.3333333333, ans=0.2 2023-10-14 02:32:13,630 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1566558.0, ans=0.1 2023-10-14 02:32:38,112 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1566651.3333333333, ans=0.125 2023-10-14 02:32:50,243 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1566698.0, ans=0.1 2023-10-14 02:32:51,101 INFO [train.py:1031] (0/4) Epoch 25, batch 8000, loss[loss=0.1678, simple_loss=0.2383, pruned_loss=0.04865, over 12825.00 frames. ], tot_loss[loss=0.1867, simple_loss=0.2789, pruned_loss=0.04726, over 32218260.57 frames. ], batch size: 440, lr: 1.37e-03, grad_scale: 32.0 2023-10-14 02:32:57,831 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1566698.0, ans=0.2 2023-10-14 02:33:04,923 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.60 vs. limit=10.0 2023-10-14 02:33:25,757 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1566791.3333333333, ans=0.0 2023-10-14 02:33:40,484 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1566884.6666666667, ans=0.125 2023-10-14 02:33:42,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1566884.6666666667, ans=0.125 2023-10-14 02:33:49,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1566884.6666666667, ans=0.1 2023-10-14 02:33:53,220 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=5.93 vs. limit=15.0 2023-10-14 02:33:58,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1566931.3333333333, ans=0.125 2023-10-14 02:34:00,176 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.479e+02 1.696e+02 1.934e+02 2.291e+02 3.190e+02, threshold=3.868e+02, percent-clipped=0.0 2023-10-14 02:34:01,674 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1566931.3333333333, ans=0.1 2023-10-14 02:34:27,680 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.31 vs. 
limit=15.0 2023-10-14 02:34:53,340 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1567071.3333333333, ans=0.125 2023-10-14 02:34:54,779 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1567071.3333333333, ans=0.0 2023-10-14 02:34:59,470 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1567071.3333333333, ans=0.1 2023-10-14 02:35:32,185 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1567211.3333333333, ans=0.125 2023-10-14 02:36:22,003 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 02:36:28,039 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1567398.0, ans=0.125 2023-10-14 02:36:31,690 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.534e+02 1.825e+02 1.991e+02 2.215e+02 3.886e+02, threshold=3.981e+02, percent-clipped=1.0 2023-10-14 02:36:38,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1567398.0, ans=0.0 2023-10-14 02:37:05,367 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1567491.3333333333, ans=0.125 2023-10-14 02:37:13,498 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.92 vs. limit=10.0 2023-10-14 02:37:25,770 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1567538.0, ans=0.07 2023-10-14 02:37:30,677 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.73 vs. limit=6.0 2023-10-14 02:37:33,432 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1567538.0, ans=0.125 2023-10-14 02:37:39,312 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.49 vs. limit=15.0 2023-10-14 02:37:45,705 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1567584.6666666667, ans=0.125 2023-10-14 02:37:58,179 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1567631.3333333333, ans=0.2 2023-10-14 02:38:18,136 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1567678.0, ans=0.125 2023-10-14 02:38:26,680 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1567724.6666666667, ans=0.2 2023-10-14 02:38:30,913 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1567724.6666666667, ans=0.0 2023-10-14 02:38:32,489 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.81 vs. 
limit=22.5 2023-10-14 02:38:34,222 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.37 vs. limit=15.0 2023-10-14 02:38:47,102 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1567771.3333333333, ans=0.125 2023-10-14 02:38:51,517 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1567771.3333333333, ans=0.0 2023-10-14 02:39:06,273 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1567818.0, ans=0.0 2023-10-14 02:39:06,365 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1567818.0, ans=0.125 2023-10-14 02:39:10,977 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1567818.0, ans=0.0 2023-10-14 02:39:18,567 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.480e+02 1.772e+02 1.939e+02 2.097e+02 3.138e+02, threshold=3.877e+02, percent-clipped=0.0 2023-10-14 02:39:29,544 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1567911.3333333333, ans=0.125 2023-10-14 02:39:59,702 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-336000.pt 2023-10-14 02:40:04,193 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1568004.6666666667, ans=0.1 2023-10-14 02:40:31,916 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1568051.3333333333, ans=0.125 2023-10-14 02:40:32,025 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1568051.3333333333, ans=0.0 2023-10-14 02:40:43,078 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1568098.0, ans=0.125 2023-10-14 02:40:54,679 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1568144.6666666667, ans=0.125 2023-10-14 02:40:56,543 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1568144.6666666667, ans=0.1 2023-10-14 02:41:06,768 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1568144.6666666667, ans=0.2 2023-10-14 02:41:11,334 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1568191.3333333333, ans=0.0 2023-10-14 02:41:53,919 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1568331.3333333333, ans=0.0 2023-10-14 02:41:58,539 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.508e+02 1.809e+02 1.989e+02 2.246e+02 3.073e+02, threshold=3.979e+02, percent-clipped=0.0 2023-10-14 02:42:24,575 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1568424.6666666667, ans=0.125 2023-10-14 02:42:27,772 INFO [scaling.py:979] (0/4) Whitening: 
name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.83 vs. limit=15.0 2023-10-14 02:42:41,162 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1568471.3333333333, ans=0.1 2023-10-14 02:42:44,763 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 02:42:49,839 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1568518.0, ans=0.1 2023-10-14 02:43:19,591 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=9.77 vs. limit=22.5 2023-10-14 02:43:32,370 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1568564.6666666667, ans=0.0 2023-10-14 02:43:50,040 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1568611.3333333333, ans=0.125 2023-10-14 02:44:05,410 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1568658.0, ans=0.035 2023-10-14 02:44:05,475 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1568658.0, ans=0.04949747468305833 2023-10-14 02:44:32,777 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1568704.6666666667, ans=0.2 2023-10-14 02:44:38,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1568751.3333333333, ans=0.0 2023-10-14 02:44:47,497 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1568751.3333333333, ans=0.1 2023-10-14 02:44:58,124 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1568751.3333333333, ans=0.125 2023-10-14 02:45:09,007 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.557e+02 1.829e+02 1.947e+02 2.133e+02 3.028e+02, threshold=3.894e+02, percent-clipped=0.0 2023-10-14 02:45:52,108 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1568891.3333333333, ans=0.0 2023-10-14 02:45:53,990 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_positive, batch_count=1568891.3333333333, ans=0.05 2023-10-14 02:45:57,216 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1568938.0, ans=0.125 2023-10-14 02:46:22,975 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1568984.6666666667, ans=0.125 2023-10-14 02:46:36,421 INFO [train.py:1031] (0/4) Epoch 25, batch 8500, loss[loss=0.1935, simple_loss=0.2819, pruned_loss=0.05261, over 16906.00 frames. ], tot_loss[loss=0.1871, simple_loss=0.2795, pruned_loss=0.04737, over 32364383.13 frames. 
], batch size: 82, lr: 1.37e-03, grad_scale: 32.0 2023-10-14 02:46:46,444 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1569031.3333333333, ans=0.125 2023-10-14 02:47:00,265 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1569078.0, ans=0.125 2023-10-14 02:47:12,613 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.77 vs. limit=22.5 2023-10-14 02:47:27,254 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1569124.6666666667, ans=0.125 2023-10-14 02:47:46,775 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1569171.3333333333, ans=0.125 2023-10-14 02:48:20,982 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1569264.6666666667, ans=0.125 2023-10-14 02:48:28,788 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.615e+02 1.918e+02 2.165e+02 2.461e+02 3.318e+02, threshold=4.331e+02, percent-clipped=0.0 2023-10-14 02:48:38,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1569264.6666666667, ans=0.125 2023-10-14 02:48:39,924 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.33 vs. limit=15.0 2023-10-14 02:48:49,003 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1569311.3333333333, ans=0.07 2023-10-14 02:50:04,154 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1569451.3333333333, ans=0.0 2023-10-14 02:51:40,972 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.09 vs. limit=12.0 2023-10-14 02:51:53,391 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.376e+02 1.716e+02 1.941e+02 2.181e+02 3.660e+02, threshold=3.881e+02, percent-clipped=0.0 2023-10-14 02:52:00,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1569731.3333333333, ans=0.125 2023-10-14 02:52:03,611 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.30 vs. 
limit=15.0 2023-10-14 02:52:12,581 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1569778.0, ans=0.125 2023-10-14 02:52:12,673 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1569778.0, ans=0.125 2023-10-14 02:53:45,485 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1569964.6666666667, ans=0.0 2023-10-14 02:54:00,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1569964.6666666667, ans=0.2 2023-10-14 02:54:27,920 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1570058.0, ans=0.0 2023-10-14 02:56:02,562 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.417e+02 1.704e+02 1.849e+02 2.088e+02 2.857e+02, threshold=3.698e+02, percent-clipped=0.0 2023-10-14 02:56:28,866 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.91 vs. limit=10.0 2023-10-14 02:56:41,709 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1570291.3333333333, ans=0.0 2023-10-14 02:56:46,582 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1570291.3333333333, ans=0.025 2023-10-14 02:57:02,849 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1570338.0, ans=0.0 2023-10-14 02:57:03,011 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1570338.0, ans=0.2 2023-10-14 02:57:03,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1570338.0, ans=0.125 2023-10-14 02:57:07,627 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1570338.0, ans=0.1 2023-10-14 02:57:44,879 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1570384.6666666667, ans=0.2 2023-10-14 02:58:06,117 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1570431.3333333333, ans=0.125 2023-10-14 02:58:12,829 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 02:58:42,497 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.58 vs. limit=6.0 2023-10-14 02:58:49,753 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.93 vs. 
limit=22.5 2023-10-14 02:58:51,441 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1570571.3333333333, ans=0.0 2023-10-14 02:58:55,116 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1570618.0, ans=0.125 2023-10-14 02:59:06,840 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1570664.6666666667, ans=0.125 2023-10-14 02:59:14,172 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.410e+02 1.725e+02 1.893e+02 2.096e+02 2.628e+02, threshold=3.785e+02, percent-clipped=0.0 2023-10-14 02:59:25,162 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.28 vs. limit=15.0 2023-10-14 02:59:25,232 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.70 vs. limit=15.0 2023-10-14 02:59:30,518 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1570758.0, ans=0.125 2023-10-14 03:00:06,373 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1570898.0, ans=0.0 2023-10-14 03:00:21,557 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1570944.6666666667, ans=0.2 2023-10-14 03:00:49,802 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1571084.6666666667, ans=0.125 2023-10-14 03:00:52,355 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1571084.6666666667, ans=0.05 2023-10-14 03:00:53,932 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1571084.6666666667, ans=0.125 2023-10-14 03:00:55,877 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1571084.6666666667, ans=0.125 2023-10-14 03:01:05,631 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.804e+02 1.975e+02 2.129e+02 2.551e+02, threshold=3.951e+02, percent-clipped=0.0 2023-10-14 03:01:16,004 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.62 vs. 
limit=15.0 2023-10-14 03:01:17,705 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1571178.0, ans=0.0 2023-10-14 03:01:21,675 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1571224.6666666667, ans=0.0 2023-10-14 03:01:22,575 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1571224.6666666667, ans=0.125 2023-10-14 03:01:39,630 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1571271.3333333333, ans=0.125 2023-10-14 03:01:46,327 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1571318.0, ans=0.2 2023-10-14 03:01:47,026 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1571318.0, ans=0.125 2023-10-14 03:01:54,980 INFO [train.py:1031] (0/4) Epoch 25, batch 9000, loss[loss=0.2082, simple_loss=0.3046, pruned_loss=0.05593, over 16480.00 frames. ], tot_loss[loss=0.1865, simple_loss=0.2789, pruned_loss=0.04707, over 32484725.33 frames. ], batch size: 266, lr: 1.37e-03, grad_scale: 8.0 2023-10-14 03:02:00,014 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1571364.6666666667, ans=0.1 2023-10-14 03:02:07,572 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1571411.3333333333, ans=0.125 2023-10-14 03:02:16,764 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1571411.3333333333, ans=0.125 2023-10-14 03:02:31,483 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.87 vs. limit=12.0 2023-10-14 03:02:54,651 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.45 vs. 
limit=15.0 2023-10-14 03:02:58,759 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.804e+02 2.003e+02 2.238e+02 3.238e+02, threshold=4.007e+02, percent-clipped=0.0 2023-10-14 03:03:15,850 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1571691.3333333333, ans=0.125 2023-10-14 03:03:23,148 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1571691.3333333333, ans=0.125 2023-10-14 03:03:28,397 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1571738.0, ans=0.125 2023-10-14 03:03:33,899 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1571738.0, ans=0.05 2023-10-14 03:04:07,509 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1571924.6666666667, ans=0.04949747468305833 2023-10-14 03:04:11,004 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1571924.6666666667, ans=0.2 2023-10-14 03:04:11,040 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1571924.6666666667, ans=0.125 2023-10-14 03:04:18,586 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1571971.3333333333, ans=0.125 2023-10-14 03:04:30,847 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.32 vs. limit=22.5 2023-10-14 03:04:39,072 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1572018.0, ans=0.0 2023-10-14 03:04:48,486 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.751e+02 1.880e+02 2.112e+02 2.668e+02, threshold=3.761e+02, percent-clipped=0.0 2023-10-14 03:04:52,683 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1572111.3333333333, ans=0.1 2023-10-14 03:05:20,044 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1572204.6666666667, ans=0.125 2023-10-14 03:05:25,131 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1572251.3333333333, ans=0.035 2023-10-14 03:05:40,807 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.78 vs. limit=15.0 2023-10-14 03:05:53,561 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.44 vs. 
limit=15.0 2023-10-14 03:06:02,280 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=1572391.3333333333, ans=0.5 2023-10-14 03:06:03,560 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1572391.3333333333, ans=0.125 2023-10-14 03:06:11,098 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1572438.0, ans=0.125 2023-10-14 03:06:22,487 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=1572484.6666666667, ans=10.0 2023-10-14 03:06:26,736 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1572484.6666666667, ans=0.1 2023-10-14 03:06:34,528 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1572531.3333333333, ans=0.125 2023-10-14 03:06:36,166 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.562e+02 1.842e+02 2.044e+02 2.450e+02 3.527e+02, threshold=4.088e+02, percent-clipped=0.0 2023-10-14 03:06:52,105 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1572624.6666666667, ans=0.125 2023-10-14 03:07:37,169 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.39 vs. limit=22.5 2023-10-14 03:07:39,236 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1572811.3333333333, ans=0.125 2023-10-14 03:08:18,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1572951.3333333333, ans=0.125 2023-10-14 03:08:20,370 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 03:08:32,011 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 1.853e+02 2.051e+02 2.272e+02 2.849e+02, threshold=4.102e+02, percent-clipped=0.0 2023-10-14 03:08:38,763 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.99 vs. 
limit=22.5 2023-10-14 03:08:54,872 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1573091.3333333333, ans=0.125 2023-10-14 03:09:26,324 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1573184.6666666667, ans=0.0 2023-10-14 03:09:28,442 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1573231.3333333333, ans=0.1 2023-10-14 03:09:30,394 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1573231.3333333333, ans=0.07 2023-10-14 03:09:40,235 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1573278.0, ans=0.125 2023-10-14 03:09:47,329 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1573278.0, ans=0.0 2023-10-14 03:09:48,433 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=6.16 vs. limit=15.0 2023-10-14 03:09:55,533 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1573324.6666666667, ans=0.05 2023-10-14 03:10:30,654 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1573464.6666666667, ans=0.125 2023-10-14 03:10:30,942 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.86 vs. limit=22.5 2023-10-14 03:10:33,276 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.832e+02 2.006e+02 2.217e+02 3.200e+02, threshold=4.013e+02, percent-clipped=0.0 2023-10-14 03:10:54,713 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1573558.0, ans=0.0 2023-10-14 03:11:02,726 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1573604.6666666667, ans=10.0 2023-10-14 03:11:27,000 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.06 vs. limit=10.0 2023-10-14 03:11:27,382 INFO [train.py:1031] (0/4) Epoch 25, batch 9500, loss[loss=0.1927, simple_loss=0.2939, pruned_loss=0.04572, over 16584.00 frames. ], tot_loss[loss=0.1869, simple_loss=0.2794, pruned_loss=0.04717, over 32556525.13 frames. 
], batch size: 241, lr: 1.37e-03, grad_scale: 16.0 2023-10-14 03:11:41,385 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1573744.6666666667, ans=0.0 2023-10-14 03:12:18,235 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1573884.6666666667, ans=0.125 2023-10-14 03:12:30,509 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.499e+02 1.841e+02 1.997e+02 2.236e+02 2.993e+02, threshold=3.995e+02, percent-clipped=0.0 2023-10-14 03:12:44,427 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1573978.0, ans=0.0 2023-10-14 03:12:53,019 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1574024.6666666667, ans=0.125 2023-10-14 03:13:15,151 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1574118.0, ans=10.0 2023-10-14 03:13:35,116 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.11 vs. limit=10.0 2023-10-14 03:13:42,179 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1574211.3333333333, ans=0.0 2023-10-14 03:13:46,052 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1574258.0, ans=0.125 2023-10-14 03:13:47,702 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1574258.0, ans=0.125 2023-10-14 03:13:51,409 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1574258.0, ans=0.125 2023-10-14 03:14:01,336 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.78 vs. limit=15.0 2023-10-14 03:14:15,701 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.69 vs. limit=10.0 2023-10-14 03:14:23,843 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.11 vs. limit=12.0 2023-10-14 03:14:24,077 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.506e+02 1.795e+02 1.927e+02 2.122e+02 2.768e+02, threshold=3.853e+02, percent-clipped=0.0 2023-10-14 03:14:38,770 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1574444.6666666667, ans=0.0 2023-10-14 03:14:52,615 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1574538.0, ans=0.1 2023-10-14 03:14:58,236 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.38 vs. 
limit=15.0 2023-10-14 03:15:11,022 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1574584.6666666667, ans=0.125 2023-10-14 03:15:16,508 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 03:15:43,343 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1574724.6666666667, ans=0.125 2023-10-14 03:15:47,395 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1574771.3333333333, ans=0.1 2023-10-14 03:15:59,372 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.26 vs. limit=15.0 2023-10-14 03:16:00,301 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1574818.0, ans=6.0 2023-10-14 03:16:08,009 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1574818.0, ans=0.0 2023-10-14 03:16:17,371 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.532e+02 1.782e+02 1.972e+02 2.170e+02 3.443e+02, threshold=3.944e+02, percent-clipped=0.0 2023-10-14 03:16:24,269 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1574911.3333333333, ans=0.0 2023-10-14 03:16:33,507 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1574958.0, ans=0.125 2023-10-14 03:16:50,784 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1575004.6666666667, ans=0.125 2023-10-14 03:17:03,889 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1575051.3333333333, ans=0.125 2023-10-14 03:17:18,028 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 03:18:09,921 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1575331.3333333333, ans=0.125 2023-10-14 03:18:10,450 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.733e+02 1.922e+02 2.137e+02 3.004e+02, threshold=3.844e+02, percent-clipped=0.0 2023-10-14 03:18:32,661 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1575424.6666666667, ans=0.125 2023-10-14 03:18:54,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1575518.0, ans=0.125 2023-10-14 03:18:58,527 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1575518.0, ans=0.0 2023-10-14 03:18:59,411 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1575518.0, ans=0.2 2023-10-14 03:19:04,609 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1575564.6666666667, ans=0.07 2023-10-14 03:19:26,650 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1575658.0, ans=0.0 
2023-10-14 03:19:30,814 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1575658.0, ans=0.125
2023-10-14 03:19:47,830 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1575751.3333333333, ans=0.0
2023-10-14 03:19:53,542 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1575798.0, ans=0.125
2023-10-14 03:19:53,689 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1575798.0, ans=0.125
2023-10-14 03:20:00,126 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.534e+02 1.788e+02 1.939e+02 2.124e+02 2.768e+02, threshold=3.878e+02, percent-clipped=0.0
2023-10-14 03:20:16,899 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1575891.3333333333, ans=0.125
2023-10-14 03:20:23,810 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1575938.0, ans=0.1
2023-10-14 03:20:25,676 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1575938.0, ans=0.125
2023-10-14 03:20:39,020 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.59 vs. limit=12.0
2023-10-14 03:20:44,448 INFO [train.py:1031] (0/4) Epoch 25, batch 10000, loss[loss=0.1811, simple_loss=0.2705, pruned_loss=0.04583, over 15504.00 frames. ], tot_loss[loss=0.1862, simple_loss=0.2785, pruned_loss=0.04692, over 32594407.65 frames. ], batch size: 35, lr: 1.37e-03, grad_scale: 32.0
2023-10-14 03:20:49,929 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-14 03:20:54,737 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.29 vs. limit=22.5
2023-10-14 03:20:57,565 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1576078.0, ans=0.0
2023-10-14 03:21:17,621 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1576171.3333333333, ans=0.05
2023-10-14 03:21:20,475 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1576171.3333333333, ans=0.125
2023-10-14 03:21:23,400 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1576171.3333333333, ans=0.125
2023-10-14 03:21:25,910 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1576218.0, ans=0.125
2023-10-14 03:21:39,880 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1576264.6666666667, ans=0.0
2023-10-14 03:21:44,036 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.593e+02 1.836e+02 2.001e+02 2.218e+02 3.537e+02, threshold=4.003e+02, percent-clipped=0.0
2023-10-14 03:22:17,739 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.99 vs. limit=22.5
2023-10-14 03:22:19,333 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.02 vs. limit=12.0
2023-10-14 03:22:58,471 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1576591.3333333333, ans=0.125
2023-10-14 03:23:21,643 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1576684.6666666667, ans=0.125
2023-10-14 03:23:37,256 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.434e+02 1.807e+02 1.989e+02 2.260e+02 3.052e+02, threshold=3.977e+02, percent-clipped=0.0
2023-10-14 03:23:48,281 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1576778.0, ans=0.125
2023-10-14 03:23:53,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1576824.6666666667, ans=0.0
2023-10-14 03:23:53,798 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.15 vs. limit=6.0
2023-10-14 03:23:57,093 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1576824.6666666667, ans=0.125
2023-10-14 03:24:06,797 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1576871.3333333333, ans=0.2
2023-10-14 03:24:10,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1576918.0, ans=0.125
2023-10-14 03:24:51,394 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1577058.0, ans=0.125
2023-10-14 03:25:01,735 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1577104.6666666667, ans=0.125
2023-10-14 03:25:11,716 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1577151.3333333333, ans=0.07
2023-10-14 03:25:12,746 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1577151.3333333333, ans=0.125
2023-10-14 03:25:14,477 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1577151.3333333333, ans=0.1
2023-10-14 03:25:16,002 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1577151.3333333333, ans=0.125
2023-10-14 03:25:17,884 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1577151.3333333333, ans=0.125
2023-10-14 03:25:26,198 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.51 vs. limit=6.0
2023-10-14 03:25:28,349 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.562e+02 1.828e+02 2.014e+02 2.193e+02 3.297e+02, threshold=4.028e+02, percent-clipped=0.0
2023-10-14 03:25:43,067 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.33 vs. limit=22.5
2023-10-14 03:25:54,944 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.74 vs. limit=15.0
2023-10-14 03:26:17,051 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.46 vs. limit=10.0
2023-10-14 03:26:31,717 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.98 vs. limit=15.0
2023-10-14 03:26:32,371 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1577478.0, ans=0.125
2023-10-14 03:26:42,002 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1577524.6666666667, ans=0.125
2023-10-14 03:26:43,026 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1577524.6666666667, ans=0.125
2023-10-14 03:26:44,552 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.57 vs. limit=15.0
2023-10-14 03:27:11,795 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.36 vs. limit=10.0
2023-10-14 03:27:17,523 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1577664.6666666667, ans=0.0
2023-10-14 03:27:23,620 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.846e+02 1.968e+02 2.166e+02 2.797e+02, threshold=3.935e+02, percent-clipped=0.0
2023-10-14 03:27:25,845 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1577711.3333333333, ans=0.1
2023-10-14 03:27:39,492 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1577758.0, ans=0.0
2023-10-14 03:27:41,822 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1577758.0, ans=0.125
2023-10-14 03:28:18,815 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0
2023-10-14 03:28:19,882 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0
2023-10-14 03:28:23,971 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1577944.6666666667, ans=0.2
2023-10-14 03:28:29,145 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=13.23 vs. limit=15.0
2023-10-14 03:28:43,473 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.94 vs. limit=6.0
2023-10-14 03:29:03,039 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.81 vs. limit=15.0
2023-10-14 03:29:07,029 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1578131.3333333333, ans=0.125
2023-10-14 03:29:11,003 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1578131.3333333333, ans=0.0
2023-10-14 03:29:13,872 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1578131.3333333333, ans=0.1
2023-10-14 03:29:15,485 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.747e+02 1.890e+02 2.109e+02 2.629e+02, threshold=3.781e+02, percent-clipped=0.0
2023-10-14 03:29:32,517 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1578224.6666666667, ans=0.95
2023-10-14 03:29:41,159 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1578271.3333333333, ans=0.125
2023-10-14 03:29:47,175 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1578271.3333333333, ans=0.125
2023-10-14 03:29:53,142 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1578318.0, ans=0.04949747468305833
2023-10-14 03:29:53,178 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1578318.0, ans=0.125
2023-10-14 03:29:59,035 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.70 vs. limit=6.0
2023-10-14 03:30:01,209 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1578318.0, ans=0.125
2023-10-14 03:30:02,928 INFO [train.py:1031] (0/4) Epoch 25, batch 10500, loss[loss=0.1736, simple_loss=0.2645, pruned_loss=0.0413, over 16203.00 frames. ], tot_loss[loss=0.1866, simple_loss=0.279, pruned_loss=0.04712, over 32642941.48 frames. ], batch size: 44, lr: 1.37e-03, grad_scale: 16.0
2023-10-14 03:30:11,335 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.64 vs. limit=15.0
2023-10-14 03:30:11,928 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1578411.3333333333, ans=0.125
2023-10-14 03:30:23,781 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1578458.0, ans=0.0
2023-10-14 03:30:24,928 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1578458.0, ans=0.125
2023-10-14 03:30:26,761 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1578458.0, ans=0.0
2023-10-14 03:30:55,605 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1578598.0, ans=0.5
2023-10-14 03:30:58,073 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1578598.0, ans=0.0
2023-10-14 03:30:58,769 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1578598.0, ans=0.125
2023-10-14 03:31:00,680 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1578598.0, ans=0.1
2023-10-14 03:31:00,767 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1578598.0, ans=0.125
2023-10-14 03:31:04,580 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.529e+02 1.947e+02 2.191e+02 2.473e+02 3.920e+02, threshold=4.382e+02, percent-clipped=1.0
2023-10-14 03:31:05,319 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.25 vs. limit=15.0
2023-10-14 03:31:25,160 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1578691.3333333333, ans=0.0
2023-10-14 03:31:50,534 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.71 vs. limit=15.0
2023-10-14 03:31:59,613 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1578831.3333333333, ans=0.0
2023-10-14 03:32:16,799 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1578878.0, ans=0.1
2023-10-14 03:32:17,703 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1578878.0, ans=0.0
2023-10-14 03:32:30,099 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1578924.6666666667, ans=0.125
2023-10-14 03:32:40,926 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1578971.3333333333, ans=0.0
2023-10-14 03:32:58,360 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.30 vs. limit=15.0
2023-10-14 03:33:03,693 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.536e+02 1.861e+02 1.992e+02 2.125e+02 2.838e+02, threshold=3.984e+02, percent-clipped=0.0
2023-10-14 03:33:45,328 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1579251.3333333333, ans=0.0
2023-10-14 03:33:47,827 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1579251.3333333333, ans=0.125
2023-10-14 03:33:51,498 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.27 vs. limit=22.5
2023-10-14 03:34:15,319 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1579391.3333333333, ans=0.2
2023-10-14 03:34:49,043 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.89 vs. limit=15.0
2023-10-14 03:34:54,122 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.04 vs. limit=12.0
2023-10-14 03:34:54,529 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.476e+02 1.813e+02 1.997e+02 2.193e+02 3.314e+02, threshold=3.994e+02, percent-clipped=0.0
2023-10-14 03:34:56,586 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1579578.0, ans=0.125
2023-10-14 03:34:57,620 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff3.min_abs, batch_count=1579578.0, ans=0.2
2023-10-14 03:35:07,520 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.61 vs. limit=15.0
2023-10-14 03:35:24,624 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1579671.3333333333, ans=0.125
2023-10-14 03:35:29,783 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.55 vs. limit=15.0
2023-10-14 03:35:43,107 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1579764.6666666667, ans=0.125
2023-10-14 03:35:45,125 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1579764.6666666667, ans=0.1
2023-10-14 03:35:50,504 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. limit=6.0
2023-10-14 03:35:55,013 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.22 vs. limit=15.0
2023-10-14 03:35:56,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1579811.3333333333, ans=0.1
2023-10-14 03:36:01,975 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1579858.0, ans=0.125
2023-10-14 03:36:11,716 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1579904.6666666667, ans=0.0
2023-10-14 03:36:11,766 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=1579904.6666666667, ans=0.5
2023-10-14 03:36:21,490 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1579951.3333333333, ans=0.125
2023-10-14 03:36:27,858 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=1579951.3333333333, ans=0.5
2023-10-14 03:36:34,063 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1579998.0, ans=0.07
2023-10-14 03:36:34,218 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.56 vs. limit=15.0
2023-10-14 03:36:34,868 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1579998.0, ans=0.125
2023-10-14 03:36:34,894 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1579998.0, ans=0.125
2023-10-14 03:36:37,826 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.18 vs. limit=15.0
2023-10-14 03:36:41,815 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.547e+02 1.897e+02 2.045e+02 2.310e+02 3.292e+02, threshold=4.089e+02, percent-clipped=0.0
2023-10-14 03:37:29,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1580231.3333333333, ans=0.1
2023-10-14 03:37:34,274 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1580231.3333333333, ans=0.125
2023-10-14 03:37:47,307 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1580278.0, ans=0.0
2023-10-14 03:37:56,935 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.60 vs. limit=22.5
2023-10-14 03:38:34,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1580464.6666666667, ans=0.125
2023-10-14 03:38:35,906 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.722e+02 1.877e+02 2.049e+02 3.003e+02, threshold=3.753e+02, percent-clipped=0.0
2023-10-14 03:38:41,553 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1580511.3333333333, ans=0.2
2023-10-14 03:38:49,423 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1580558.0, ans=0.1
2023-10-14 03:38:50,410 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1580558.0, ans=0.0
2023-10-14 03:38:52,660 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1580558.0, ans=0.125
2023-10-14 03:39:10,942 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1580651.3333333333, ans=0.1
2023-10-14 03:39:18,111 INFO [train.py:1031] (0/4) Epoch 25, batch 11000, loss[loss=0.1917, simple_loss=0.2851, pruned_loss=0.04916, over 16645.00 frames. ], tot_loss[loss=0.1868, simple_loss=0.2791, pruned_loss=0.04727, over 32675153.13 frames. ], batch size: 61, lr: 1.37e-03, grad_scale: 16.0
2023-10-14 03:39:42,472 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1580791.3333333333, ans=0.125
2023-10-14 03:39:49,847 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1580838.0, ans=0.0
2023-10-14 03:40:05,577 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.46 vs. limit=22.5
2023-10-14 03:40:16,020 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1580931.3333333333, ans=0.125
2023-10-14 03:40:25,023 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.598e+02 1.890e+02 2.048e+02 2.245e+02 3.372e+02, threshold=4.097e+02, percent-clipped=0.0
2023-10-14 03:41:16,300 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.min_positive, batch_count=1581164.6666666667, ans=0.025
2023-10-14 03:41:24,175 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.44 vs. limit=15.0
2023-10-14 03:41:39,334 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.40 vs. limit=6.0
2023-10-14 03:42:20,971 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-14 03:42:25,529 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.55 vs. limit=12.0
2023-10-14 03:42:28,012 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.716e+02 1.873e+02 2.034e+02 2.622e+02, threshold=3.747e+02, percent-clipped=0.0
2023-10-14 03:42:43,536 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1581491.3333333333, ans=0.0
2023-10-14 03:42:49,274 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1581491.3333333333, ans=0.125
2023-10-14 03:42:51,899 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1581538.0, ans=0.0
2023-10-14 03:42:53,023 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1581538.0, ans=0.0
2023-10-14 03:42:57,808 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1581538.0, ans=0.125
2023-10-14 03:43:01,312 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1581538.0, ans=0.0
2023-10-14 03:43:06,783 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1581584.6666666667, ans=0.0
2023-10-14 03:43:14,397 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.86 vs. limit=15.0
2023-10-14 03:43:15,600 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1581631.3333333333, ans=0.1
2023-10-14 03:43:23,712 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1581678.0, ans=0.2
2023-10-14 03:43:34,290 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1581724.6666666667, ans=0.125
2023-10-14 03:43:35,262 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1581724.6666666667, ans=0.125
2023-10-14 03:43:40,124 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.82 vs. limit=10.0
2023-10-14 03:43:55,987 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1581818.0, ans=0.125
2023-10-14 03:44:00,212 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1581818.0, ans=0.0
2023-10-14 03:44:07,133 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.615e-03
2023-10-14 03:44:12,790 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1581864.6666666667, ans=0.0
2023-10-14 03:44:18,986 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.452e+02 1.804e+02 1.994e+02 2.206e+02 3.489e+02, threshold=3.988e+02, percent-clipped=0.0
2023-10-14 03:44:29,318 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=9.83 vs. limit=22.5
2023-10-14 03:44:31,198 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1581958.0, ans=0.1
2023-10-14 03:44:36,790 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1581958.0, ans=0.1
2023-10-14 03:44:38,553 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.85 vs. limit=12.0
2023-10-14 03:44:45,436 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1582004.6666666667, ans=0.035
2023-10-14 03:44:57,624 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1582051.3333333333, ans=0.2
2023-10-14 03:45:24,159 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1582144.6666666667, ans=0.0
2023-10-14 03:45:24,417 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=5.45 vs. limit=15.0
2023-10-14 03:45:50,272 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.26 vs. limit=22.5
2023-10-14 03:46:17,758 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.479e+02 1.840e+02 2.018e+02 2.259e+02 3.158e+02, threshold=4.037e+02, percent-clipped=0.0
2023-10-14 03:46:40,996 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1582471.3333333333, ans=0.125
2023-10-14 03:46:42,834 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1582471.3333333333, ans=0.125
2023-10-14 03:46:44,334 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1582471.3333333333, ans=0.1
2023-10-14 03:46:44,343 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1582471.3333333333, ans=0.125
2023-10-14 03:46:55,779 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1582518.0, ans=0.1
2023-10-14 03:47:02,538 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.50 vs. limit=15.0
2023-10-14 03:47:06,689 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1582564.6666666667, ans=0.1
2023-10-14 03:47:06,730 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1582564.6666666667, ans=0.0
2023-10-14 03:47:11,163 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0
2023-10-14 03:47:14,543 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1582611.3333333333, ans=0.2
2023-10-14 03:47:47,492 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1582704.6666666667, ans=0.0
2023-10-14 03:47:57,037 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1582751.3333333333, ans=0.1
2023-10-14 03:48:01,743 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1582798.0, ans=0.0
2023-10-14 03:48:07,231 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1582798.0, ans=0.125
2023-10-14 03:48:11,546 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.612e+02 1.891e+02 2.027e+02 2.213e+02 3.067e+02, threshold=4.055e+02, percent-clipped=0.0
2023-10-14 03:48:15,655 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1582844.6666666667, ans=0.95
2023-10-14 03:48:15,841 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1582844.6666666667, ans=0.125
2023-10-14 03:48:28,719 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1582891.3333333333, ans=0.125
2023-10-14 03:48:30,597 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-14 03:48:41,653 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1582938.0, ans=0.0
2023-10-14 03:48:52,476 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.57 vs. limit=15.0
2023-10-14 03:48:56,739 INFO [train.py:1031] (0/4) Epoch 25, batch 11500, loss[loss=0.1928, simple_loss=0.293, pruned_loss=0.04628, over 16612.00 frames. ], tot_loss[loss=0.1866, simple_loss=0.2788, pruned_loss=0.04719, over 32686974.71 frames. ], batch size: 219, lr: 1.37e-03, grad_scale: 32.0
2023-10-14 03:49:17,595 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.88 vs. limit=15.0
2023-10-14 03:49:36,519 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1583171.3333333333, ans=0.125
2023-10-14 03:49:56,307 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1583264.6666666667, ans=0.125
2023-10-14 03:50:04,142 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1583311.3333333333, ans=0.1
2023-10-14 03:50:04,881 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.473e+02 1.829e+02 2.014e+02 2.258e+02 2.902e+02, threshold=4.027e+02, percent-clipped=0.0
2023-10-14 03:50:26,663 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1583404.6666666667, ans=0.125
2023-10-14 03:50:39,535 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1583404.6666666667, ans=0.0
2023-10-14 03:50:45,265 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1583451.3333333333, ans=0.025
2023-10-14 03:50:52,461 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1583451.3333333333, ans=0.125
2023-10-14 03:51:02,051 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1583498.0, ans=0.125
2023-10-14 03:51:08,831 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.38 vs. limit=22.5
2023-10-14 03:51:26,784 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1583591.3333333333, ans=0.125
2023-10-14 03:51:31,370 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1583638.0, ans=0.2
2023-10-14 03:51:45,398 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1583684.6666666667, ans=0.125
2023-10-14 03:52:02,297 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.486e+02 1.784e+02 1.928e+02 2.155e+02 4.130e+02, threshold=3.856e+02, percent-clipped=1.0
2023-10-14 03:52:04,726 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1583778.0, ans=0.0
2023-10-14 03:52:07,695 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.21 vs. limit=15.0
2023-10-14 03:52:08,440 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1583778.0, ans=0.125
2023-10-14 03:52:18,781 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1583824.6666666667, ans=0.0
2023-10-14 03:52:20,686 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1583824.6666666667, ans=0.0
2023-10-14 03:52:26,889 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.36 vs. limit=15.0
2023-10-14 03:52:29,302 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.44 vs. limit=10.0
2023-10-14 03:53:08,492 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1584058.0, ans=0.0
2023-10-14 03:53:14,551 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1584058.0, ans=15.0
2023-10-14 03:54:01,048 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.438e+02 1.811e+02 1.953e+02 2.098e+02 2.891e+02, threshold=3.906e+02, percent-clipped=0.0
2023-10-14 03:54:01,995 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1584244.6666666667, ans=0.125
2023-10-14 03:54:22,881 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.18 vs. limit=12.0
2023-10-14 03:54:44,407 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1584384.6666666667, ans=0.125
2023-10-14 03:55:05,055 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1584478.0, ans=0.2
2023-10-14 03:55:12,152 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1584478.0, ans=0.2
2023-10-14 03:55:14,443 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=3.63 vs. limit=15.0
2023-10-14 03:55:17,945 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1584524.6666666667, ans=0.2
2023-10-14 03:55:38,851 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.59 vs. limit=6.0
2023-10-14 03:55:39,465 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1584618.0, ans=0.125
2023-10-14 03:55:42,198 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1584618.0, ans=0.1
2023-10-14 03:55:43,379 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1584618.0, ans=0.125
2023-10-14 03:55:46,396 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1584618.0, ans=0.125
2023-10-14 03:55:51,850 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1584664.6666666667, ans=0.125
2023-10-14 03:55:52,813 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1584664.6666666667, ans=0.2
2023-10-14 03:56:00,916 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.524e+02 1.807e+02 1.991e+02 2.168e+02 2.837e+02, threshold=3.983e+02, percent-clipped=0.0
2023-10-14 03:56:04,939 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1584711.3333333333, ans=0.1
2023-10-14 03:56:07,760 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1584711.3333333333, ans=0.2
2023-10-14 03:56:29,128 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1584804.6666666667, ans=0.125
2023-10-14 03:56:30,575 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1584804.6666666667, ans=0.1
2023-10-14 03:56:32,170 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.32 vs. limit=15.0
2023-10-14 03:56:37,611 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1584851.3333333333, ans=0.125
2023-10-14 03:56:44,858 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1584851.3333333333, ans=0.125
2023-10-14 03:56:54,165 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1584898.0, ans=0.125
2023-10-14 03:57:03,348 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1584944.6666666667, ans=0.125
2023-10-14 03:57:03,462 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1584944.6666666667, ans=0.0
2023-10-14 03:57:07,288 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1584944.6666666667, ans=0.125
2023-10-14 03:57:18,141 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1584991.3333333333, ans=0.1
2023-10-14 03:57:22,505 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.22 vs. limit=15.0
2023-10-14 03:57:25,022 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1585038.0, ans=0.2
2023-10-14 03:57:27,239 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1585038.0, ans=0.125
2023-10-14 03:57:48,704 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1585131.3333333333, ans=0.125
2023-10-14 03:57:53,716 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1585131.3333333333, ans=0.125
2023-10-14 03:57:57,282 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.764e+02 1.945e+02 2.100e+02 2.828e+02, threshold=3.889e+02, percent-clipped=0.0
2023-10-14 03:58:04,353 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1585178.0, ans=0.125
2023-10-14 03:58:05,309 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1585178.0, ans=0.125
2023-10-14 03:58:31,599 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1585318.0, ans=0.015
2023-10-14 03:58:41,683 INFO [train.py:1031] (0/4) Epoch 25, batch 12000, loss[loss=0.1859, simple_loss=0.2812, pruned_loss=0.04528, over 16915.00 frames. ], tot_loss[loss=0.1863, simple_loss=0.279, pruned_loss=0.04683, over 32735358.15 frames. ], batch size: 138, lr: 1.36e-03, grad_scale: 32.0
2023-10-14 03:58:44,675 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1585364.6666666667, ans=0.025
2023-10-14 03:59:06,611 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.95 vs. limit=10.0
2023-10-14 03:59:14,448 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.35 vs. limit=12.0
2023-10-14 03:59:19,773 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1585504.6666666667, ans=0.0
2023-10-14 03:59:21,124 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.14 vs. limit=22.5
2023-10-14 03:59:22,562 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1585504.6666666667, ans=0.125
2023-10-14 03:59:23,752 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1585504.6666666667, ans=0.2
2023-10-14 03:59:42,775 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1585598.0, ans=0.1
2023-10-14 03:59:52,746 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.456e+02 1.840e+02 2.019e+02 2.275e+02 3.419e+02, threshold=4.037e+02, percent-clipped=0.0
2023-10-14 04:00:02,628 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1585644.6666666667, ans=0.05
2023-10-14 04:00:05,308 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1585691.3333333333, ans=0.0
2023-10-14 04:00:16,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1585738.0, ans=0.125
2023-10-14 04:00:20,204 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.43 vs. limit=15.0
2023-10-14 04:00:22,132 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1585738.0, ans=0.0
2023-10-14 04:00:23,296 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1585738.0, ans=0.0
2023-10-14 04:00:33,864 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1585784.6666666667, ans=0.0
2023-10-14 04:00:37,521 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.42 vs. limit=22.5
2023-10-14 04:00:42,326 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1585831.3333333333, ans=0.0
2023-10-14 04:00:59,031 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1585924.6666666667, ans=0.1
2023-10-14 04:01:09,243 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1585971.3333333333, ans=0.0
2023-10-14 04:01:19,656 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1586018.0, ans=0.0
2023-10-14 04:01:24,214 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1586018.0, ans=0.0
2023-10-14 04:01:44,341 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.566e+02 1.768e+02 1.931e+02 2.169e+02 3.392e+02, threshold=3.863e+02, percent-clipped=0.0
2023-10-14 04:01:48,272 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1586111.3333333333, ans=0.125
2023-10-14 04:02:01,545 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=5.28 vs. limit=15.0
2023-10-14 04:02:06,204 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1586204.6666666667, ans=0.0
2023-10-14 04:02:06,578 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.38 vs. limit=15.0
2023-10-14 04:02:19,103 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1586251.3333333333, ans=0.1
2023-10-14 04:02:24,941 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.60 vs. limit=15.0
2023-10-14 04:02:32,250 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.26 vs. limit=22.5
2023-10-14 04:02:37,007 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1586298.0, ans=0.125
2023-10-14 04:02:52,839 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1586391.3333333333, ans=0.0
2023-10-14 04:02:53,706 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1586391.3333333333, ans=0.125
2023-10-14 04:03:04,271 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.95 vs. limit=15.0
2023-10-14 04:03:08,644 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1586438.0, ans=0.125
2023-10-14 04:03:08,763 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1586438.0, ans=0.125
2023-10-14 04:03:21,380 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-10-14 04:03:30,186 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1586531.3333333333, ans=0.125
2023-10-14 04:03:33,565 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.620e+02 1.854e+02 2.003e+02 2.173e+02 4.272e+02, threshold=4.005e+02, percent-clipped=1.0
2023-10-14 04:03:55,259 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1586671.3333333333, ans=0.125
2023-10-14 04:03:56,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1586671.3333333333, ans=0.125
2023-10-14 04:03:56,572 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-14 04:03:59,215 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1586671.3333333333, ans=0.0
2023-10-14 04:04:24,279 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1586764.6666666667, ans=0.2
2023-10-14 04:04:32,809 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1586811.3333333333, ans=0.0
2023-10-14 04:04:41,599 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1586858.0, ans=0.125
2023-10-14 04:04:45,515 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1586858.0, ans=0.125
2023-10-14 04:04:49,665 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1586858.0, ans=0.04949747468305833
2023-10-14 04:05:06,618 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1586951.3333333333, ans=0.125
2023-10-14 04:05:11,238 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1586951.3333333333, ans=0.0
2023-10-14 04:05:12,403 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.22 vs. limit=22.5
2023-10-14 04:05:16,504 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1586998.0, ans=0.0
2023-10-14 04:05:22,956 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.34 vs. limit=15.0
2023-10-14 04:05:25,738 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.83 vs. limit=22.5
2023-10-14 04:05:27,557 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1587044.6666666667, ans=0.1
2023-10-14 04:05:28,265 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.456e+02 1.787e+02 1.905e+02 2.121e+02 2.986e+02, threshold=3.810e+02, percent-clipped=0.0
2023-10-14 04:05:28,907 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1587044.6666666667, ans=0.0
2023-10-14 04:06:02,985 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.18 vs. limit=15.0
2023-10-14 04:06:35,524 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.53 vs. limit=15.0
2023-10-14 04:06:40,978 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.16 vs. limit=15.0
2023-10-14 04:06:47,015 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1587371.3333333333, ans=0.125
2023-10-14 04:06:50,723 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1587371.3333333333, ans=0.1
2023-10-14 04:07:22,014 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1587464.6666666667, ans=0.0
2023-10-14 04:07:24,633 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.843e+02 1.993e+02 2.239e+02 3.740e+02, threshold=3.986e+02, percent-clipped=0.0
2023-10-14 04:07:26,335 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1587511.3333333333, ans=0.0
2023-10-14 04:07:26,344 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1587511.3333333333, ans=0.125
2023-10-14 04:07:30,638 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.19 vs. limit=15.0
2023-10-14 04:08:09,073 INFO [train.py:1031] (0/4) Epoch 25, batch 12500, loss[loss=0.2126, simple_loss=0.3028, pruned_loss=0.06117, over 16660.00 frames. ], tot_loss[loss=0.1863, simple_loss=0.2787, pruned_loss=0.04692, over 32742519.16 frames. ], batch size: 241, lr: 1.36e-03, grad_scale: 32.0
2023-10-14 04:08:10,030 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1587698.0, ans=0.2
2023-10-14 04:08:23,924 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.95 vs. limit=15.0
2023-10-14 04:08:40,613 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1587791.3333333333, ans=0.125
2023-10-14 04:09:06,181 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=7.95 vs. limit=22.5
2023-10-14 04:09:15,885 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.540e+02 1.752e+02 1.888e+02 2.132e+02 2.843e+02, threshold=3.776e+02, percent-clipped=0.0
2023-10-14 04:09:18,087 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1587978.0, ans=0.125
2023-10-14 04:09:22,918 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1587978.0, ans=0.125
2023-10-14 04:09:32,214 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1588024.6666666667, ans=0.125
2023-10-14 04:09:58,072 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1588118.0, ans=0.125
2023-10-14 04:10:11,289 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1588211.3333333333, ans=0.125
2023-10-14 04:10:16,047 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1588211.3333333333, ans=0.125
2023-10-14 04:10:18,800 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1588211.3333333333, ans=0.125
2023-10-14 04:10:26,991 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1588258.0, ans=0.0
2023-10-14 04:10:27,187 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.12 vs. limit=15.0
2023-10-14 04:10:29,813 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.49 vs. limit=15.0
2023-10-14 04:10:37,926 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1588304.6666666667, ans=0.0
2023-10-14 04:11:03,127 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1588398.0, ans=0.0
2023-10-14 04:11:10,004 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.841e+02 2.060e+02 2.337e+02 3.341e+02, threshold=4.120e+02, percent-clipped=0.0
2023-10-14 04:11:13,553 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.45 vs. limit=12.0
2023-10-14 04:11:42,338 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.32 vs. limit=6.0
2023-10-14 04:12:02,273 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=1588678.0, ans=0.025
2023-10-14 04:12:05,179 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1588678.0, ans=0.125
2023-10-14 04:12:42,309 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1588818.0, ans=0.0
2023-10-14 04:12:57,346 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.829e+02 2.004e+02 2.201e+02 3.015e+02, threshold=4.008e+02, percent-clipped=0.0
2023-10-14 04:13:28,028 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.20 vs. limit=10.0
2023-10-14 04:13:47,304 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1589144.6666666667, ans=0.1
2023-10-14 04:13:51,105 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.19 vs. limit=15.0
2023-10-14 04:14:43,206 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1589331.3333333333, ans=0.2
2023-10-14 04:14:43,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1589331.3333333333, ans=0.125
2023-10-14 04:14:44,292 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1589378.0, ans=0.0
2023-10-14 04:14:48,254 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.826e+02 1.979e+02 2.232e+02 2.883e+02, threshold=3.958e+02, percent-clipped=0.0
2023-10-14 04:14:57,111 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.31 vs. limit=12.0
2023-10-14 04:14:58,023 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.38 vs. limit=15.0
2023-10-14 04:15:03,854 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.30 vs.
limit=22.5 2023-10-14 04:15:12,340 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 04:15:18,519 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 04:15:20,802 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1589518.0, ans=0.125 2023-10-14 04:15:38,526 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1589564.6666666667, ans=0.125 2023-10-14 04:15:42,016 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1589611.3333333333, ans=0.125 2023-10-14 04:15:51,028 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1589658.0, ans=0.0 2023-10-14 04:15:56,517 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1589658.0, ans=0.1 2023-10-14 04:16:01,419 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1589704.6666666667, ans=0.125 2023-10-14 04:16:01,490 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1589704.6666666667, ans=0.2 2023-10-14 04:16:38,397 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.778e+02 1.891e+02 2.122e+02 2.931e+02, threshold=3.782e+02, percent-clipped=0.0 2023-10-14 04:16:43,160 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1589844.6666666667, ans=0.2 2023-10-14 04:16:48,611 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1589891.3333333333, ans=0.125 2023-10-14 04:17:01,722 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.78 vs. limit=15.0 2023-10-14 04:17:03,867 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.46 vs. limit=15.0 2023-10-14 04:17:10,875 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1589984.6666666667, ans=0.125 2023-10-14 04:17:12,854 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1589984.6666666667, ans=0.0 2023-10-14 04:17:13,621 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1589984.6666666667, ans=0.0 2023-10-14 04:17:16,714 INFO [train.py:1031] (0/4) Epoch 25, batch 13000, loss[loss=0.1921, simple_loss=0.2837, pruned_loss=0.05027, over 16974.00 frames. ], tot_loss[loss=0.1866, simple_loss=0.2794, pruned_loss=0.04692, over 32780572.54 frames. 
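The per-batch summaries logged by train.py report three figures (loss, simple_loss, pruned_loss) that are tied together by the pruned-transducer objective. A minimal sketch of the relation, assuming a simple-loss weight of 0.5, which is consistent with the figures logged here (the helper name below is illustrative, not taken from train.py):

    def total_loss(simple_loss: float, pruned_loss: float,
                   simple_loss_scale: float = 0.5) -> float:
        # Weighted combination reported as "loss" in the per-batch log lines.
        # simple_loss_scale = 0.5 is an assumption inferred from the logged numbers.
        return simple_loss_scale * simple_loss + pruned_loss

    # Spot-check against the Epoch 25, batch 13000 summary above:
    # 0.5 * 0.2794 + 0.04692 = 0.1866, the logged tot_loss.
    assert abs(total_loss(0.2794, 0.04692) - 0.1866) < 5e-4

The same relation holds for the batch 12500 summary earlier in the log (0.5 * 0.2787 + 0.04692 ≈ 0.1863), which suggests the fully warmed-up weighting is in effect at this stage of training.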
], batch size: 117, lr: 1.36e-03, grad_scale: 16.0 2023-10-14 04:17:21,415 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1590031.3333333333, ans=0.0 2023-10-14 04:17:22,342 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1590031.3333333333, ans=0.125 2023-10-14 04:17:27,748 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1590078.0, ans=0.125 2023-10-14 04:17:31,328 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 04:17:33,178 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1590078.0, ans=0.0 2023-10-14 04:17:49,363 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1590124.6666666667, ans=0.125 2023-10-14 04:17:49,364 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1590124.6666666667, ans=0.2 2023-10-14 04:17:55,194 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.53 vs. limit=15.0 2023-10-14 04:18:05,886 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.92 vs. limit=15.0 2023-10-14 04:18:08,428 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1590218.0, ans=0.125 2023-10-14 04:18:11,140 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1590218.0, ans=0.125 2023-10-14 04:18:12,946 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.96 vs. limit=15.0 2023-10-14 04:18:16,182 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1590218.0, ans=0.125 2023-10-14 04:18:28,325 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 04:18:36,054 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.529e+02 1.848e+02 2.011e+02 2.291e+02 3.112e+02, threshold=4.022e+02, percent-clipped=0.0 2023-10-14 04:18:37,086 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.60 vs. limit=15.0 2023-10-14 04:18:48,987 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.67 vs. 
limit=6.0 2023-10-14 04:19:04,481 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1590404.6666666667, ans=0.125 2023-10-14 04:19:14,042 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1590451.3333333333, ans=0.125 2023-10-14 04:19:27,966 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1590498.0, ans=0.125 2023-10-14 04:19:31,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1590544.6666666667, ans=0.07 2023-10-14 04:19:32,254 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1590544.6666666667, ans=0.0 2023-10-14 04:19:50,683 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.57 vs. limit=15.0 2023-10-14 04:20:02,616 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.01 vs. limit=12.0 2023-10-14 04:20:23,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1590731.3333333333, ans=0.125 2023-10-14 04:20:32,181 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.813e+02 2.024e+02 2.276e+02 3.438e+02, threshold=4.049e+02, percent-clipped=0.0 2023-10-14 04:21:01,176 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.82 vs. limit=15.0 2023-10-14 04:21:16,282 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1590964.6666666667, ans=0.125 2023-10-14 04:21:17,143 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1590964.6666666667, ans=0.1 2023-10-14 04:21:24,408 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1590964.6666666667, ans=0.0 2023-10-14 04:21:31,014 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1591011.3333333333, ans=0.2 2023-10-14 04:21:37,253 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1591058.0, ans=0.125 2023-10-14 04:21:43,873 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.15 vs. limit=12.0 2023-10-14 04:21:48,822 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1591104.6666666667, ans=10.0 2023-10-14 04:21:51,825 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.73 vs. 
limit=15.0 2023-10-14 04:21:54,387 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1591104.6666666667, ans=0.125 2023-10-14 04:22:14,125 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1591198.0, ans=0.1 2023-10-14 04:22:23,089 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.459e+02 1.739e+02 1.959e+02 2.224e+02 3.224e+02, threshold=3.918e+02, percent-clipped=0.0 2023-10-14 04:22:23,341 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1591244.6666666667, ans=0.015 2023-10-14 04:22:50,853 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1591338.0, ans=0.1 2023-10-14 04:23:00,285 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1591384.6666666667, ans=0.0 2023-10-14 04:23:04,387 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 04:23:06,809 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1591431.3333333333, ans=0.0 2023-10-14 04:23:17,818 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1591478.0, ans=0.125 2023-10-14 04:23:33,344 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1591524.6666666667, ans=0.0 2023-10-14 04:23:33,630 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1.whitening_limit, batch_count=1591524.6666666667, ans=10.0 2023-10-14 04:23:35,819 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1591524.6666666667, ans=0.125 2023-10-14 04:23:36,802 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1591524.6666666667, ans=0.125 2023-10-14 04:23:48,720 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1591618.0, ans=0.0 2023-10-14 04:23:51,054 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1591618.0, ans=0.1 2023-10-14 04:24:13,842 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.862e+02 2.005e+02 2.165e+02 2.827e+02, threshold=4.011e+02, percent-clipped=0.0 2023-10-14 04:24:28,533 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1591758.0, ans=0.0 2023-10-14 04:24:57,313 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1591898.0, ans=0.125 2023-10-14 04:25:02,758 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1591898.0, ans=0.125 2023-10-14 04:25:11,492 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1591944.6666666667, ans=0.125 2023-10-14 04:25:14,705 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1591944.6666666667, ans=0.0 2023-10-14 04:25:26,175 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1591991.3333333333, ans=0.2 2023-10-14 04:25:26,381 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.22 vs. limit=22.5 2023-10-14 04:25:28,942 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1591991.3333333333, ans=0.125 2023-10-14 04:25:49,864 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.00 vs. limit=22.5 2023-10-14 04:25:55,822 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1592131.3333333333, ans=0.1 2023-10-14 04:25:57,796 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1592131.3333333333, ans=0.0 2023-10-14 04:26:07,728 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1592178.0, ans=0.125 2023-10-14 04:26:07,766 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1592178.0, ans=0.125 2023-10-14 04:26:08,297 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.542e+02 1.787e+02 2.004e+02 2.276e+02 3.156e+02, threshold=4.008e+02, percent-clipped=0.0 2023-10-14 04:26:18,899 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1592224.6666666667, ans=0.1 2023-10-14 04:26:24,350 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.45 vs. limit=6.0 2023-10-14 04:26:32,536 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1592271.3333333333, ans=0.5 2023-10-14 04:26:42,971 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1592318.0, ans=0.035 2023-10-14 04:26:47,137 INFO [train.py:1031] (0/4) Epoch 25, batch 13500, loss[loss=0.1822, simple_loss=0.2761, pruned_loss=0.04418, over 16664.00 frames. ], tot_loss[loss=0.186, simple_loss=0.2786, pruned_loss=0.0467, over 32792749.97 frames. ], batch size: 66, lr: 1.36e-03, grad_scale: 16.0 2023-10-14 04:26:50,133 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1592364.6666666667, ans=0.125 2023-10-14 04:27:39,212 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.19 vs. 
limit=15.0 2023-10-14 04:27:43,587 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1592598.0, ans=0.95 2023-10-14 04:27:44,619 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1592598.0, ans=0.0 2023-10-14 04:27:47,904 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1592598.0, ans=0.0 2023-10-14 04:27:53,553 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1592644.6666666667, ans=0.125 2023-10-14 04:27:57,625 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.784e+02 1.917e+02 2.135e+02 2.793e+02, threshold=3.834e+02, percent-clipped=0.0 2023-10-14 04:28:08,843 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1592691.3333333333, ans=0.125 2023-10-14 04:28:09,686 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1592691.3333333333, ans=0.125 2023-10-14 04:28:11,669 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=4.43 vs. limit=15.0 2023-10-14 04:28:15,138 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1592738.0, ans=0.1 2023-10-14 04:28:15,268 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.18 vs. limit=6.0 2023-10-14 04:28:21,602 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.56 vs. limit=22.5 2023-10-14 04:28:25,639 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1592784.6666666667, ans=0.0 2023-10-14 04:28:41,924 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-14 04:28:41,991 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1592831.3333333333, ans=0.04949747468305833 2023-10-14 04:29:00,967 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1592924.6666666667, ans=0.2 2023-10-14 04:29:03,960 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.50 vs. limit=12.0 2023-10-14 04:29:23,593 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1593064.6666666667, ans=0.125 2023-10-14 04:29:23,597 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1593064.6666666667, ans=0.125 2023-10-14 04:29:23,660 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1593064.6666666667, ans=0.2 2023-10-14 04:29:23,863 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.31 vs. 
limit=15.0 2023-10-14 04:29:29,521 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/epoch-25.pt 2023-10-14 04:30:01,011 INFO [train.py:1031] (0/4) Epoch 26, batch 0, loss[loss=0.1537, simple_loss=0.2473, pruned_loss=0.03008, over 16923.00 frames. ], tot_loss[loss=0.1537, simple_loss=0.2473, pruned_loss=0.03008, over 16923.00 frames. ], batch size: 138, lr: 1.33e-03, grad_scale: 32.0 2023-10-14 04:30:01,012 INFO [train.py:1054] (0/4) Computing validation loss 2023-10-14 04:30:09,220 INFO [train.py:1063] (0/4) Epoch 26, validation: loss=0.2137, simple_loss=0.3003, pruned_loss=0.06359, over 1020973.00 frames. 2023-10-14 04:30:09,220 INFO [train.py:1064] (0/4) Maximum memory allocated so far is 17165MB 2023-10-14 04:30:12,939 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1593088.0, ans=0.0 2023-10-14 04:30:18,633 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1593088.0, ans=0.0 2023-10-14 04:30:19,390 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.524e+02 1.775e+02 1.928e+02 2.228e+02 3.655e+02, threshold=3.856e+02, percent-clipped=0.0 2023-10-14 04:30:21,575 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.63 vs. limit=12.0 2023-10-14 04:30:23,358 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1593134.6666666667, ans=0.1 2023-10-14 04:30:27,263 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1593134.6666666667, ans=0.1 2023-10-14 04:30:30,605 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1593181.3333333333, ans=0.2 2023-10-14 04:30:55,698 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1593274.6666666667, ans=0.1 2023-10-14 04:31:12,931 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1593321.3333333333, ans=0.05 2023-10-14 04:31:15,716 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1593321.3333333333, ans=0.125 2023-10-14 04:31:15,733 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1593321.3333333333, ans=0.035 2023-10-14 04:31:19,835 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1593368.0, ans=0.0 2023-10-14 04:31:39,400 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1593461.3333333333, ans=0.1 2023-10-14 04:31:58,425 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1593508.0, ans=0.0 2023-10-14 04:32:12,299 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.719e+02 1.848e+02 2.028e+02 2.741e+02, threshold=3.697e+02, percent-clipped=0.0 2023-10-14 04:32:23,395 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1593648.0, ans=0.2 2023-10-14 04:32:28,523 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1593648.0, ans=0.125 2023-10-14 04:32:39,997 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1593694.6666666667, ans=0.125 2023-10-14 04:33:03,536 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1593788.0, ans=0.0 2023-10-14 04:33:20,889 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1593881.3333333333, ans=0.0 2023-10-14 04:33:26,562 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1593881.3333333333, ans=0.0 2023-10-14 04:33:30,207 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1593928.0, ans=0.125 2023-10-14 04:33:59,165 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.539e+02 1.796e+02 1.982e+02 2.152e+02 2.667e+02, threshold=3.963e+02, percent-clipped=0.0 2023-10-14 04:34:21,409 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1594161.3333333333, ans=0.125 2023-10-14 04:34:21,675 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.00 vs. limit=15.0 2023-10-14 04:34:32,948 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.20 vs. limit=12.0 2023-10-14 04:34:48,129 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1594254.6666666667, ans=0.1 2023-10-14 04:35:13,023 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1594348.0, ans=0.125 2023-10-14 04:35:19,065 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1594394.6666666667, ans=0.1 2023-10-14 04:35:47,557 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1594488.0, ans=0.0 2023-10-14 04:35:51,286 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.548e+02 1.741e+02 1.949e+02 2.192e+02 3.714e+02, threshold=3.899e+02, percent-clipped=0.0 2023-10-14 04:35:56,232 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.56 vs. limit=12.0 2023-10-14 04:36:11,057 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.86 vs. limit=22.5 2023-10-14 04:36:14,600 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.64 vs. 
limit=15.0 2023-10-14 04:36:25,595 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1594674.6666666667, ans=0.125 2023-10-14 04:36:27,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1594674.6666666667, ans=0.1 2023-10-14 04:36:47,404 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1594768.0, ans=0.125 2023-10-14 04:36:53,611 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1594768.0, ans=0.125 2023-10-14 04:36:58,875 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=1594814.6666666667, ans=0.2 2023-10-14 04:37:01,976 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1594814.6666666667, ans=0.1 2023-10-14 04:37:23,857 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1594908.0, ans=0.0 2023-10-14 04:37:28,947 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1594954.6666666667, ans=0.125 2023-10-14 04:37:41,116 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.535e+02 1.775e+02 1.950e+02 2.231e+02 3.484e+02, threshold=3.900e+02, percent-clipped=0.0 2023-10-14 04:37:49,487 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1595048.0, ans=0.1 2023-10-14 04:37:59,776 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1595094.6666666667, ans=0.125 2023-10-14 04:38:24,634 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1595188.0, ans=0.125 2023-10-14 04:38:38,507 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.75 vs. limit=15.0 2023-10-14 04:38:40,113 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1595234.6666666667, ans=0.125 2023-10-14 04:39:06,377 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.85 vs. limit=10.0 2023-10-14 04:39:16,197 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1595374.6666666667, ans=0.2 2023-10-14 04:39:22,905 INFO [train.py:1031] (0/4) Epoch 26, batch 500, loss[loss=0.1785, simple_loss=0.2756, pruned_loss=0.04067, over 16957.00 frames. ], tot_loss[loss=0.1864, simple_loss=0.2788, pruned_loss=0.04703, over 7291211.75 frames. ], batch size: 82, lr: 1.33e-03, grad_scale: 16.0 2023-10-14 04:39:26,196 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.86 vs. 
limit=6.0 2023-10-14 04:39:34,248 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.571e+02 1.792e+02 1.991e+02 2.238e+02 3.146e+02, threshold=3.983e+02, percent-clipped=0.0 2023-10-14 04:39:37,051 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1595468.0, ans=0.0 2023-10-14 04:39:42,298 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1595468.0, ans=0.1 2023-10-14 04:40:06,637 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.70 vs. limit=6.0 2023-10-14 04:41:04,263 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.25 vs. limit=15.0 2023-10-14 04:41:06,238 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.00 vs. limit=22.5 2023-10-14 04:41:10,588 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1595841.3333333333, ans=0.125 2023-10-14 04:41:18,148 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.28 vs. limit=22.5 2023-10-14 04:41:24,889 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.547e+02 1.890e+02 2.140e+02 2.348e+02 3.064e+02, threshold=4.280e+02, percent-clipped=0.0 2023-10-14 04:41:34,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1595981.3333333333, ans=0.125 2023-10-14 04:41:41,546 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=1595981.3333333333, ans=0.05 2023-10-14 04:41:47,761 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1596028.0, ans=0.07 2023-10-14 04:42:03,263 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.14 vs. 
limit=22.5 2023-10-14 04:42:22,949 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1596168.0, ans=0.2 2023-10-14 04:42:24,003 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1596168.0, ans=0.035 2023-10-14 04:42:25,176 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1596168.0, ans=0.125 2023-10-14 04:42:47,091 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1596261.3333333333, ans=0.125 2023-10-14 04:42:56,868 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1596308.0, ans=0.1 2023-10-14 04:43:00,449 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1596354.6666666667, ans=0.0 2023-10-14 04:43:03,193 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1596354.6666666667, ans=0.1 2023-10-14 04:43:07,500 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1596354.6666666667, ans=0.125 2023-10-14 04:43:12,217 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 1.819e+02 1.986e+02 2.249e+02 2.926e+02, threshold=3.973e+02, percent-clipped=0.0 2023-10-14 04:43:22,116 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 04:43:48,161 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1596541.3333333333, ans=0.125 2023-10-14 04:43:49,613 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1596541.3333333333, ans=0.0 2023-10-14 04:43:51,309 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.09 vs. limit=15.0 2023-10-14 04:43:57,637 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1596588.0, ans=0.04949747468305833 2023-10-14 04:44:02,601 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1596588.0, ans=0.0 2023-10-14 04:44:10,190 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1596634.6666666667, ans=0.125 2023-10-14 04:44:10,227 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1596634.6666666667, ans=0.0 2023-10-14 04:44:32,622 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.95 vs. 
limit=15.0 2023-10-14 04:44:33,352 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1596728.0, ans=0.125 2023-10-14 04:44:35,381 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1596728.0, ans=0.2 2023-10-14 04:44:48,268 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1596774.6666666667, ans=0.125 2023-10-14 04:44:58,999 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1596821.3333333333, ans=0.125 2023-10-14 04:45:04,107 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.762e+02 1.887e+02 2.105e+02 3.657e+02, threshold=3.774e+02, percent-clipped=0.0 2023-10-14 04:45:10,706 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1596868.0, ans=0.1 2023-10-14 04:45:37,332 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1596961.3333333333, ans=0.0 2023-10-14 04:45:50,247 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1597008.0, ans=0.0 2023-10-14 04:45:56,778 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.60 vs. limit=15.0 2023-10-14 04:46:03,880 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 04:46:18,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1597148.0, ans=0.0 2023-10-14 04:46:36,351 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1597241.3333333333, ans=0.1 2023-10-14 04:46:37,442 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1597241.3333333333, ans=0.1 2023-10-14 04:46:43,903 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.37 vs. limit=22.5 2023-10-14 04:46:53,949 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1597288.0, ans=0.1 2023-10-14 04:46:58,956 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.571e+02 1.844e+02 1.985e+02 2.266e+02 3.163e+02, threshold=3.969e+02, percent-clipped=0.0 2023-10-14 04:47:04,125 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1597334.6666666667, ans=0.1 2023-10-14 04:47:05,471 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.34 vs. 
limit=22.5 2023-10-14 04:47:18,354 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1597428.0, ans=0.0 2023-10-14 04:47:25,833 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1597428.0, ans=0.0 2023-10-14 04:47:32,468 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1597474.6666666667, ans=0.0 2023-10-14 04:47:36,889 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.37 vs. limit=10.0 2023-10-14 04:47:41,955 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1597474.6666666667, ans=0.09899494936611666 2023-10-14 04:47:50,508 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.41 vs. limit=15.0 2023-10-14 04:47:52,080 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1597521.3333333333, ans=0.125 2023-10-14 04:48:01,222 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 04:48:07,443 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1597614.6666666667, ans=0.0 2023-10-14 04:48:07,879 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.12 vs. limit=10.0 2023-10-14 04:48:30,707 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.95 vs. limit=15.0 2023-10-14 04:48:38,388 INFO [train.py:1031] (0/4) Epoch 26, batch 1000, loss[loss=0.1865, simple_loss=0.2833, pruned_loss=0.04481, over 16852.00 frames. ], tot_loss[loss=0.1872, simple_loss=0.2794, pruned_loss=0.04748, over 12928203.06 frames. ], batch size: 146, lr: 1.33e-03, grad_scale: 32.0 2023-10-14 04:48:51,031 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.459e+02 1.721e+02 1.930e+02 2.104e+02 3.145e+02, threshold=3.860e+02, percent-clipped=0.0 2023-10-14 04:48:59,941 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.19 vs. 
limit=6.0 2023-10-14 04:49:06,094 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1597848.0, ans=0.125 2023-10-14 04:49:09,604 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1597894.6666666667, ans=0.1 2023-10-14 04:49:15,236 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1597894.6666666667, ans=0.05 2023-10-14 04:49:18,294 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1597894.6666666667, ans=0.125 2023-10-14 04:49:37,935 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1597988.0, ans=0.0 2023-10-14 04:49:50,858 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1598081.3333333333, ans=0.125 2023-10-14 04:49:55,674 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1598081.3333333333, ans=0.125 2023-10-14 04:49:57,018 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=3.50 vs. limit=15.0 2023-10-14 04:50:16,103 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1598174.6666666667, ans=0.0 2023-10-14 04:50:18,265 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.01 vs. limit=22.5 2023-10-14 04:50:22,346 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1598221.3333333333, ans=0.0 2023-10-14 04:50:24,253 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1598221.3333333333, ans=0.1 2023-10-14 04:50:37,098 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.535e+02 1.797e+02 1.957e+02 2.116e+02 2.863e+02, threshold=3.915e+02, percent-clipped=0.0 2023-10-14 04:50:41,880 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.45 vs. limit=15.0 2023-10-14 04:50:43,780 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.39 vs. limit=6.0 2023-10-14 04:50:50,994 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1598314.6666666667, ans=0.0 2023-10-14 04:50:52,343 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.29 vs. 
limit=15.0 2023-10-14 04:50:55,842 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1598314.6666666667, ans=0.125 2023-10-14 04:51:06,164 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1598361.3333333333, ans=0.125 2023-10-14 04:51:07,949 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1598361.3333333333, ans=0.0 2023-10-14 04:51:10,456 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1598408.0, ans=0.125 2023-10-14 04:51:13,981 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1598408.0, ans=0.125 2023-10-14 04:51:16,033 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1598408.0, ans=0.125 2023-10-14 04:51:16,849 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1598408.0, ans=0.0 2023-10-14 04:51:23,792 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1598454.6666666667, ans=0.125 2023-10-14 04:52:13,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1598641.3333333333, ans=0.125 2023-10-14 04:52:23,161 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1598688.0, ans=0.0 2023-10-14 04:52:29,359 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1598688.0, ans=0.0 2023-10-14 04:52:34,801 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.529e+02 1.745e+02 1.887e+02 2.132e+02 3.249e+02, threshold=3.774e+02, percent-clipped=0.0 2023-10-14 04:52:45,860 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1598781.3333333333, ans=0.0 2023-10-14 04:52:53,448 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.89 vs. limit=15.0 2023-10-14 04:52:55,064 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1598828.0, ans=0.0 2023-10-14 04:53:15,251 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1598874.6666666667, ans=0.0 2023-10-14 04:53:19,355 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.60 vs. 
limit=15.0 2023-10-14 04:53:30,849 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1598968.0, ans=0.07 2023-10-14 04:53:46,112 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1599014.6666666667, ans=0.1 2023-10-14 04:53:48,795 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1599014.6666666667, ans=0.125 2023-10-14 04:54:25,066 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.484e+02 1.748e+02 1.911e+02 2.112e+02 3.133e+02, threshold=3.822e+02, percent-clipped=0.0 2023-10-14 04:54:40,799 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.whiten.whitening_limit, batch_count=1599248.0, ans=15.0 2023-10-14 04:54:41,374 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1599248.0, ans=0.125 2023-10-14 04:54:41,707 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.96 vs. limit=15.0 2023-10-14 04:54:49,964 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1599294.6666666667, ans=0.0 2023-10-14 04:55:09,033 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1599388.0, ans=0.0 2023-10-14 04:55:16,016 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1599434.6666666667, ans=0.2 2023-10-14 04:55:19,171 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1599434.6666666667, ans=0.0 2023-10-14 04:55:35,768 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1599481.3333333333, ans=0.125 2023-10-14 04:55:45,567 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1599528.0, ans=0.2 2023-10-14 04:55:49,305 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1599574.6666666667, ans=0.125 2023-10-14 04:55:53,089 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.80 vs. 
limit=22.5 2023-10-14 04:55:54,532 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1599574.6666666667, ans=0.0 2023-10-14 04:56:11,900 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1599621.3333333333, ans=0.0 2023-10-14 04:56:16,987 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.533e+02 1.724e+02 1.860e+02 2.086e+02 3.127e+02, threshold=3.719e+02, percent-clipped=0.0 2023-10-14 04:56:19,055 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1599668.0, ans=0.0 2023-10-14 04:56:26,586 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 04:56:30,384 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1599714.6666666667, ans=0.1 2023-10-14 04:56:38,905 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1599761.3333333333, ans=0.2 2023-10-14 04:56:41,478 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1599761.3333333333, ans=0.125 2023-10-14 04:56:41,696 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.63 vs. limit=15.0 2023-10-14 04:56:48,360 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1599808.0, ans=0.0 2023-10-14 04:56:48,493 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1599808.0, ans=0.0 2023-10-14 04:57:00,944 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.76 vs. limit=15.0 2023-10-14 04:57:32,824 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.97 vs. limit=15.0 2023-10-14 04:57:35,553 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1599994.6666666667, ans=0.125 2023-10-14 04:57:39,404 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 04:57:57,945 INFO [train.py:1031] (0/4) Epoch 26, batch 1500, loss[loss=0.1881, simple_loss=0.2756, pruned_loss=0.05026, over 16896.00 frames. ], tot_loss[loss=0.186, simple_loss=0.2784, pruned_loss=0.04683, over 17351975.98 frames. 
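The periodic optim.py lines report grad-norm quartiles (min, 25%, median, 75%, max) together with a clipping threshold. In every such entry in this stretch of the log, the threshold equals Clipping_scale times the median quartile, e.g. 2.0 * 1.874e+02 ≈ 3.749e+02 in the entry above. A minimal sketch of that relation, assuming the threshold tracks the median of recently observed gradient norms (the function name is illustrative; the actual optimizer keeps a running history of norms):

    def clipping_threshold(median_grad_norm: float,
                           clipping_scale: float = 2.0) -> float:
        # Threshold above which gradients would be clipped; assumed to track
        # the median of recent gradient norms scaled by clipping_scale.
        return clipping_scale * median_grad_norm

    # Spot-check against the quartile entry above:
    # quartiles 1.438e+02 1.759e+02 1.874e+02 2.075e+02 2.983e+02, threshold=3.749e+02
    assert abs(clipping_threshold(1.874e2) - 3.749e2) < 1.0

The accompanying percent-clipped=0.0 throughout this stretch suggests no recent batch actually exceeded the threshold.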
], batch size: 110, lr: 1.33e-03, grad_scale: 16.0 2023-10-14 04:57:59,084 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1600088.0, ans=0.125 2023-10-14 04:58:00,878 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1600088.0, ans=0.2 2023-10-14 04:58:07,834 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1600088.0, ans=0.125 2023-10-14 04:58:10,398 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1600134.6666666667, ans=0.2 2023-10-14 04:58:13,477 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.526e+02 1.839e+02 1.976e+02 2.250e+02 2.759e+02, threshold=3.952e+02, percent-clipped=0.0 2023-10-14 04:58:19,380 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.61 vs. limit=15.0 2023-10-14 04:58:36,069 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.27 vs. limit=22.5 2023-10-14 04:58:42,644 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1600274.6666666667, ans=0.125 2023-10-14 04:58:51,588 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.22 vs. limit=12.0 2023-10-14 04:59:02,822 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1600321.3333333333, ans=0.0 2023-10-14 04:59:09,758 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1600368.0, ans=0.125 2023-10-14 04:59:13,900 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.30 vs. limit=15.0 2023-10-14 04:59:20,040 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1600414.6666666667, ans=0.2 2023-10-14 04:59:50,258 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.31 vs. limit=15.0 2023-10-14 04:59:52,394 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.77 vs. limit=12.0 2023-10-14 04:59:55,674 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.30 vs. 
limit=6.0 2023-10-14 04:59:56,366 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1600554.6666666667, ans=0.0 2023-10-14 05:00:04,950 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.438e+02 1.759e+02 1.874e+02 2.075e+02 2.983e+02, threshold=3.749e+02, percent-clipped=0.0 2023-10-14 05:00:31,838 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1600694.6666666667, ans=0.125 2023-10-14 05:00:55,825 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.83 vs. limit=12.0 2023-10-14 05:01:11,803 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1600834.6666666667, ans=0.0 2023-10-14 05:01:34,772 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1600928.0, ans=0.1 2023-10-14 05:01:50,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1601021.3333333333, ans=0.125 2023-10-14 05:02:01,538 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.86 vs. limit=12.0 2023-10-14 05:02:02,516 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.499e+02 1.805e+02 1.959e+02 2.213e+02 2.892e+02, threshold=3.918e+02, percent-clipped=0.0 2023-10-14 05:02:24,538 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1601161.3333333333, ans=0.0 2023-10-14 05:02:26,301 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1601161.3333333333, ans=0.0 2023-10-14 05:02:42,301 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1601254.6666666667, ans=0.0 2023-10-14 05:02:54,401 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.56 vs. 
limit=22.5 2023-10-14 05:02:59,846 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1601301.3333333333, ans=0.125 2023-10-14 05:03:09,581 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=1601348.0, ans=0.025 2023-10-14 05:03:25,446 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1601394.6666666667, ans=0.0 2023-10-14 05:03:36,487 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1601441.3333333333, ans=0.125 2023-10-14 05:03:40,375 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1601441.3333333333, ans=0.125 2023-10-14 05:03:56,685 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1601534.6666666667, ans=0.2 2023-10-14 05:04:00,531 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1601534.6666666667, ans=0.0 2023-10-14 05:04:01,002 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.554e+02 1.809e+02 1.938e+02 2.187e+02 3.176e+02, threshold=3.876e+02, percent-clipped=0.0 2023-10-14 05:04:01,332 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1601534.6666666667, ans=0.0 2023-10-14 05:04:48,688 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1601721.3333333333, ans=0.2 2023-10-14 05:04:49,894 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1601721.3333333333, ans=0.025 2023-10-14 05:04:58,066 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1601768.0, ans=0.0 2023-10-14 05:05:07,614 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1601768.0, ans=0.09899494936611666 2023-10-14 05:05:11,702 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1601814.6666666667, ans=0.1 2023-10-14 05:05:22,273 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1601861.3333333333, ans=0.1 2023-10-14 05:05:42,104 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1601954.6666666667, ans=0.1 2023-10-14 05:05:48,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1601954.6666666667, ans=0.0 2023-10-14 05:05:58,289 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.513e+02 1.829e+02 2.068e+02 2.266e+02 3.103e+02, threshold=4.136e+02, percent-clipped=0.0 2023-10-14 05:06:03,688 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1602001.3333333333, ans=0.1 2023-10-14 05:06:08,002 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1602048.0, ans=0.0 2023-10-14 05:06:20,897 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1602094.6666666667, ans=0.2 2023-10-14 05:06:21,377 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.64 vs. limit=22.5 2023-10-14 05:06:33,201 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1602141.3333333333, ans=0.125 2023-10-14 05:07:01,848 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1602234.6666666667, ans=0.125 2023-10-14 05:07:49,997 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1602374.6666666667, ans=0.2 2023-10-14 05:07:54,744 INFO [train.py:1031] (0/4) Epoch 26, batch 2000, loss[loss=0.2047, simple_loss=0.2703, pruned_loss=0.06954, over 12794.00 frames. ], tot_loss[loss=0.1865, simple_loss=0.279, pruned_loss=0.04697, over 20774820.64 frames. ], batch size: 440, lr: 1.33e-03, grad_scale: 32.0 2023-10-14 05:08:01,489 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1602421.3333333333, ans=0.125 2023-10-14 05:08:04,362 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 05:08:10,951 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.499e+02 1.760e+02 1.951e+02 2.176e+02 3.927e+02, threshold=3.901e+02, percent-clipped=0.0 2023-10-14 05:08:23,866 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1602514.6666666667, ans=0.125 2023-10-14 05:08:23,911 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1602514.6666666667, ans=0.125 2023-10-14 05:08:37,256 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1602561.3333333333, ans=0.0 2023-10-14 05:08:40,409 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.38 vs. limit=15.0 2023-10-14 05:08:57,875 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1602608.0, ans=0.0 2023-10-14 05:09:39,528 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1602794.6666666667, ans=0.2 2023-10-14 05:09:59,880 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1602841.3333333333, ans=0.0 2023-10-14 05:10:02,063 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.49 vs. limit=15.0 2023-10-14 05:10:03,354 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1602841.3333333333, ans=0.125 2023-10-14 05:10:07,047 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.21 vs. 
limit=15.0 2023-10-14 05:10:31,262 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1602934.6666666667, ans=0.2 2023-10-14 05:10:34,344 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.768e+02 1.956e+02 2.229e+02 3.228e+02, threshold=3.913e+02, percent-clipped=0.0 2023-10-14 05:10:47,420 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.87 vs. limit=10.0 2023-10-14 05:10:57,636 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1602981.3333333333, ans=0.125 2023-10-14 05:11:09,312 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.79 vs. limit=15.0 2023-10-14 05:11:27,647 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1603074.6666666667, ans=0.09899494936611666 2023-10-14 05:11:43,350 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1603121.3333333333, ans=0.0 2023-10-14 05:12:07,141 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1603214.6666666667, ans=15.0 2023-10-14 05:12:17,376 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1603261.3333333333, ans=0.1 2023-10-14 05:12:19,401 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1603261.3333333333, ans=0.0 2023-10-14 05:12:26,588 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1603308.0, ans=0.125 2023-10-14 05:12:36,373 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1603354.6666666667, ans=0.0 2023-10-14 05:12:45,077 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1603354.6666666667, ans=0.125 2023-10-14 05:12:50,911 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.848e+02 2.000e+02 2.217e+02 3.263e+02, threshold=4.000e+02, percent-clipped=0.0 2023-10-14 05:12:51,221 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1603401.3333333333, ans=0.125 2023-10-14 05:13:17,931 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1603494.6666666667, ans=0.2 2023-10-14 05:13:19,196 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.28 vs. 
limit=15.0 2023-10-14 05:13:24,915 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1603541.3333333333, ans=0.04949747468305833 2023-10-14 05:13:25,071 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1603541.3333333333, ans=0.2 2023-10-14 05:13:40,718 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1603588.0, ans=0.0 2023-10-14 05:13:53,469 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1603634.6666666667, ans=0.125 2023-10-14 05:13:55,053 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1603634.6666666667, ans=0.125 2023-10-14 05:14:10,683 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1603728.0, ans=0.125 2023-10-14 05:14:14,844 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.23 vs. limit=6.0 2023-10-14 05:14:42,095 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1603821.3333333333, ans=0.2 2023-10-14 05:14:47,920 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.852e+02 1.963e+02 2.145e+02 2.907e+02, threshold=3.926e+02, percent-clipped=0.0 2023-10-14 05:15:31,790 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.89 vs. limit=6.0 2023-10-14 05:15:36,400 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1604054.6666666667, ans=0.125 2023-10-14 05:15:40,151 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1604101.3333333333, ans=0.0 2023-10-14 05:16:08,120 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.12 vs. limit=15.0 2023-10-14 05:16:40,195 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.800e+02 1.946e+02 2.111e+02 3.235e+02, threshold=3.892e+02, percent-clipped=0.0 2023-10-14 05:16:41,586 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1604334.6666666667, ans=0.125 2023-10-14 05:16:51,603 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 05:17:14,661 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=4.54 vs. 
limit=15.0 2023-10-14 05:17:37,315 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1604568.0, ans=0.2 2023-10-14 05:17:44,573 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1604614.6666666667, ans=0.0 2023-10-14 05:17:44,620 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1604614.6666666667, ans=0.125 2023-10-14 05:17:58,919 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1604661.3333333333, ans=0.125 2023-10-14 05:18:05,169 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.59 vs. limit=22.5 2023-10-14 05:18:08,297 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1604708.0, ans=0.125 2023-10-14 05:18:15,765 INFO [train.py:1031] (0/4) Epoch 26, batch 2500, loss[loss=0.1895, simple_loss=0.2721, pruned_loss=0.05348, over 16561.00 frames. ], tot_loss[loss=0.1865, simple_loss=0.2789, pruned_loss=0.04703, over 23462249.51 frames. ], batch size: 66, lr: 1.33e-03, grad_scale: 32.0 2023-10-14 05:18:25,659 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1604801.3333333333, ans=0.125 2023-10-14 05:18:29,929 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.550e+02 1.779e+02 1.962e+02 2.164e+02 2.703e+02, threshold=3.923e+02, percent-clipped=0.0 2023-10-14 05:18:30,236 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1604801.3333333333, ans=0.0 2023-10-14 05:18:37,131 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1604848.0, ans=0.0 2023-10-14 05:18:55,612 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1604894.6666666667, ans=0.04949747468305833 2023-10-14 05:19:01,266 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.18 vs. limit=15.0 2023-10-14 05:19:25,698 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=1605034.6666666667, ans=0.5 2023-10-14 05:19:29,635 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1605034.6666666667, ans=0.1 2023-10-14 05:19:31,710 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=9.86 vs. limit=22.5 2023-10-14 05:19:35,353 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1605081.3333333333, ans=0.1 2023-10-14 05:20:15,893 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1605268.0, ans=0.1 2023-10-14 05:20:17,289 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.71 vs. 
limit=15.0 2023-10-14 05:20:18,435 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1605268.0, ans=0.125 2023-10-14 05:20:21,270 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.857e+02 2.016e+02 2.266e+02 3.448e+02, threshold=4.031e+02, percent-clipped=0.0 2023-10-14 05:20:32,414 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-344000.pt 2023-10-14 05:20:37,201 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1605314.6666666667, ans=0.125 2023-10-14 05:20:40,939 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1605314.6666666667, ans=0.1 2023-10-14 05:20:42,385 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1605314.6666666667, ans=0.0 2023-10-14 05:20:46,259 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1605361.3333333333, ans=0.04949747468305833 2023-10-14 05:21:04,521 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.20 vs. limit=12.0 2023-10-14 05:21:09,144 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1605454.6666666667, ans=0.125 2023-10-14 05:21:35,541 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1605548.0, ans=0.125 2023-10-14 05:21:46,503 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.13 vs. 
limit=15.0 2023-10-14 05:21:55,457 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1605641.3333333333, ans=0.1 2023-10-14 05:22:19,659 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1605734.6666666667, ans=0.0 2023-10-14 05:22:22,559 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1605734.6666666667, ans=10.0 2023-10-14 05:22:23,247 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.518e+02 1.803e+02 1.932e+02 2.177e+02 3.401e+02, threshold=3.865e+02, percent-clipped=0.0 2023-10-14 05:22:23,703 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1605734.6666666667, ans=0.1 2023-10-14 05:22:25,794 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1605734.6666666667, ans=0.125 2023-10-14 05:23:27,089 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1605968.0, ans=0.07 2023-10-14 05:24:01,201 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1606108.0, ans=0.0 2023-10-14 05:24:06,339 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1606108.0, ans=0.125 2023-10-14 05:24:19,695 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.28 vs. limit=22.5 2023-10-14 05:24:26,321 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 1.762e+02 1.984e+02 2.143e+02 2.899e+02, threshold=3.968e+02, percent-clipped=0.0 2023-10-14 05:24:28,659 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1606201.3333333333, ans=0.0 2023-10-14 05:24:31,667 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1606248.0, ans=0.125 2023-10-14 05:24:35,748 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1606248.0, ans=0.2 2023-10-14 05:24:35,889 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1606248.0, ans=0.07 2023-10-14 05:24:48,096 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.23 vs. limit=15.0 2023-10-14 05:24:55,012 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.56 vs. limit=15.0 2023-10-14 05:24:59,540 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1606341.3333333333, ans=0.0 2023-10-14 05:25:12,141 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.83 vs. 
limit=6.0 2023-10-14 05:25:18,001 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=1606388.0, ans=0.05 2023-10-14 05:25:25,873 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1606434.6666666667, ans=0.0 2023-10-14 05:25:29,106 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1606434.6666666667, ans=0.125 2023-10-14 05:25:36,465 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1606434.6666666667, ans=0.125 2023-10-14 05:25:49,760 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1606481.3333333333, ans=0.125 2023-10-14 05:25:52,954 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1606528.0, ans=0.0 2023-10-14 05:26:07,106 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.64 vs. limit=6.0 2023-10-14 05:26:17,737 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.52 vs. limit=15.0 2023-10-14 05:26:29,465 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1606621.3333333333, ans=0.125 2023-10-14 05:26:31,023 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1606621.3333333333, ans=0.0 2023-10-14 05:26:39,664 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.826e+02 1.972e+02 2.150e+02 3.040e+02, threshold=3.944e+02, percent-clipped=0.0 2023-10-14 05:26:42,205 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.76 vs. limit=15.0 2023-10-14 05:26:46,873 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-14 05:26:53,525 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.45 vs. limit=12.0 2023-10-14 05:27:18,441 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1606854.6666666667, ans=0.0 2023-10-14 05:27:30,347 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1606901.3333333333, ans=0.125 2023-10-14 05:27:37,763 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1606901.3333333333, ans=0.1 2023-10-14 05:27:53,535 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=14.93 vs. 
limit=15.0 2023-10-14 05:27:55,909 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1606994.6666666667, ans=0.125 2023-10-14 05:28:01,711 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1607041.3333333333, ans=0.015 2023-10-14 05:28:08,021 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1607041.3333333333, ans=0.0 2023-10-14 05:28:13,049 INFO [train.py:1031] (0/4) Epoch 26, batch 3000, loss[loss=0.1699, simple_loss=0.2689, pruned_loss=0.03547, over 16857.00 frames. ], tot_loss[loss=0.1862, simple_loss=0.2782, pruned_loss=0.04709, over 25517790.21 frames. ], batch size: 93, lr: 1.33e-03, grad_scale: 16.0 2023-10-14 05:28:29,999 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.98 vs. limit=10.0 2023-10-14 05:28:30,254 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 1.809e+02 1.948e+02 2.228e+02 4.084e+02, threshold=3.896e+02, percent-clipped=1.0 2023-10-14 05:28:48,956 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1607228.0, ans=0.125 2023-10-14 05:28:53,815 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1607228.0, ans=0.1 2023-10-14 05:28:54,863 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1607228.0, ans=0.0 2023-10-14 05:29:12,755 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1607321.3333333333, ans=0.1 2023-10-14 05:29:32,067 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1607414.6666666667, ans=0.125 2023-10-14 05:29:32,081 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1607414.6666666667, ans=0.125 2023-10-14 05:29:34,838 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1607414.6666666667, ans=0.1 2023-10-14 05:29:58,500 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1607508.0, ans=0.2 2023-10-14 05:30:00,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1607508.0, ans=0.07 2023-10-14 05:30:23,017 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.85 vs. limit=15.0 2023-10-14 05:30:24,846 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1607601.3333333333, ans=0.1 2023-10-14 05:30:26,349 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.592e+02 1.843e+02 1.966e+02 2.171e+02 2.891e+02, threshold=3.932e+02, percent-clipped=0.0 2023-10-14 05:30:30,782 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.97 vs. 
limit=15.0 2023-10-14 05:30:43,330 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1607694.6666666667, ans=0.2 2023-10-14 05:30:55,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1607741.3333333333, ans=0.125 2023-10-14 05:31:07,078 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1607788.0, ans=0.0 2023-10-14 05:31:08,070 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1607788.0, ans=0.125 2023-10-14 05:31:30,335 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.44 vs. limit=15.0 2023-10-14 05:31:35,525 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1607881.3333333333, ans=0.1 2023-10-14 05:32:19,221 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.543e+02 1.806e+02 1.986e+02 2.217e+02 2.889e+02, threshold=3.971e+02, percent-clipped=0.0 2023-10-14 05:32:32,680 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1608114.6666666667, ans=0.125 2023-10-14 05:32:49,058 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1608161.3333333333, ans=0.1 2023-10-14 05:33:19,601 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1608301.3333333333, ans=0.125 2023-10-14 05:33:42,451 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1608394.6666666667, ans=0.2 2023-10-14 05:33:46,570 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten.whitening_limit, batch_count=1608394.6666666667, ans=15.0 2023-10-14 05:34:05,443 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1608488.0, ans=0.1 2023-10-14 05:34:21,071 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.476e+02 1.802e+02 1.912e+02 2.095e+02 3.783e+02, threshold=3.823e+02, percent-clipped=0.0 2023-10-14 05:34:25,357 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=4.65 vs. limit=15.0 2023-10-14 05:34:32,264 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1608581.3333333333, ans=0.125 2023-10-14 05:34:40,441 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.96 vs. limit=22.5 2023-10-14 05:34:40,574 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.01 vs. 
limit=15.0 2023-10-14 05:34:51,079 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1608674.6666666667, ans=0.125 2023-10-14 05:34:55,182 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.90 vs. limit=15.0 2023-10-14 05:35:01,699 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1608721.3333333333, ans=0.0 2023-10-14 05:35:01,747 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1608721.3333333333, ans=0.09899494936611666 2023-10-14 05:35:02,091 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.11 vs. limit=22.5 2023-10-14 05:35:07,929 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.01 vs. limit=15.0 2023-10-14 05:35:15,339 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1608768.0, ans=0.125 2023-10-14 05:35:22,951 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1608814.6666666667, ans=0.1 2023-10-14 05:35:28,432 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1608814.6666666667, ans=0.1 2023-10-14 05:35:31,846 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1608814.6666666667, ans=0.125 2023-10-14 05:35:50,034 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.52 vs. 
limit=15.0 2023-10-14 05:35:59,102 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1608954.6666666667, ans=0.2 2023-10-14 05:36:16,563 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 1.905e+02 2.066e+02 2.263e+02 3.148e+02, threshold=4.132e+02, percent-clipped=0.0 2023-10-14 05:36:47,077 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1609141.3333333333, ans=0.0 2023-10-14 05:36:56,688 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1609188.0, ans=0.04949747468305833 2023-10-14 05:37:04,319 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1609188.0, ans=0.1 2023-10-14 05:37:06,235 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1609234.6666666667, ans=0.0 2023-10-14 05:37:22,387 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1609281.3333333333, ans=0.0 2023-10-14 05:37:22,448 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1609281.3333333333, ans=0.0 2023-10-14 05:37:24,481 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1609281.3333333333, ans=0.2 2023-10-14 05:37:24,613 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1609281.3333333333, ans=0.1 2023-10-14 05:37:25,437 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1609328.0, ans=0.125 2023-10-14 05:37:26,233 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1609328.0, ans=0.1 2023-10-14 05:37:29,941 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.85 vs. limit=15.0 2023-10-14 05:37:42,919 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.05 vs. limit=15.0 2023-10-14 05:37:50,240 INFO [train.py:1031] (0/4) Epoch 26, batch 3500, loss[loss=0.1961, simple_loss=0.287, pruned_loss=0.05258, over 15927.00 frames. ], tot_loss[loss=0.1861, simple_loss=0.2781, pruned_loss=0.04704, over 27136011.89 frames. 
], batch size: 43, lr: 1.33e-03, grad_scale: 16.0 2023-10-14 05:37:55,068 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1609421.3333333333, ans=0.125 2023-10-14 05:38:08,917 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.596e+02 1.802e+02 1.970e+02 2.146e+02 3.005e+02, threshold=3.940e+02, percent-clipped=0.0 2023-10-14 05:38:13,405 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1609514.6666666667, ans=0.2 2023-10-14 05:38:28,296 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1609561.3333333333, ans=0.125 2023-10-14 05:38:30,308 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1609561.3333333333, ans=0.125 2023-10-14 05:38:39,132 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.55 vs. limit=12.0 2023-10-14 05:38:41,744 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1609608.0, ans=0.125 2023-10-14 05:39:14,751 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1609748.0, ans=0.125 2023-10-14 05:39:23,956 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1609748.0, ans=0.0 2023-10-14 05:39:33,896 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1609794.6666666667, ans=0.1 2023-10-14 05:39:49,964 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1609888.0, ans=0.0 2023-10-14 05:39:57,799 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 05:39:58,916 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1609888.0, ans=0.0 2023-10-14 05:39:59,699 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1609888.0, ans=0.125 2023-10-14 05:40:06,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1609934.6666666667, ans=0.125 2023-10-14 05:40:08,247 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.481e+02 1.826e+02 1.981e+02 2.215e+02 2.880e+02, threshold=3.962e+02, percent-clipped=0.0 2023-10-14 05:40:11,703 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1609981.3333333333, ans=0.025 2023-10-14 05:40:14,811 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1609981.3333333333, ans=0.125 2023-10-14 05:40:20,144 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1609981.3333333333, ans=0.125 2023-10-14 05:40:31,459 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1610028.0, ans=0.1 2023-10-14 05:40:32,609 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1610028.0, ans=0.1 2023-10-14 05:40:55,022 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1610121.3333333333, ans=0.2 2023-10-14 05:41:05,420 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.81 vs. limit=15.0 2023-10-14 05:41:13,805 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.70 vs. limit=22.5 2023-10-14 05:41:14,848 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1610214.6666666667, ans=0.125 2023-10-14 05:41:51,973 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1610354.6666666667, ans=0.2 2023-10-14 05:41:58,273 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1610401.3333333333, ans=0.125 2023-10-14 05:42:01,644 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.480e+02 1.756e+02 1.897e+02 2.113e+02 2.615e+02, threshold=3.794e+02, percent-clipped=0.0 2023-10-14 05:42:24,586 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1610494.6666666667, ans=10.0 2023-10-14 05:42:26,765 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1610494.6666666667, ans=0.0 2023-10-14 05:42:34,039 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.82 vs. limit=22.5 2023-10-14 05:42:37,061 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1610541.3333333333, ans=0.1 2023-10-14 05:42:38,407 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1610541.3333333333, ans=0.0 2023-10-14 05:42:40,024 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1610541.3333333333, ans=0.0 2023-10-14 05:42:42,990 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1610588.0, ans=0.125 2023-10-14 05:43:15,790 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1610681.3333333333, ans=0.125 2023-10-14 05:44:03,140 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.496e+02 1.794e+02 1.991e+02 2.183e+02 3.178e+02, threshold=3.981e+02, percent-clipped=0.0 2023-10-14 05:44:37,891 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1611008.0, ans=0.125 2023-10-14 05:44:44,398 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1611054.6666666667, ans=0.09899494936611666 2023-10-14 05:44:44,773 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.53 vs. 
limit=22.5 2023-10-14 05:44:50,124 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1611054.6666666667, ans=0.2 2023-10-14 05:44:51,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1611101.3333333333, ans=0.0 2023-10-14 05:45:02,734 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=1611148.0, ans=0.025 2023-10-14 05:45:07,019 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1611148.0, ans=0.125 2023-10-14 05:45:19,051 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.22 vs. limit=15.0 2023-10-14 05:45:27,969 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1611241.3333333333, ans=0.125 2023-10-14 05:45:43,856 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1611288.0, ans=0.125 2023-10-14 05:45:53,241 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.518e+02 1.692e+02 1.840e+02 2.064e+02 2.845e+02, threshold=3.680e+02, percent-clipped=0.0 2023-10-14 05:46:03,336 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1611381.3333333333, ans=0.2 2023-10-14 05:46:04,545 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1611381.3333333333, ans=0.2 2023-10-14 05:46:42,564 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1611568.0, ans=0.1 2023-10-14 05:46:49,347 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1611568.0, ans=0.0 2023-10-14 05:47:28,196 INFO [train.py:1031] (0/4) Epoch 26, batch 4000, loss[loss=0.18, simple_loss=0.282, pruned_loss=0.03894, over 16886.00 frames. ], tot_loss[loss=0.1861, simple_loss=0.2779, pruned_loss=0.04717, over 28393180.58 frames. ], batch size: 104, lr: 1.33e-03, grad_scale: 32.0 2023-10-14 05:47:49,502 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.558e+02 1.814e+02 1.963e+02 2.108e+02 3.118e+02, threshold=3.927e+02, percent-clipped=0.0 2023-10-14 05:47:51,043 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1611801.3333333333, ans=0.2 2023-10-14 05:47:57,999 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1611848.0, ans=0.125 2023-10-14 05:48:08,473 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1611894.6666666667, ans=0.0 2023-10-14 05:48:15,056 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.56 vs. 
limit=15.0 2023-10-14 05:48:36,266 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1612034.6666666667, ans=0.1 2023-10-14 05:48:55,393 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1612081.3333333333, ans=0.0 2023-10-14 05:48:56,531 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1612081.3333333333, ans=0.04949747468305833 2023-10-14 05:48:59,376 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1612128.0, ans=0.0 2023-10-14 05:49:21,862 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1612221.3333333333, ans=0.125 2023-10-14 05:49:39,594 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.542e+02 1.852e+02 1.947e+02 2.087e+02 2.679e+02, threshold=3.894e+02, percent-clipped=0.0 2023-10-14 05:50:05,688 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1612408.0, ans=0.125 2023-10-14 05:50:52,061 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1612548.0, ans=0.125 2023-10-14 05:51:19,606 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1612641.3333333333, ans=0.1 2023-10-14 05:51:21,648 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1612641.3333333333, ans=0.125 2023-10-14 05:51:38,318 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.29 vs. limit=12.0 2023-10-14 05:51:40,187 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1612734.6666666667, ans=0.0 2023-10-14 05:51:47,173 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.27 vs. limit=15.0 2023-10-14 05:51:47,519 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 1.774e+02 1.918e+02 2.120e+02 2.945e+02, threshold=3.835e+02, percent-clipped=0.0 2023-10-14 05:52:08,229 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1612828.0, ans=0.125 2023-10-14 05:52:14,378 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.27 vs. limit=12.0 2023-10-14 05:52:36,119 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1612968.0, ans=0.0 2023-10-14 05:52:58,762 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1613061.3333333333, ans=0.125 2023-10-14 05:52:59,234 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.37 vs. 
limit=15.0 2023-10-14 05:53:00,775 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1613061.3333333333, ans=0.0 2023-10-14 05:53:03,056 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1613061.3333333333, ans=0.0 2023-10-14 05:53:09,264 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1613108.0, ans=0.0 2023-10-14 05:53:21,239 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1613154.6666666667, ans=10.0 2023-10-14 05:53:29,744 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1613201.3333333333, ans=0.04949747468305833 2023-10-14 05:53:39,518 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1613201.3333333333, ans=0.125 2023-10-14 05:53:40,061 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.764e+02 1.956e+02 2.128e+02 2.788e+02, threshold=3.912e+02, percent-clipped=0.0 2023-10-14 05:53:40,627 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1613201.3333333333, ans=0.125 2023-10-14 05:53:48,283 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1613248.0, ans=0.0 2023-10-14 05:53:51,125 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1613248.0, ans=0.0 2023-10-14 05:54:00,966 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1613294.6666666667, ans=0.125 2023-10-14 05:54:46,984 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1613481.3333333333, ans=0.2 2023-10-14 05:54:49,593 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1613528.0, ans=0.0 2023-10-14 05:54:54,113 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.16 vs. limit=15.0 2023-10-14 05:54:54,776 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1613528.0, ans=0.0 2023-10-14 05:55:07,604 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1613574.6666666667, ans=0.1 2023-10-14 05:55:15,827 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.min_positive, batch_count=1613621.3333333333, ans=0.025 2023-10-14 05:55:16,829 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1613621.3333333333, ans=0.1 2023-10-14 05:55:17,123 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.73 vs. 
limit=22.5 2023-10-14 05:55:27,899 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 05:55:32,119 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.673e+02 2.022e+02 2.203e+02 2.357e+02 3.298e+02, threshold=4.406e+02, percent-clipped=0.0 2023-10-14 05:55:40,171 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1613714.6666666667, ans=0.125 2023-10-14 05:55:43,559 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1613714.6666666667, ans=0.0 2023-10-14 05:55:43,634 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1613714.6666666667, ans=0.125 2023-10-14 05:55:53,835 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1613761.3333333333, ans=0.125 2023-10-14 05:56:24,767 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.06 vs. limit=15.0 2023-10-14 05:56:26,373 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1613854.6666666667, ans=0.0 2023-10-14 05:56:29,710 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1613901.3333333333, ans=0.125 2023-10-14 05:56:29,965 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.35 vs. limit=15.0 2023-10-14 05:56:33,781 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1613901.3333333333, ans=0.125 2023-10-14 05:56:39,797 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1613948.0, ans=0.04949747468305833 2023-10-14 05:57:14,456 INFO [train.py:1031] (0/4) Epoch 26, batch 4500, loss[loss=0.184, simple_loss=0.2755, pruned_loss=0.04622, over 16622.00 frames. ], tot_loss[loss=0.1862, simple_loss=0.2782, pruned_loss=0.04711, over 29351107.20 frames. 
], batch size: 241, lr: 1.32e-03, grad_scale: 32.0 2023-10-14 05:57:31,265 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1614134.6666666667, ans=0.125 2023-10-14 05:57:34,153 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.790e+02 1.968e+02 2.286e+02 3.128e+02, threshold=3.936e+02, percent-clipped=0.0 2023-10-14 05:58:13,504 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1614321.3333333333, ans=0.035 2023-10-14 05:58:14,707 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 05:58:18,815 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1614368.0, ans=0.125 2023-10-14 05:58:47,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=1614461.3333333333, ans=0.05 2023-10-14 05:58:52,828 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 05:59:10,440 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1614554.6666666667, ans=0.0 2023-10-14 05:59:19,572 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1614601.3333333333, ans=0.125 2023-10-14 05:59:21,750 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.591e+02 1.832e+02 1.946e+02 2.212e+02 3.184e+02, threshold=3.893e+02, percent-clipped=0.0 2023-10-14 05:59:24,048 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1614648.0, ans=0.0 2023-10-14 05:59:40,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1614694.6666666667, ans=0.125 2023-10-14 05:59:41,823 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 05:59:45,457 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.57 vs. 
limit=5.0 2023-10-14 06:00:17,173 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1614881.3333333333, ans=0.125 2023-10-14 06:00:31,268 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1614928.0, ans=0.0 2023-10-14 06:01:10,733 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.564e+02 1.815e+02 1.977e+02 2.204e+02 3.336e+02, threshold=3.954e+02, percent-clipped=0.0 2023-10-14 06:01:16,678 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1615114.6666666667, ans=0.125 2023-10-14 06:01:24,673 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1615161.3333333333, ans=0.125 2023-10-14 06:01:31,966 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1615161.3333333333, ans=0.04949747468305833 2023-10-14 06:01:56,926 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 06:02:02,090 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.03 vs. limit=12.0 2023-10-14 06:02:18,804 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1615394.6666666667, ans=0.0 2023-10-14 06:02:39,706 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1615488.0, ans=0.125 2023-10-14 06:02:46,872 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1615488.0, ans=0.95 2023-10-14 06:03:05,700 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.620e+02 1.814e+02 2.000e+02 2.217e+02 2.949e+02, threshold=4.000e+02, percent-clipped=0.0 2023-10-14 06:03:12,939 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1615581.3333333333, ans=0.125 2023-10-14 06:03:23,044 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1615628.0, ans=0.125 2023-10-14 06:03:30,915 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1615674.6666666667, ans=0.125 2023-10-14 06:03:41,288 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1615674.6666666667, ans=0.125 2023-10-14 06:04:17,749 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1615861.3333333333, ans=0.125 2023-10-14 06:04:30,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1615908.0, ans=0.125 2023-10-14 06:04:59,911 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.574e+02 1.767e+02 1.917e+02 2.168e+02 3.169e+02, threshold=3.834e+02, percent-clipped=0.0 2023-10-14 06:05:23,097 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1616094.6666666667, ans=0.0 2023-10-14 06:05:27,464 INFO [scaling.py:199] (0/4) 
ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1616141.3333333333, ans=0.125 2023-10-14 06:05:49,631 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.65 vs. limit=15.0 2023-10-14 06:06:01,165 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1616281.3333333333, ans=0.125 2023-10-14 06:06:08,526 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1616328.0, ans=0.2 2023-10-14 06:06:14,427 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1616328.0, ans=0.125 2023-10-14 06:06:18,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1616328.0, ans=0.0 2023-10-14 06:06:26,819 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1616374.6666666667, ans=0.0 2023-10-14 06:06:31,392 INFO [train.py:1031] (0/4) Epoch 26, batch 5000, loss[loss=0.1937, simple_loss=0.2874, pruned_loss=0.05001, over 16701.00 frames. ], tot_loss[loss=0.1861, simple_loss=0.2778, pruned_loss=0.04716, over 30102986.81 frames. ], batch size: 202, lr: 1.32e-03, grad_scale: 16.0 2023-10-14 06:06:41,045 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1616421.3333333333, ans=0.125 2023-10-14 06:06:41,448 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.20 vs. limit=22.5 2023-10-14 06:06:42,244 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1616421.3333333333, ans=0.1 2023-10-14 06:06:48,203 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1616468.0, ans=0.1 2023-10-14 06:06:55,881 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.643e+02 1.892e+02 2.053e+02 2.221e+02 2.952e+02, threshold=4.106e+02, percent-clipped=0.0 2023-10-14 06:07:00,401 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1616514.6666666667, ans=0.0 2023-10-14 06:07:11,270 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.81 vs. limit=15.0 2023-10-14 06:07:17,278 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1616608.0, ans=0.1 2023-10-14 06:07:40,452 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.max_positive, batch_count=1616701.3333333333, ans=0.95 2023-10-14 06:07:46,067 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 06:08:34,561 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.49 vs. 
limit=22.5 2023-10-14 06:08:42,678 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1616934.6666666667, ans=0.0 2023-10-14 06:08:42,725 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1616934.6666666667, ans=0.1 2023-10-14 06:08:48,315 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1616981.3333333333, ans=0.125 2023-10-14 06:08:48,601 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.46 vs. limit=15.0 2023-10-14 06:08:49,602 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.799e+02 1.949e+02 2.163e+02 2.934e+02, threshold=3.897e+02, percent-clipped=0.0 2023-10-14 06:08:53,500 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1616981.3333333333, ans=0.125 2023-10-14 06:08:55,200 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.83 vs. limit=15.0 2023-10-14 06:08:56,756 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_abs, batch_count=1616981.3333333333, ans=0.5 2023-10-14 06:09:20,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1617074.6666666667, ans=0.2 2023-10-14 06:10:04,544 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1617261.3333333333, ans=0.0 2023-10-14 06:10:27,548 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1617401.3333333333, ans=0.0 2023-10-14 06:10:38,505 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 1.924e+02 2.155e+02 2.412e+02 3.044e+02, threshold=4.311e+02, percent-clipped=0.0 2023-10-14 06:10:49,077 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1617494.6666666667, ans=0.2 2023-10-14 06:11:10,602 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1617541.3333333333, ans=0.0 2023-10-14 06:11:27,320 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1617588.0, ans=0.0 2023-10-14 06:11:27,396 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1617588.0, ans=0.2 2023-10-14 06:11:43,402 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1617681.3333333333, ans=0.125 2023-10-14 06:11:47,564 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=1617681.3333333333, ans=0.025 2023-10-14 06:11:51,915 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.88 vs. 
limit=22.5 2023-10-14 06:11:55,038 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1617728.0, ans=0.125 2023-10-14 06:11:58,897 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1617728.0, ans=0.125 2023-10-14 06:12:01,404 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1617728.0, ans=0.0 2023-10-14 06:12:01,705 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.89 vs. limit=10.0 2023-10-14 06:12:21,082 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1617821.3333333333, ans=0.125 2023-10-14 06:12:23,954 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.65 vs. limit=15.0 2023-10-14 06:12:25,073 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.35 vs. limit=15.0 2023-10-14 06:12:40,135 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.716e+02 1.866e+02 2.062e+02 3.100e+02, threshold=3.732e+02, percent-clipped=0.0 2023-10-14 06:13:22,691 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 06:13:24,704 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1618054.6666666667, ans=0.1 2023-10-14 06:13:36,567 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.76 vs. limit=22.5 2023-10-14 06:13:40,749 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1618148.0, ans=0.04949747468305833 2023-10-14 06:13:57,942 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1618194.6666666667, ans=0.0 2023-10-14 06:14:06,612 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.42 vs. limit=15.0 2023-10-14 06:14:23,671 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1618334.6666666667, ans=0.2 2023-10-14 06:14:31,668 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.46 vs. limit=22.5 2023-10-14 06:14:32,381 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.55 vs. 
limit=22.5 2023-10-14 06:14:33,909 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.697e+02 1.896e+02 2.092e+02 2.963e+02, threshold=3.792e+02, percent-clipped=0.0 2023-10-14 06:14:39,747 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1618381.3333333333, ans=0.1 2023-10-14 06:14:46,489 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1618428.0, ans=0.0 2023-10-14 06:15:28,223 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1618568.0, ans=0.2 2023-10-14 06:15:31,983 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.67 vs. limit=15.0 2023-10-14 06:15:35,409 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1618614.6666666667, ans=0.125 2023-10-14 06:15:58,008 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1618708.0, ans=0.2 2023-10-14 06:16:00,482 INFO [train.py:1031] (0/4) Epoch 26, batch 5500, loss[loss=0.1991, simple_loss=0.2891, pruned_loss=0.05449, over 15747.00 frames. ], tot_loss[loss=0.186, simple_loss=0.2778, pruned_loss=0.04712, over 30687191.28 frames. ], batch size: 35, lr: 1.32e-03, grad_scale: 32.0 2023-10-14 06:16:05,430 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1618754.6666666667, ans=0.125 2023-10-14 06:16:06,321 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1618754.6666666667, ans=0.0 2023-10-14 06:16:08,755 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1618754.6666666667, ans=0.0 2023-10-14 06:16:12,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1618801.3333333333, ans=0.125 2023-10-14 06:16:16,912 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1618801.3333333333, ans=0.0 2023-10-14 06:16:20,856 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.803e+02 1.924e+02 2.139e+02 2.974e+02, threshold=3.849e+02, percent-clipped=0.0 2023-10-14 06:16:36,398 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1618894.6666666667, ans=0.0 2023-10-14 06:16:59,961 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1618988.0, ans=0.125 2023-10-14 06:17:28,003 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1619128.0, ans=0.125 2023-10-14 06:17:40,525 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1619174.6666666667, ans=0.2 2023-10-14 06:18:08,752 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.599e+02 1.888e+02 2.132e+02 2.477e+02 4.389e+02, threshold=4.263e+02, percent-clipped=2.0 2023-10-14 06:18:16,400 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, 
batch_count=1619314.6666666667, ans=0.0 2023-10-14 06:18:25,482 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1619361.3333333333, ans=0.125 2023-10-14 06:18:36,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1619408.0, ans=0.0 2023-10-14 06:18:49,907 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1619454.6666666667, ans=0.125 2023-10-14 06:18:57,460 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=1619501.3333333333, ans=0.5 2023-10-14 06:19:21,703 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.99 vs. limit=12.0 2023-10-14 06:19:26,413 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=1619594.6666666667, ans=0.95 2023-10-14 06:19:30,540 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 06:19:37,980 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1619641.3333333333, ans=0.1 2023-10-14 06:19:52,583 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1619734.6666666667, ans=0.125 2023-10-14 06:20:04,910 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1619781.3333333333, ans=0.0 2023-10-14 06:20:06,521 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.597e+02 1.896e+02 2.039e+02 2.267e+02 3.096e+02, threshold=4.078e+02, percent-clipped=0.0 2023-10-14 06:20:50,812 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1619968.0, ans=0.125 2023-10-14 06:20:56,952 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1619968.0, ans=0.1 2023-10-14 06:21:00,913 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1620014.6666666667, ans=0.2 2023-10-14 06:21:02,752 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1620014.6666666667, ans=0.125 2023-10-14 06:21:11,020 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1620014.6666666667, ans=0.0 2023-10-14 06:21:12,182 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1620061.3333333333, ans=0.125 2023-10-14 06:21:17,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1620061.3333333333, ans=0.125 2023-10-14 06:21:47,037 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.77 vs. 
limit=22.5 2023-10-14 06:22:03,190 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.553e+02 1.794e+02 1.935e+02 2.097e+02 2.642e+02, threshold=3.870e+02, percent-clipped=0.0 2023-10-14 06:22:16,178 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.56 vs. limit=15.0 2023-10-14 06:22:38,173 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1620388.0, ans=0.125 2023-10-14 06:22:47,358 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1620434.6666666667, ans=0.125 2023-10-14 06:23:13,886 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1620528.0, ans=0.0 2023-10-14 06:23:31,265 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1620621.3333333333, ans=0.125 2023-10-14 06:23:43,588 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1620668.0, ans=0.0 2023-10-14 06:23:53,399 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.579e+02 1.862e+02 2.022e+02 2.235e+02 3.358e+02, threshold=4.044e+02, percent-clipped=0.0 2023-10-14 06:24:04,613 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1620761.3333333333, ans=10.0 2023-10-14 06:24:21,210 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1620808.0, ans=0.125 2023-10-14 06:24:23,092 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1620808.0, ans=0.125 2023-10-14 06:24:31,924 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1620854.6666666667, ans=0.0 2023-10-14 06:24:38,804 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1620901.3333333333, ans=0.1 2023-10-14 06:25:08,677 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1620994.6666666667, ans=0.1 2023-10-14 06:25:23,352 INFO [train.py:1031] (0/4) Epoch 26, batch 6000, loss[loss=0.1751, simple_loss=0.2692, pruned_loss=0.04049, over 16834.00 frames. ], tot_loss[loss=0.1863, simple_loss=0.278, pruned_loss=0.0473, over 31122063.34 frames. 
], batch size: 87, lr: 1.32e-03, grad_scale: 32.0 2023-10-14 06:25:28,389 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1621088.0, ans=0.125 2023-10-14 06:25:31,194 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1621088.0, ans=0.0 2023-10-14 06:25:34,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1621134.6666666667, ans=0.0 2023-10-14 06:25:46,008 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.441e+02 1.875e+02 2.056e+02 2.258e+02 3.497e+02, threshold=4.111e+02, percent-clipped=0.0 2023-10-14 06:25:52,980 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1621181.3333333333, ans=0.1 2023-10-14 06:25:53,268 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.20 vs. limit=10.0 2023-10-14 06:25:58,220 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1621228.0, ans=0.125 2023-10-14 06:26:19,683 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1621321.3333333333, ans=0.5 2023-10-14 06:26:23,337 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1621321.3333333333, ans=0.125 2023-10-14 06:26:24,780 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.72 vs. limit=15.0 2023-10-14 06:26:35,296 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.53 vs. limit=15.0 2023-10-14 06:26:36,443 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1621368.0, ans=0.5 2023-10-14 06:26:41,378 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1621414.6666666667, ans=0.07 2023-10-14 06:27:32,279 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1621601.3333333333, ans=0.125 2023-10-14 06:27:37,000 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.556e+02 1.796e+02 1.924e+02 2.109e+02 2.790e+02, threshold=3.848e+02, percent-clipped=0.0 2023-10-14 06:27:53,996 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1621694.6666666667, ans=0.0 2023-10-14 06:28:26,734 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1621834.6666666667, ans=0.0 2023-10-14 06:28:36,031 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1621881.3333333333, ans=0.2 2023-10-14 06:28:41,312 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.42 vs. 
limit=15.0 2023-10-14 06:28:50,511 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1621928.0, ans=0.125 2023-10-14 06:29:15,296 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1622021.3333333333, ans=0.2 2023-10-14 06:29:16,307 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1622068.0, ans=0.125 2023-10-14 06:29:22,206 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1622068.0, ans=0.07 2023-10-14 06:29:24,127 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1622068.0, ans=0.0 2023-10-14 06:29:31,015 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.878e+02 2.080e+02 2.310e+02 3.226e+02, threshold=4.161e+02, percent-clipped=0.0 2023-10-14 06:29:43,911 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff3.min_abs, batch_count=1622161.3333333333, ans=0.2 2023-10-14 06:29:58,561 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1622208.0, ans=0.125 2023-10-14 06:30:13,822 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1622301.3333333333, ans=0.2 2023-10-14 06:30:19,108 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1622301.3333333333, ans=0.125 2023-10-14 06:30:38,555 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1622394.6666666667, ans=0.1 2023-10-14 06:31:01,617 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1622488.0, ans=0.125 2023-10-14 06:31:10,808 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.63 vs. limit=22.5 2023-10-14 06:31:12,174 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.32 vs. limit=22.5 2023-10-14 06:31:21,768 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1622581.3333333333, ans=0.125 2023-10-14 06:31:25,697 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.557e+02 1.919e+02 2.102e+02 2.327e+02 2.844e+02, threshold=4.204e+02, percent-clipped=0.0 2023-10-14 06:31:38,951 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.76 vs. limit=10.0 2023-10-14 06:31:39,794 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.05 vs. limit=15.0 2023-10-14 06:31:41,510 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1622628.0, ans=0.125 2023-10-14 06:31:52,136 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.44 vs. 
limit=22.5 2023-10-14 06:31:55,431 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.38 vs. limit=15.0 2023-10-14 06:32:04,086 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1622721.3333333333, ans=0.0 2023-10-14 06:32:05,559 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1622721.3333333333, ans=0.1 2023-10-14 06:32:23,937 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1622768.0, ans=0.125 2023-10-14 06:32:34,776 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1622814.6666666667, ans=0.125 2023-10-14 06:32:35,652 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1622814.6666666667, ans=0.07 2023-10-14 06:32:57,637 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.74 vs. limit=22.5 2023-10-14 06:33:11,091 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1623001.3333333333, ans=0.125 2023-10-14 06:33:20,328 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1623001.3333333333, ans=0.2 2023-10-14 06:33:20,363 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1623001.3333333333, ans=10.0 2023-10-14 06:33:24,911 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.749e+02 1.932e+02 2.162e+02 3.367e+02, threshold=3.864e+02, percent-clipped=0.0 2023-10-14 06:33:30,924 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1623048.0, ans=0.125 2023-10-14 06:33:31,928 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1623094.6666666667, ans=0.2 2023-10-14 06:33:32,969 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1623094.6666666667, ans=0.1 2023-10-14 06:33:47,781 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1623141.3333333333, ans=0.0 2023-10-14 06:34:03,714 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1623234.6666666667, ans=0.2 2023-10-14 06:34:11,907 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1623234.6666666667, ans=0.0 2023-10-14 06:34:52,089 INFO [train.py:1031] (0/4) Epoch 26, batch 6500, loss[loss=0.1795, simple_loss=0.2702, pruned_loss=0.04437, over 16484.00 frames. ], tot_loss[loss=0.1865, simple_loss=0.2785, pruned_loss=0.04726, over 31500687.93 frames. 
], batch size: 50, lr: 1.32e-03, grad_scale: 16.0 2023-10-14 06:34:53,740 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1623421.3333333333, ans=0.0 2023-10-14 06:35:00,486 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1623421.3333333333, ans=10.0 2023-10-14 06:35:09,470 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1623468.0, ans=0.125 2023-10-14 06:35:23,114 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.473e+02 1.852e+02 2.047e+02 2.249e+02 2.927e+02, threshold=4.094e+02, percent-clipped=0.0 2023-10-14 06:35:27,063 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1623514.6666666667, ans=0.125 2023-10-14 06:35:42,816 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.65 vs. limit=6.0 2023-10-14 06:36:04,879 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1623654.6666666667, ans=0.125 2023-10-14 06:36:04,901 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1623654.6666666667, ans=0.0 2023-10-14 06:36:08,242 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.86 vs. limit=15.0 2023-10-14 06:36:18,414 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1623701.3333333333, ans=0.0 2023-10-14 06:36:32,422 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1623794.6666666667, ans=0.0 2023-10-14 06:36:37,915 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.53 vs. limit=12.0 2023-10-14 06:36:54,762 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.26 vs. limit=6.0 2023-10-14 06:37:00,341 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=7.99 vs. limit=12.0 2023-10-14 06:37:13,967 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.95 vs. limit=6.0 2023-10-14 06:37:19,969 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1623981.3333333333, ans=0.07 2023-10-14 06:37:22,136 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.595e+02 1.833e+02 2.027e+02 2.266e+02 3.623e+02, threshold=4.054e+02, percent-clipped=0.0 2023-10-14 06:37:52,361 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.76 vs. 
limit=15.0 2023-10-14 06:37:56,797 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1624121.3333333333, ans=0.125 2023-10-14 06:37:59,629 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=7.10 vs. limit=15.0 2023-10-14 06:38:01,648 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1624168.0, ans=0.125 2023-10-14 06:38:07,884 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1624168.0, ans=0.125 2023-10-14 06:38:08,810 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 06:38:11,682 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1624214.6666666667, ans=0.07 2023-10-14 06:38:31,399 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.27 vs. limit=15.0 2023-10-14 06:38:39,633 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1624308.0, ans=0.125 2023-10-14 06:38:50,474 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=11.77 vs. limit=22.5 2023-10-14 06:39:08,211 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.801e+02 1.976e+02 2.163e+02 3.359e+02, threshold=3.953e+02, percent-clipped=0.0 2023-10-14 06:39:24,388 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1624494.6666666667, ans=0.2 2023-10-14 06:39:26,662 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1624541.3333333333, ans=0.125 2023-10-14 06:39:27,605 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1624541.3333333333, ans=0.125 2023-10-14 06:39:49,854 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1624588.0, ans=15.0 2023-10-14 06:40:03,800 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1624681.3333333333, ans=0.0 2023-10-14 06:40:20,707 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.19 vs. limit=6.0 2023-10-14 06:40:30,152 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1624774.6666666667, ans=0.1 2023-10-14 06:40:35,265 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1624774.6666666667, ans=0.1 2023-10-14 06:40:36,864 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1624774.6666666667, ans=0.0 2023-10-14 06:40:47,044 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.71 vs. 
limit=15.0 2023-10-14 06:40:51,526 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1624821.3333333333, ans=0.2 2023-10-14 06:41:18,873 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1624914.6666666667, ans=0.0 2023-10-14 06:41:19,415 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.361e+02 1.814e+02 1.967e+02 2.232e+02 3.554e+02, threshold=3.935e+02, percent-clipped=0.0 2023-10-14 06:41:21,979 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.71 vs. limit=15.0 2023-10-14 06:41:24,646 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1624961.3333333333, ans=0.125 2023-10-14 06:41:29,927 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1624961.3333333333, ans=0.125 2023-10-14 06:41:33,379 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1624961.3333333333, ans=0.125 2023-10-14 06:41:47,656 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1625008.0, ans=0.2 2023-10-14 06:41:47,685 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1625008.0, ans=0.2 2023-10-14 06:41:55,218 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1625054.6666666667, ans=0.2 2023-10-14 06:42:12,078 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.84 vs. limit=6.0 2023-10-14 06:42:16,702 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1625148.0, ans=0.1 2023-10-14 06:42:19,829 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.73 vs. 
limit=15.0 2023-10-14 06:42:20,758 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1625148.0, ans=0.0 2023-10-14 06:42:25,350 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 06:42:33,581 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1625194.6666666667, ans=0.0 2023-10-14 06:42:36,396 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1625241.3333333333, ans=0.0 2023-10-14 06:42:49,841 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1625288.0, ans=0.125 2023-10-14 06:43:09,247 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1625381.3333333333, ans=0.125 2023-10-14 06:43:13,657 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.471e+02 1.724e+02 1.850e+02 2.104e+02 2.895e+02, threshold=3.699e+02, percent-clipped=0.0 2023-10-14 06:43:53,530 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1625568.0, ans=0.1 2023-10-14 06:44:02,633 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1625614.6666666667, ans=0.2 2023-10-14 06:44:27,108 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1625708.0, ans=0.0 2023-10-14 06:44:28,124 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1625708.0, ans=0.1 2023-10-14 06:44:33,759 INFO [train.py:1031] (0/4) Epoch 26, batch 7000, loss[loss=0.21, simple_loss=0.2941, pruned_loss=0.06297, over 16951.00 frames. ], tot_loss[loss=0.1868, simple_loss=0.279, pruned_loss=0.04725, over 31776777.91 frames. ], batch size: 110, lr: 1.32e-03, grad_scale: 32.0 2023-10-14 06:44:58,695 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1625801.3333333333, ans=0.0 2023-10-14 06:45:00,847 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.58 vs. 
limit=6.0 2023-10-14 06:45:04,424 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.557e+02 1.846e+02 2.034e+02 2.154e+02 3.020e+02, threshold=4.069e+02, percent-clipped=0.0 2023-10-14 06:45:32,705 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1625988.0, ans=0.2 2023-10-14 06:45:38,410 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1625988.0, ans=0.125 2023-10-14 06:45:53,128 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1626081.3333333333, ans=0.125 2023-10-14 06:46:00,354 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1626081.3333333333, ans=0.125 2023-10-14 06:46:20,198 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1626174.6666666667, ans=0.125 2023-10-14 06:46:30,815 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1626221.3333333333, ans=0.0 2023-10-14 06:46:30,825 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1626221.3333333333, ans=0.0 2023-10-14 06:46:47,716 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1626268.0, ans=0.125 2023-10-14 06:46:49,236 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1626314.6666666667, ans=0.2 2023-10-14 06:46:56,221 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.908e+02 2.092e+02 2.328e+02 3.399e+02, threshold=4.183e+02, percent-clipped=0.0 2023-10-14 06:47:34,834 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=5.52 vs. limit=15.0 2023-10-14 06:47:44,995 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1626501.3333333333, ans=0.125 2023-10-14 06:47:52,056 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.90 vs. 
limit=6.0 2023-10-14 06:47:53,490 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1626548.0, ans=0.0 2023-10-14 06:48:05,394 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1626594.6666666667, ans=0.2 2023-10-14 06:48:35,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1626688.0, ans=0.125 2023-10-14 06:48:51,084 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1626734.6666666667, ans=0.0 2023-10-14 06:48:57,689 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.490e+02 1.749e+02 1.846e+02 2.013e+02 2.849e+02, threshold=3.693e+02, percent-clipped=0.0 2023-10-14 06:49:16,563 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1626874.6666666667, ans=0.2 2023-10-14 06:49:19,216 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.54 vs. limit=15.0 2023-10-14 06:49:44,676 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.42 vs. limit=22.5 2023-10-14 06:50:05,681 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1627061.3333333333, ans=0.125 2023-10-14 06:50:35,105 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.27 vs. limit=10.0 2023-10-14 06:50:58,703 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.512e+02 1.825e+02 1.930e+02 2.136e+02 3.235e+02, threshold=3.859e+02, percent-clipped=0.0 2023-10-14 06:50:59,162 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1627248.0, ans=0.07 2023-10-14 06:51:00,170 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1627248.0, ans=0.0 2023-10-14 06:51:11,757 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1627294.6666666667, ans=0.125 2023-10-14 06:51:22,129 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.09 vs. limit=22.5 2023-10-14 06:51:34,383 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1627388.0, ans=0.2 2023-10-14 06:51:36,296 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1627388.0, ans=0.1 2023-10-14 06:51:48,580 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1627434.6666666667, ans=0.09899494936611666 2023-10-14 06:52:01,055 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.35 vs. 
limit=15.0 2023-10-14 06:52:01,836 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1627528.0, ans=0.1 2023-10-14 06:52:14,062 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.74 vs. limit=10.0 2023-10-14 06:52:19,103 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1627574.6666666667, ans=0.1 2023-10-14 06:52:24,162 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1627621.3333333333, ans=0.125 2023-10-14 06:52:27,110 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1627621.3333333333, ans=0.2 2023-10-14 06:52:34,181 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1627668.0, ans=0.125 2023-10-14 06:52:37,794 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1627668.0, ans=0.0 2023-10-14 06:52:43,797 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1627714.6666666667, ans=0.1 2023-10-14 06:52:48,135 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.873e+02 2.049e+02 2.304e+02 3.204e+02, threshold=4.098e+02, percent-clipped=0.0 2023-10-14 06:53:14,896 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.min_positive, batch_count=1627854.6666666667, ans=0.025 2023-10-14 06:54:02,809 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1628041.3333333333, ans=0.1 2023-10-14 06:54:07,162 INFO [train.py:1031] (0/4) Epoch 26, batch 7500, loss[loss=0.1832, simple_loss=0.271, pruned_loss=0.04773, over 16627.00 frames. ], tot_loss[loss=0.1867, simple_loss=0.2789, pruned_loss=0.04729, over 32002625.18 frames. 
], batch size: 61, lr: 1.32e-03, grad_scale: 16.0 2023-10-14 06:54:34,220 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.480e+02 1.837e+02 1.980e+02 2.200e+02 4.370e+02, threshold=3.961e+02, percent-clipped=1.0 2023-10-14 06:54:40,645 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1628228.0, ans=0.1 2023-10-14 06:54:44,101 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 06:54:55,000 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1628274.6666666667, ans=0.0 2023-10-14 06:55:03,553 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1628321.3333333333, ans=0.1 2023-10-14 06:55:14,203 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1628368.0, ans=0.0 2023-10-14 06:55:48,946 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1628508.0, ans=0.0 2023-10-14 06:55:58,715 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1628554.6666666667, ans=0.125 2023-10-14 06:56:25,709 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1628648.0, ans=0.0 2023-10-14 06:56:25,824 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1628648.0, ans=0.0 2023-10-14 06:56:32,241 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.436e+02 1.883e+02 2.051e+02 2.276e+02 3.072e+02, threshold=4.103e+02, percent-clipped=0.0 2023-10-14 06:56:41,680 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.44 vs. limit=10.0 2023-10-14 06:56:46,414 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1628694.6666666667, ans=0.125 2023-10-14 06:57:05,788 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.55 vs. 
limit=12.0 2023-10-14 06:57:34,685 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1628881.3333333333, ans=6.0 2023-10-14 06:57:43,546 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1628928.0, ans=0.125 2023-10-14 06:57:52,645 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1628928.0, ans=0.125 2023-10-14 06:58:09,912 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1629021.3333333333, ans=0.125 2023-10-14 06:58:30,754 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1629114.6666666667, ans=0.125 2023-10-14 06:58:35,623 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 1.783e+02 1.973e+02 2.220e+02 3.119e+02, threshold=3.945e+02, percent-clipped=0.0 2023-10-14 06:58:43,669 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1629161.3333333333, ans=0.125 2023-10-14 06:58:52,180 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.max_abs, batch_count=1629208.0, ans=10.0 2023-10-14 06:59:02,553 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.69 vs. limit=15.0 2023-10-14 06:59:21,128 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1629301.3333333333, ans=0.2 2023-10-14 06:59:26,860 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1629348.0, ans=0.125 2023-10-14 07:00:03,385 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1629488.0, ans=0.125 2023-10-14 07:00:08,181 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1629534.6666666667, ans=0.09899494936611666 2023-10-14 07:00:09,494 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1629534.6666666667, ans=0.025 2023-10-14 07:00:22,551 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1629581.3333333333, ans=0.125 2023-10-14 07:00:28,890 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 1.838e+02 2.031e+02 2.252e+02 3.390e+02, threshold=4.061e+02, percent-clipped=0.0 2023-10-14 07:00:36,526 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.88 vs. 
limit=15.0 2023-10-14 07:00:47,754 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1629674.6666666667, ans=0.125 2023-10-14 07:00:55,762 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1629674.6666666667, ans=0.125 2023-10-14 07:01:08,958 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1629768.0, ans=0.125 2023-10-14 07:01:30,956 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.45 vs. limit=15.0 2023-10-14 07:01:45,544 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1629908.0, ans=0.125 2023-10-14 07:01:45,628 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 07:01:51,021 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.93 vs. limit=15.0 2023-10-14 07:02:09,019 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1630001.3333333333, ans=0.125 2023-10-14 07:02:09,031 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1630001.3333333333, ans=0.0 2023-10-14 07:02:22,688 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=1630048.0, ans=0.02 2023-10-14 07:02:27,724 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.824e+02 1.978e+02 2.209e+02 2.929e+02, threshold=3.956e+02, percent-clipped=0.0 2023-10-14 07:02:39,917 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1630094.6666666667, ans=0.2 2023-10-14 07:02:42,018 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1630094.6666666667, ans=0.0 2023-10-14 07:02:42,764 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1630094.6666666667, ans=0.125 2023-10-14 07:02:43,204 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.26 vs. limit=15.0 2023-10-14 07:02:51,135 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1630141.3333333333, ans=0.125 2023-10-14 07:02:55,669 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1630188.0, ans=0.0 2023-10-14 07:03:12,256 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.05 vs. limit=15.0 2023-10-14 07:03:16,254 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.38 vs. 
limit=15.0 2023-10-14 07:03:20,112 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1630281.3333333333, ans=0.125 2023-10-14 07:03:44,844 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1630374.6666666667, ans=0.04949747468305833 2023-10-14 07:03:48,922 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.84 vs. limit=15.0 2023-10-14 07:03:53,844 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1630421.3333333333, ans=0.0 2023-10-14 07:03:54,403 INFO [train.py:1031] (0/4) Epoch 26, batch 8000, loss[loss=0.1962, simple_loss=0.2759, pruned_loss=0.05827, over 15548.00 frames. ], tot_loss[loss=0.186, simple_loss=0.2784, pruned_loss=0.0468, over 32175621.51 frames. ], batch size: 350, lr: 1.32e-03, grad_scale: 32.0 2023-10-14 07:03:55,735 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1630421.3333333333, ans=0.125 2023-10-14 07:04:17,345 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1630514.6666666667, ans=0.0 2023-10-14 07:04:22,407 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.677e+02 1.841e+02 2.094e+02 3.064e+02, threshold=3.681e+02, percent-clipped=0.0 2023-10-14 07:04:25,494 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1630561.3333333333, ans=0.2 2023-10-14 07:04:27,912 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1630561.3333333333, ans=0.125 2023-10-14 07:04:39,302 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten.whitening_limit, batch_count=1630608.0, ans=15.0 2023-10-14 07:05:28,668 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1630794.6666666667, ans=0.0 2023-10-14 07:05:41,928 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1630888.0, ans=0.125 2023-10-14 07:05:49,633 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1630888.0, ans=0.0 2023-10-14 07:05:57,619 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1630934.6666666667, ans=0.125 2023-10-14 07:05:58,412 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1630934.6666666667, ans=0.0 2023-10-14 07:06:11,081 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.545e+02 1.803e+02 1.945e+02 2.257e+02 2.887e+02, threshold=3.891e+02, percent-clipped=0.0 2023-10-14 07:06:26,820 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.whiten.whitening_limit, batch_count=1631074.6666666667, ans=12.0 2023-10-14 07:06:43,347 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1631121.3333333333, ans=0.125 2023-10-14 07:06:47,901 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1631121.3333333333, ans=0.0 2023-10-14 07:07:24,249 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1631214.6666666667, ans=0.125 2023-10-14 07:07:50,527 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1631308.0, ans=0.1 2023-10-14 07:07:50,774 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.54 vs. limit=12.0 2023-10-14 07:07:56,229 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1631354.6666666667, ans=0.125 2023-10-14 07:07:56,369 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1631354.6666666667, ans=0.025 2023-10-14 07:07:58,070 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1631354.6666666667, ans=0.1 2023-10-14 07:08:17,628 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1631448.0, ans=0.0 2023-10-14 07:08:18,612 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1631448.0, ans=0.0 2023-10-14 07:08:23,753 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.536e+02 1.825e+02 2.035e+02 2.254e+02 3.659e+02, threshold=4.070e+02, percent-clipped=0.0 2023-10-14 07:08:53,585 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1631588.0, ans=0.1 2023-10-14 07:09:05,121 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1631634.6666666667, ans=0.125 2023-10-14 07:09:10,530 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1631634.6666666667, ans=15.0 2023-10-14 07:10:20,009 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.438e+02 1.822e+02 1.986e+02 2.191e+02 2.957e+02, threshold=3.972e+02, percent-clipped=0.0 2023-10-14 07:10:26,148 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1631961.3333333333, ans=0.0 2023-10-14 07:10:31,824 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1631961.3333333333, ans=0.0 2023-10-14 07:10:42,894 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1632008.0, ans=0.1 2023-10-14 07:10:45,619 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1632054.6666666667, ans=0.1 2023-10-14 07:10:46,768 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1632054.6666666667, ans=0.125 2023-10-14 07:10:48,696 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_abs, batch_count=1632054.6666666667, ans=0.5 2023-10-14 07:10:48,824 INFO [scaling.py:979] (0/4) Whitening: 
name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.71 vs. limit=12.0 2023-10-14 07:10:56,603 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=9.70 vs. limit=22.5 2023-10-14 07:10:59,226 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=5.35 vs. limit=12.0 2023-10-14 07:10:59,366 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.92 vs. limit=15.0 2023-10-14 07:11:09,785 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1632148.0, ans=0.125 2023-10-14 07:11:13,290 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.61 vs. limit=22.5 2023-10-14 07:11:26,030 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1632194.6666666667, ans=0.2 2023-10-14 07:11:40,582 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1632288.0, ans=10.0 2023-10-14 07:11:40,880 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.59 vs. limit=12.0 2023-10-14 07:11:49,754 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1632288.0, ans=0.2 2023-10-14 07:11:49,756 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1632288.0, ans=0.0 2023-10-14 07:11:59,531 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1632334.6666666667, ans=0.125 2023-10-14 07:11:59,635 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1632334.6666666667, ans=0.0 2023-10-14 07:12:13,919 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.605e+02 1.852e+02 1.974e+02 2.134e+02 3.639e+02, threshold=3.949e+02, percent-clipped=0.0 2023-10-14 07:12:14,271 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1632381.3333333333, ans=0.0 2023-10-14 07:12:20,698 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.55 vs. limit=15.0 2023-10-14 07:12:27,995 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1632428.0, ans=0.0 2023-10-14 07:12:30,919 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1632474.6666666667, ans=0.5 2023-10-14 07:12:37,218 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1632474.6666666667, ans=0.1 2023-10-14 07:12:49,210 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.17 vs. 
limit=10.0 2023-10-14 07:13:00,804 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1632568.0, ans=0.0 2023-10-14 07:13:08,324 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1632614.6666666667, ans=0.0 2023-10-14 07:13:23,597 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1632661.3333333333, ans=0.0 2023-10-14 07:13:34,156 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1632708.0, ans=0.125 2023-10-14 07:13:40,564 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1632754.6666666667, ans=0.0 2023-10-14 07:13:41,163 INFO [train.py:1031] (0/4) Epoch 26, batch 8500, loss[loss=0.1809, simple_loss=0.277, pruned_loss=0.0424, over 16885.00 frames. ], tot_loss[loss=0.1863, simple_loss=0.2789, pruned_loss=0.04689, over 32310968.53 frames. ], batch size: 110, lr: 1.32e-03, grad_scale: 32.0 2023-10-14 07:14:12,154 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.924e+02 2.121e+02 2.299e+02 3.263e+02, threshold=4.243e+02, percent-clipped=0.0 2023-10-14 07:14:12,665 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1632848.0, ans=0.1 2023-10-14 07:14:23,308 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1632894.6666666667, ans=0.0 2023-10-14 07:14:26,901 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1632941.3333333333, ans=0.125 2023-10-14 07:14:30,286 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1632941.3333333333, ans=0.125 2023-10-14 07:14:42,512 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.30 vs. limit=22.5 2023-10-14 07:14:44,802 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 07:14:45,722 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1632988.0, ans=0.2 2023-10-14 07:14:55,395 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1633034.6666666667, ans=0.125 2023-10-14 07:15:02,639 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1633081.3333333333, ans=0.0 2023-10-14 07:15:24,762 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 07:15:33,869 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.91 vs. 
limit=15.0 2023-10-14 07:15:57,863 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1633268.0, ans=0.0 2023-10-14 07:16:13,519 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.783e+02 1.985e+02 2.199e+02 3.135e+02, threshold=3.970e+02, percent-clipped=0.0 2023-10-14 07:16:24,739 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 07:16:31,864 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 07:16:36,718 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 07:16:43,694 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1633454.6666666667, ans=0.0 2023-10-14 07:16:47,753 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1633454.6666666667, ans=0.0 2023-10-14 07:16:58,214 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1633501.3333333333, ans=0.09899494936611666 2023-10-14 07:17:20,767 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.21 vs. limit=12.0 2023-10-14 07:17:21,604 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1633594.6666666667, ans=0.125 2023-10-14 07:17:22,901 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.26 vs. 
limit=15.0 2023-10-14 07:17:26,043 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1633594.6666666667, ans=0.2 2023-10-14 07:17:33,363 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1633641.3333333333, ans=0.0 2023-10-14 07:17:38,968 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1633688.0, ans=0.0 2023-10-14 07:17:49,725 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 07:18:01,871 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1633781.3333333333, ans=0.0 2023-10-14 07:18:03,918 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten.whitening_limit, batch_count=1633781.3333333333, ans=22.5 2023-10-14 07:18:07,818 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1633781.3333333333, ans=0.1 2023-10-14 07:18:08,889 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1633781.3333333333, ans=0.2 2023-10-14 07:18:12,530 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.720e+02 1.918e+02 2.109e+02 2.890e+02, threshold=3.836e+02, percent-clipped=0.0 2023-10-14 07:18:15,713 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1633828.0, ans=0.1 2023-10-14 07:18:18,149 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1633828.0, ans=0.0 2023-10-14 07:18:21,875 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1633828.0, ans=0.0 2023-10-14 07:18:21,880 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1633828.0, ans=0.125 2023-10-14 07:18:54,657 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1633968.0, ans=0.2 2023-10-14 07:19:02,221 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1633968.0, ans=0.125 2023-10-14 07:19:06,638 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1634014.6666666667, ans=0.0 2023-10-14 07:19:52,538 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1634201.3333333333, ans=0.1 2023-10-14 07:19:52,609 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1634201.3333333333, ans=0.125 2023-10-14 07:19:53,890 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1634201.3333333333, ans=0.125 2023-10-14 07:19:56,757 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1634201.3333333333, ans=0.125 2023-10-14 07:19:58,756 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1634201.3333333333, ans=0.0 2023-10-14 07:20:12,009 
INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.785e+02 1.870e+02 2.236e+02 3.649e+02, threshold=3.739e+02, percent-clipped=0.0 2023-10-14 07:20:13,882 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1634294.6666666667, ans=0.0 2023-10-14 07:20:25,116 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1634341.3333333333, ans=0.125 2023-10-14 07:20:30,079 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1634341.3333333333, ans=0.125 2023-10-14 07:20:39,922 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1634388.0, ans=0.1 2023-10-14 07:20:40,905 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1634388.0, ans=0.0 2023-10-14 07:20:48,168 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1634434.6666666667, ans=0.0 2023-10-14 07:20:53,408 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 07:21:00,694 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1634481.3333333333, ans=0.0 2023-10-14 07:22:00,134 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.571e+02 1.838e+02 2.063e+02 2.261e+02 3.155e+02, threshold=4.125e+02, percent-clipped=0.0 2023-10-14 07:22:03,516 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1634761.3333333333, ans=0.125 2023-10-14 07:22:21,106 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.16 vs. limit=22.5 2023-10-14 07:22:50,475 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.36 vs. limit=15.0 2023-10-14 07:23:04,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1634994.6666666667, ans=0.125 2023-10-14 07:23:07,003 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-10-14 07:23:17,804 INFO [train.py:1031] (0/4) Epoch 26, batch 9000, loss[loss=0.1941, simple_loss=0.2955, pruned_loss=0.04632, over 16729.00 frames. ], tot_loss[loss=0.1859, simple_loss=0.2784, pruned_loss=0.04669, over 32455344.26 frames. 
], batch size: 202, lr: 1.32e-03, grad_scale: 16.0 2023-10-14 07:23:21,000 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1635088.0, ans=0.0 2023-10-14 07:23:27,522 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1635088.0, ans=0.125 2023-10-14 07:23:47,237 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1635181.3333333333, ans=0.125 2023-10-14 07:23:49,698 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.567e+02 1.838e+02 2.026e+02 2.317e+02 2.925e+02, threshold=4.051e+02, percent-clipped=0.0 2023-10-14 07:23:58,684 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1635228.0, ans=0.0 2023-10-14 07:24:03,808 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.15 vs. limit=15.0 2023-10-14 07:24:10,206 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.63 vs. limit=10.0 2023-10-14 07:24:10,844 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1635321.3333333333, ans=0.125 2023-10-14 07:24:19,396 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.84 vs. limit=22.5 2023-10-14 07:24:25,196 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1635368.0, ans=0.1 2023-10-14 07:24:31,973 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1635368.0, ans=0.125 2023-10-14 07:24:32,909 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1635414.6666666667, ans=0.1 2023-10-14 07:24:36,487 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1635414.6666666667, ans=0.0 2023-10-14 07:24:36,819 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.56 vs. 
limit=15.0 2023-10-14 07:24:54,143 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1635508.0, ans=0.0 2023-10-14 07:25:10,491 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1635554.6666666667, ans=0.125 2023-10-14 07:25:35,031 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.398e+02 1.773e+02 1.933e+02 2.293e+02 3.051e+02, threshold=3.866e+02, percent-clipped=0.0 2023-10-14 07:25:36,449 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1635694.6666666667, ans=0.125 2023-10-14 07:25:42,825 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1635694.6666666667, ans=0.125 2023-10-14 07:25:45,366 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 07:26:05,450 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1635834.6666666667, ans=0.125 2023-10-14 07:26:09,609 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1635834.6666666667, ans=0.0 2023-10-14 07:26:23,627 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 07:26:33,123 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1635928.0, ans=0.125 2023-10-14 07:26:34,768 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1635928.0, ans=0.0 2023-10-14 07:26:41,719 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1635974.6666666667, ans=0.09899494936611666 2023-10-14 07:26:50,676 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1636021.3333333333, ans=0.2 2023-10-14 07:26:51,528 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1636021.3333333333, ans=0.0 2023-10-14 07:26:58,651 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1636068.0, ans=0.2 2023-10-14 07:27:17,336 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.854e+02 1.969e+02 2.199e+02 3.397e+02, threshold=3.937e+02, percent-clipped=0.0 2023-10-14 07:27:20,912 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1636161.3333333333, ans=0.125 2023-10-14 07:27:29,828 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1636208.0, ans=0.0 2023-10-14 07:27:29,896 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1636208.0, ans=0.125 2023-10-14 07:27:30,959 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1636208.0, ans=0.2 2023-10-14 07:27:41,413 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1636254.6666666667, ans=0.2 
2023-10-14 07:27:43,142 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1636254.6666666667, ans=0.1 2023-10-14 07:27:54,445 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1636301.3333333333, ans=0.2 2023-10-14 07:28:17,612 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1636394.6666666667, ans=0.2 2023-10-14 07:28:20,852 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1636394.6666666667, ans=0.125 2023-10-14 07:28:23,793 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-14 07:28:30,850 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.26 vs. limit=22.5 2023-10-14 07:28:56,834 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.16 vs. limit=22.5 2023-10-14 07:28:59,656 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1636581.3333333333, ans=0.0 2023-10-14 07:28:59,658 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1636581.3333333333, ans=0.2 2023-10-14 07:29:02,165 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.06 vs. limit=15.0 2023-10-14 07:29:04,581 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.539e+02 1.909e+02 2.128e+02 2.349e+02 3.082e+02, threshold=4.255e+02, percent-clipped=0.0 2023-10-14 07:29:37,618 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1636721.3333333333, ans=0.0 2023-10-14 07:30:08,405 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1636814.6666666667, ans=0.125 2023-10-14 07:30:18,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1636861.3333333333, ans=0.125 2023-10-14 07:30:19,332 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1636861.3333333333, ans=0.0 2023-10-14 07:30:29,868 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1636908.0, ans=0.0 2023-10-14 07:30:33,652 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1636908.0, ans=0.125 2023-10-14 07:30:35,953 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.37 vs. 
limit=22.5 2023-10-14 07:30:38,769 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1636954.6666666667, ans=0.125 2023-10-14 07:30:45,728 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1637001.3333333333, ans=0.0 2023-10-14 07:30:52,101 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1637001.3333333333, ans=0.125 2023-10-14 07:30:54,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1637001.3333333333, ans=0.125 2023-10-14 07:30:55,469 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1637001.3333333333, ans=0.0 2023-10-14 07:31:07,497 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.570e+02 1.775e+02 1.959e+02 2.152e+02 2.918e+02, threshold=3.918e+02, percent-clipped=0.0 2023-10-14 07:31:19,898 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.86 vs. limit=15.0 2023-10-14 07:31:29,750 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.99 vs. limit=15.0 2023-10-14 07:31:30,362 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1637141.3333333333, ans=0.0 2023-10-14 07:31:41,443 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1637188.0, ans=0.125 2023-10-14 07:31:56,255 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1637281.3333333333, ans=0.125 2023-10-14 07:32:09,106 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1637328.0, ans=10.0 2023-10-14 07:32:16,979 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.66 vs. limit=8.0 2023-10-14 07:32:18,453 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.92 vs. limit=12.0 2023-10-14 07:32:31,296 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.36 vs. limit=15.0 2023-10-14 07:32:34,336 INFO [train.py:1031] (0/4) Epoch 26, batch 9500, loss[loss=0.182, simple_loss=0.2836, pruned_loss=0.04017, over 16843.00 frames. ], tot_loss[loss=0.1866, simple_loss=0.2791, pruned_loss=0.04706, over 32496016.67 frames. ], batch size: 98, lr: 1.32e-03, grad_scale: 32.0 2023-10-14 07:32:34,842 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.01 vs. 
limit=15.0 2023-10-14 07:32:47,152 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1637468.0, ans=0.0 2023-10-14 07:32:58,831 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1637514.6666666667, ans=0.1 2023-10-14 07:33:04,740 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 1.857e+02 2.010e+02 2.230e+02 3.118e+02, threshold=4.020e+02, percent-clipped=0.0 2023-10-14 07:33:15,026 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1637561.3333333333, ans=0.125 2023-10-14 07:33:23,851 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1637608.0, ans=0.1 2023-10-14 07:33:23,970 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1637608.0, ans=0.2 2023-10-14 07:33:26,195 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.50 vs. limit=15.0 2023-10-14 07:33:28,709 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1637654.6666666667, ans=0.125 2023-10-14 07:33:29,029 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.15 vs. limit=12.0 2023-10-14 07:33:54,751 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1637748.0, ans=0.95 2023-10-14 07:34:00,669 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1637794.6666666667, ans=0.035 2023-10-14 07:34:39,303 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1637934.6666666667, ans=0.1 2023-10-14 07:34:51,137 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1637981.3333333333, ans=10.0 2023-10-14 07:34:55,649 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.523e+02 1.830e+02 1.975e+02 2.227e+02 3.313e+02, threshold=3.951e+02, percent-clipped=0.0 2023-10-14 07:35:09,405 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1638074.6666666667, ans=0.125 2023-10-14 07:35:11,133 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1638074.6666666667, ans=0.5 2023-10-14 07:35:19,734 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1638121.3333333333, ans=0.125 2023-10-14 07:35:49,906 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1638214.6666666667, ans=0.125 2023-10-14 07:35:58,680 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1638261.3333333333, ans=0.125 2023-10-14 07:36:03,397 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1638261.3333333333, ans=0.125 
2023-10-14 07:36:15,770 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.95 vs. limit=22.5 2023-10-14 07:36:20,720 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1638354.6666666667, ans=0.0 2023-10-14 07:36:25,310 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1638354.6666666667, ans=0.125 2023-10-14 07:36:47,014 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.525e+02 1.883e+02 2.041e+02 2.312e+02 4.063e+02, threshold=4.082e+02, percent-clipped=1.0 2023-10-14 07:36:55,131 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1638494.6666666667, ans=0.0 2023-10-14 07:36:57,249 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.32 vs. limit=10.0 2023-10-14 07:37:21,174 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1638634.6666666667, ans=0.125 2023-10-14 07:37:37,582 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1638681.3333333333, ans=0.125 2023-10-14 07:37:42,510 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 07:37:52,725 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 07:37:54,007 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.67 vs. limit=15.0 2023-10-14 07:37:56,791 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.68 vs. limit=22.5 2023-10-14 07:37:59,352 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1638774.6666666667, ans=0.0 2023-10-14 07:38:00,245 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1638774.6666666667, ans=0.125 2023-10-14 07:38:02,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1638774.6666666667, ans=0.1 2023-10-14 07:38:39,901 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.398e+02 1.796e+02 1.968e+02 2.217e+02 3.431e+02, threshold=3.936e+02, percent-clipped=0.0 2023-10-14 07:38:40,324 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 07:38:40,631 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.22 vs. 
limit=12.0 2023-10-14 07:38:53,696 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1639008.0, ans=0.125 2023-10-14 07:38:55,132 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1639008.0, ans=0.2 2023-10-14 07:38:57,095 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1639008.0, ans=0.0 2023-10-14 07:39:06,027 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1639054.6666666667, ans=0.0 2023-10-14 07:39:17,251 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.16 vs. limit=22.5 2023-10-14 07:39:18,363 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1639101.3333333333, ans=0.125 2023-10-14 07:39:19,639 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.12 vs. limit=22.5 2023-10-14 07:39:58,730 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 07:39:58,771 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1639288.0, ans=0.1 2023-10-14 07:39:59,773 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1639288.0, ans=0.0 2023-10-14 07:40:14,359 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1639334.6666666667, ans=0.125 2023-10-14 07:40:24,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff3.min_abs, batch_count=1639381.3333333333, ans=0.2 2023-10-14 07:40:24,628 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1639381.3333333333, ans=0.125 2023-10-14 07:40:29,643 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1639428.0, ans=0.125 2023-10-14 07:40:30,124 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.890e+02 2.050e+02 2.341e+02 3.194e+02, threshold=4.100e+02, percent-clipped=0.0 2023-10-14 07:40:38,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1639428.0, ans=0.125 2023-10-14 07:40:49,256 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1639474.6666666667, ans=0.015 2023-10-14 07:41:09,353 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 07:41:12,969 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1639614.6666666667, ans=0.0 2023-10-14 07:41:31,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1639661.3333333333, ans=0.1 2023-10-14 07:41:41,354 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1639708.0, ans=0.125 2023-10-14 07:41:43,479 INFO [scaling.py:979] (0/4) 
Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.30 vs. limit=10.0 2023-10-14 07:41:45,002 INFO [train.py:1031] (0/4) Epoch 26, batch 10000, loss[loss=0.1747, simple_loss=0.2715, pruned_loss=0.03899, over 16949.00 frames. ], tot_loss[loss=0.1859, simple_loss=0.2783, pruned_loss=0.04676, over 32551500.12 frames. ], batch size: 138, lr: 1.31e-03, grad_scale: 16.0 2023-10-14 07:41:50,628 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1639754.6666666667, ans=0.125 2023-10-14 07:42:12,939 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1639848.0, ans=0.0 2023-10-14 07:42:17,890 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.40 vs. limit=15.0 2023-10-14 07:42:18,184 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.535e+02 1.836e+02 1.952e+02 2.087e+02 2.619e+02, threshold=3.905e+02, percent-clipped=0.0 2023-10-14 07:42:26,294 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1639894.6666666667, ans=0.125 2023-10-14 07:42:35,002 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1639941.3333333333, ans=0.1 2023-10-14 07:42:35,863 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1639941.3333333333, ans=0.0 2023-10-14 07:42:39,557 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1639988.0, ans=0.125 2023-10-14 07:42:59,864 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1640081.3333333333, ans=0.125 2023-10-14 07:43:05,949 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1640081.3333333333, ans=0.2 2023-10-14 07:43:10,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1640081.3333333333, ans=0.125 2023-10-14 07:43:17,489 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1640128.0, ans=0.0 2023-10-14 07:43:28,638 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=7.19 vs. 
limit=12.0 2023-10-14 07:43:30,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1640174.6666666667, ans=0.1 2023-10-14 07:43:30,358 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1640174.6666666667, ans=0.2 2023-10-14 07:44:13,260 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1640361.3333333333, ans=0.125 2023-10-14 07:44:13,821 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.538e+02 1.803e+02 1.982e+02 2.219e+02 2.772e+02, threshold=3.964e+02, percent-clipped=0.0 2023-10-14 07:44:36,807 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1640454.6666666667, ans=0.09899494936611666 2023-10-14 07:44:43,576 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1640501.3333333333, ans=0.1 2023-10-14 07:44:53,222 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.26 vs. limit=15.0 2023-10-14 07:45:08,884 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1640594.6666666667, ans=0.125 2023-10-14 07:45:09,151 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.22 vs. limit=15.0 2023-10-14 07:45:10,101 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.56 vs. limit=15.0 2023-10-14 07:45:16,405 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.76 vs. limit=15.0 2023-10-14 07:45:29,338 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1640688.0, ans=0.1 2023-10-14 07:45:46,532 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1640734.6666666667, ans=0.0 2023-10-14 07:45:52,319 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1640734.6666666667, ans=0.1 2023-10-14 07:46:09,144 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.563e+02 1.856e+02 2.039e+02 2.307e+02 3.070e+02, threshold=4.077e+02, percent-clipped=0.0 2023-10-14 07:46:22,493 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1640874.6666666667, ans=0.125 2023-10-14 07:46:28,899 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.63 vs. 
limit=6.0 2023-10-14 07:46:34,875 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1640921.3333333333, ans=0.0 2023-10-14 07:47:02,443 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1641014.6666666667, ans=0.125 2023-10-14 07:47:23,621 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1641108.0, ans=0.035 2023-10-14 07:47:23,960 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.78 vs. limit=15.0 2023-10-14 07:47:35,845 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1641154.6666666667, ans=0.0 2023-10-14 07:47:36,197 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.05 vs. limit=12.0 2023-10-14 07:47:40,738 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.68 vs. limit=15.0 2023-10-14 07:47:49,322 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1641201.3333333333, ans=0.1 2023-10-14 07:48:05,106 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.768e+02 1.967e+02 2.199e+02 3.248e+02, threshold=3.934e+02, percent-clipped=0.0 2023-10-14 07:48:33,577 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1641388.0, ans=0.035 2023-10-14 07:48:56,206 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1641481.3333333333, ans=0.0 2023-10-14 07:48:59,851 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1641528.0, ans=0.2 2023-10-14 07:49:06,318 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.03 vs. 
limit=15.0 2023-10-14 07:49:29,426 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1641621.3333333333, ans=0.0 2023-10-14 07:49:36,690 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1641668.0, ans=0.0 2023-10-14 07:49:40,119 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1641668.0, ans=0.0 2023-10-14 07:49:45,421 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1641714.6666666667, ans=0.125 2023-10-14 07:49:51,243 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1641714.6666666667, ans=0.0 2023-10-14 07:50:01,900 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.582e+02 1.846e+02 2.087e+02 2.315e+02 3.353e+02, threshold=4.174e+02, percent-clipped=0.0 2023-10-14 07:50:02,485 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1641761.3333333333, ans=0.125 2023-10-14 07:50:07,811 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1641761.3333333333, ans=0.0 2023-10-14 07:50:11,193 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=5.86 vs. limit=15.0 2023-10-14 07:50:33,850 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1641901.3333333333, ans=0.0 2023-10-14 07:50:47,877 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.54 vs. limit=12.0 2023-10-14 07:50:50,693 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1641948.0, ans=0.1 2023-10-14 07:50:51,642 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1641948.0, ans=0.125 2023-10-14 07:50:54,026 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1641948.0, ans=0.125 2023-10-14 07:51:17,723 INFO [train.py:1031] (0/4) Epoch 26, batch 10500, loss[loss=0.2418, simple_loss=0.31, pruned_loss=0.08687, over 15606.00 frames. ], tot_loss[loss=0.1861, simple_loss=0.2786, pruned_loss=0.04683, over 32593910.49 frames. ], batch size: 350, lr: 1.31e-03, grad_scale: 8.0 2023-10-14 07:51:19,103 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1642088.0, ans=0.125 2023-10-14 07:51:51,508 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.458e+02 1.769e+02 1.926e+02 2.128e+02 3.242e+02, threshold=3.852e+02, percent-clipped=0.0 2023-10-14 07:51:53,699 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1642228.0, ans=0.1 2023-10-14 07:51:56,813 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.18 vs. 
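
The train-progress line above (Epoch 26, batch 10500) is consistent with the combined objective being 0.5 × simple_loss + pruned_loss: 0.5 × 0.31 + 0.08687 ≈ 0.2419 matches the logged loss=0.2418, and the same holds for the tot_loss triple (0.5 × 0.2786 + 0.04683 ≈ 0.1861). This decomposition is inferred from the logged numbers rather than quoted from train.py:

```python
def combined_loss(simple_loss: float, pruned_loss: float,
                  simple_loss_scale: float = 0.5) -> float:
    # Assumed pruned-transducer combination; it reproduces the log.
    return simple_loss_scale * simple_loss + pruned_loss

assert abs(combined_loss(0.31, 0.08687) - 0.2418) < 1e-3    # batch loss
assert abs(combined_loss(0.2786, 0.04683) - 0.1861) < 1e-3  # tot_loss
```
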
limit=6.0 2023-10-14 07:51:56,906 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.11 vs. limit=15.0 2023-10-14 07:52:12,801 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1642321.3333333333, ans=0.0 2023-10-14 07:52:19,715 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1642368.0, ans=0.125 2023-10-14 07:52:44,862 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1642414.6666666667, ans=0.2 2023-10-14 07:52:54,657 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1642461.3333333333, ans=0.2 2023-10-14 07:52:56,904 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.67 vs. limit=15.0 2023-10-14 07:53:25,665 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1642554.6666666667, ans=0.2 2023-10-14 07:53:30,550 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1642601.3333333333, ans=0.0 2023-10-14 07:53:40,961 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1642648.0, ans=0.125 2023-10-14 07:53:41,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=1642648.0, ans=0.5 2023-10-14 07:53:44,703 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.max_positive, batch_count=1642648.0, ans=0.95 2023-10-14 07:53:45,698 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-352000.pt 2023-10-14 07:53:58,898 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1642694.6666666667, ans=0.125 2023-10-14 07:54:00,638 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.566e+02 1.832e+02 2.023e+02 2.207e+02 2.917e+02, threshold=4.047e+02, percent-clipped=0.0 2023-10-14 07:54:12,106 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1642741.3333333333, ans=0.125 2023-10-14 07:54:14,080 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1642741.3333333333, ans=0.1 2023-10-14 07:55:03,091 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1642928.0, ans=0.1 2023-10-14 07:55:10,741 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1642974.6666666667, ans=0.125 2023-10-14 07:55:23,717 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.48 vs. 
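
The checkpoint write above lands on a round batch index (checkpoint-352000.pt), which matches a simple save-every-N rule; N = 8000 here is assumed from this run's configuration, and 352000 = 44 × 8000. A sketch of that test:

```python
def should_save_batch_checkpoint(batch_idx_train: int,
                                 save_every_n: int = 8000) -> bool:
    # Periodic batch-indexed checkpointing; the interval is an
    # assumption, the filename pattern mirrors the log.
    return batch_idx_train > 0 and batch_idx_train % save_every_n == 0

assert should_save_batch_checkpoint(352000)  # -> checkpoint-352000.pt
```
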
limit=15.0 2023-10-14 07:55:28,444 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1643068.0, ans=0.2 2023-10-14 07:55:46,243 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1643114.6666666667, ans=0.125 2023-10-14 07:55:53,701 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.580e+02 1.821e+02 1.959e+02 2.139e+02 2.735e+02, threshold=3.918e+02, percent-clipped=0.0 2023-10-14 07:56:13,360 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1643254.6666666667, ans=0.125 2023-10-14 07:56:13,541 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.24 vs. limit=15.0 2023-10-14 07:56:25,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1643301.3333333333, ans=0.2 2023-10-14 07:56:51,968 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1643394.6666666667, ans=0.0 2023-10-14 07:57:05,814 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1643441.3333333333, ans=0.2 2023-10-14 07:57:12,297 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1643488.0, ans=0.2 2023-10-14 07:57:25,955 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1643534.6666666667, ans=0.125 2023-10-14 07:57:27,775 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1643534.6666666667, ans=0.125 2023-10-14 07:57:44,104 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.690e+02 1.874e+02 2.044e+02 2.319e+02 3.357e+02, threshold=4.087e+02, percent-clipped=0.0 2023-10-14 07:57:47,197 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1643628.0, ans=0.125 2023-10-14 07:57:51,451 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1643674.6666666667, ans=0.09899494936611666 2023-10-14 07:57:57,161 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=9.45 vs. 
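
A `Whitening: ... metric=M vs. limit=L` line is emitted when a module's measured whiteness metric exceeds its configured limit and the corrective penalty activates. One plausible scale-invariant metric, sketched below, equals 1.0 when the channel covariance is a multiple of the identity and approaches the channel count when one direction dominates; this is a stand-in, not necessarily the exact formula in scaling.py:

```python
import torch

def whitening_metric(x: torch.Tensor) -> float:
    """x: (num_frames, num_channels). Proportional to
    C * sum(eigvals**2) / sum(eigvals)**2 of the covariance."""
    x = x - x.mean(dim=0)
    cov = (x.t() @ x) / x.shape[0]
    num_channels = cov.shape[0]
    # sum of squared eigenvalues == squared Frobenius norm of cov
    return float(num_channels * (cov ** 2).sum() / cov.trace() ** 2)

m = whitening_metric(torch.randn(1000, 256))  # near-white input -> ~1.0
print(f"metric={m:.2f} vs. limit=15.0")
```
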
limit=22.5 2023-10-14 07:58:23,929 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1643814.6666666667, ans=0.2 2023-10-14 07:58:29,298 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=1643814.6666666667, ans=0.95 2023-10-14 07:58:30,532 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1643814.6666666667, ans=0.1 2023-10-14 07:58:38,139 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=1643861.3333333333, ans=0.05 2023-10-14 07:58:40,038 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1643861.3333333333, ans=0.1 2023-10-14 07:58:42,647 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1643861.3333333333, ans=0.125 2023-10-14 07:58:42,703 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1643861.3333333333, ans=0.125 2023-10-14 07:58:57,261 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.72 vs. limit=10.0 2023-10-14 07:59:09,808 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1644001.3333333333, ans=6.0 2023-10-14 07:59:12,672 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1644001.3333333333, ans=0.2 2023-10-14 07:59:30,791 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1644094.6666666667, ans=0.0 2023-10-14 07:59:31,315 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.481e+02 1.804e+02 1.970e+02 2.179e+02 3.025e+02, threshold=3.940e+02, percent-clipped=0.0 2023-10-14 08:00:31,631 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.14 vs. limit=22.5 2023-10-14 08:00:44,163 INFO [train.py:1031] (0/4) Epoch 26, batch 11000, loss[loss=0.1714, simple_loss=0.2677, pruned_loss=0.03751, over 16358.00 frames. ], tot_loss[loss=0.1863, simple_loss=0.2787, pruned_loss=0.04692, over 32621179.92 frames. ], batch size: 50, lr: 1.31e-03, grad_scale: 16.0 2023-10-14 08:00:48,488 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.48 vs. 
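
The `lr: 1.31e-03` in the batch 11000 line is consistent with an Eden-style schedule that decays with both the global batch index and the fractional epoch count. The sketch below reproduces the logged value using this run's base LR of 0.045 with lr_batches=7500 and lr_epochs=1.0; the epoch argument (~25.2 completed epochs) is an estimate:

```python
def eden_lr(base_lr: float, batch: float, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 1.0) -> float:
    # Eden-style schedule as we understand it; constants from this run.
    batch_f = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_f = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_f * epoch_f

# ~352k batches into training, ~25.2 epochs completed:
print(eden_lr(0.045, batch=352_000, epoch=25.2))  # -> ~1.31e-03
```
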
limit=10.0 2023-10-14 08:00:50,854 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1644421.3333333333, ans=0.125 2023-10-14 08:00:54,680 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1644468.0, ans=0.0 2023-10-14 08:01:09,557 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1644514.6666666667, ans=0.0 2023-10-14 08:01:20,119 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.595e+02 1.881e+02 2.042e+02 2.239e+02 3.231e+02, threshold=4.085e+02, percent-clipped=0.0 2023-10-14 08:01:42,733 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1644654.6666666667, ans=0.1 2023-10-14 08:02:09,256 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1644748.0, ans=0.125 2023-10-14 08:02:21,933 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1644794.6666666667, ans=0.125 2023-10-14 08:02:29,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1644841.3333333333, ans=0.125 2023-10-14 08:02:39,319 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1644888.0, ans=0.125 2023-10-14 08:03:00,875 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1644934.6666666667, ans=0.0 2023-10-14 08:03:22,884 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.454e+02 1.812e+02 1.960e+02 2.191e+02 2.946e+02, threshold=3.920e+02, percent-clipped=0.0 2023-10-14 08:03:28,750 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.79 vs. limit=12.0 2023-10-14 08:03:33,216 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1645074.6666666667, ans=0.5 2023-10-14 08:03:42,411 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1645074.6666666667, ans=0.125 2023-10-14 08:03:55,605 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1645168.0, ans=0.0 2023-10-14 08:04:10,659 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1645214.6666666667, ans=0.0 2023-10-14 08:04:16,917 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.04 vs. 
limit=6.0 2023-10-14 08:04:25,918 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1645261.3333333333, ans=10.0 2023-10-14 08:04:56,610 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1645401.3333333333, ans=0.1 2023-10-14 08:05:02,537 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1645448.0, ans=0.0 2023-10-14 08:05:04,202 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=15.24 vs. limit=22.5 2023-10-14 08:05:08,222 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1645448.0, ans=0.125 2023-10-14 08:05:09,137 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1645448.0, ans=0.125 2023-10-14 08:05:14,578 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 1.744e+02 1.943e+02 2.090e+02 2.856e+02, threshold=3.886e+02, percent-clipped=0.0 2023-10-14 08:05:23,433 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.35 vs. limit=15.0 2023-10-14 08:05:25,058 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1645541.3333333333, ans=10.0 2023-10-14 08:05:30,007 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.09 vs. limit=15.0 2023-10-14 08:05:57,315 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.09 vs. limit=6.0 2023-10-14 08:05:59,592 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1645681.3333333333, ans=0.125 2023-10-14 08:06:14,803 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.74 vs. limit=22.5 2023-10-14 08:06:15,821 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.91 vs. 
limit=15.0 2023-10-14 08:06:18,976 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1645774.6666666667, ans=0.04949747468305833 2023-10-14 08:06:23,192 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1645774.6666666667, ans=0.04949747468305833 2023-10-14 08:06:24,846 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1645774.6666666667, ans=0.125 2023-10-14 08:06:26,152 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1645774.6666666667, ans=0.125 2023-10-14 08:06:52,954 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=1.273e-02 2023-10-14 08:07:04,093 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1645961.3333333333, ans=0.125 2023-10-14 08:07:06,689 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.788e+02 1.937e+02 2.194e+02 3.413e+02, threshold=3.875e+02, percent-clipped=0.0 2023-10-14 08:07:09,522 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1645961.3333333333, ans=0.2 2023-10-14 08:07:25,496 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1646008.0, ans=0.125 2023-10-14 08:07:31,170 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1646054.6666666667, ans=15.0 2023-10-14 08:07:38,181 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1646101.3333333333, ans=0.1 2023-10-14 08:08:06,836 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1646194.6666666667, ans=0.125 2023-10-14 08:08:27,967 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1646288.0, ans=0.1 2023-10-14 08:08:33,360 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1646334.6666666667, ans=0.0 2023-10-14 08:08:34,408 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.02 vs. 
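
The `WithLoss: name=..., loss-sum=...` lines report an auxiliary penalty attached to the self-attention weights; it is zero most of the time (loss-sum=0.000e+00) and only occasionally nonzero, as with the 1.273e-02 above. A hedged sketch of such a wrapper, with an illustrative penalty rather than the real one:

```python
import torch
import torch.nn as nn

class WithAuxLoss(nn.Module):
    """Sketch of a 'WithLoss'-style wrapper: forward is the identity,
    but a small penalty on out-of-bound activations is stashed so the
    training loop can add it to the objective and log its sum."""

    def __init__(self, name: str, limit: float = 25.0):
        super().__init__()
        self.name = name
        self.limit = limit
        self.aux_loss = None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Zero while activations stay inside the soft bound, hence the
        # frequent loss-sum=0.000e+00 entries.
        self.aux_loss = (x.abs() - self.limit).clamp(min=0.0).mean()
        return x

attn = WithAuxLoss("encoder.encoders.5...self_attn_weights")
_ = attn(torch.randn(4, 8, 16))
print(f"WithLoss: name={attn.name}, loss-sum={attn.aux_loss.item():.3e}")
```
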
limit=22.5 2023-10-14 08:08:50,435 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1646381.3333333333, ans=0.125 2023-10-14 08:08:59,573 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.863e+02 2.076e+02 2.399e+02 3.621e+02, threshold=4.152e+02, percent-clipped=0.0 2023-10-14 08:09:01,116 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1646428.0, ans=0.125 2023-10-14 08:09:09,941 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1646474.6666666667, ans=0.0 2023-10-14 08:09:17,090 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1646474.6666666667, ans=0.125 2023-10-14 08:09:23,950 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1646521.3333333333, ans=0.125 2023-10-14 08:09:39,045 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1646568.0, ans=0.0 2023-10-14 08:09:44,271 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.14 vs. limit=15.0 2023-10-14 08:10:03,470 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1646708.0, ans=0.125 2023-10-14 08:10:04,668 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.62 vs. limit=22.5 2023-10-14 08:10:10,595 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1646708.0, ans=0.2 2023-10-14 08:10:13,101 INFO [train.py:1031] (0/4) Epoch 26, batch 11500, loss[loss=0.191, simple_loss=0.2914, pruned_loss=0.04529, over 16870.00 frames. ], tot_loss[loss=0.1861, simple_loss=0.2785, pruned_loss=0.0469, over 32656078.92 frames. ], batch size: 188, lr: 1.31e-03, grad_scale: 32.0 2023-10-14 08:10:14,241 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1646754.6666666667, ans=0.125 2023-10-14 08:10:26,402 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1646801.3333333333, ans=0.125 2023-10-14 08:10:32,906 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.21 vs. limit=10.0 2023-10-14 08:10:48,118 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.572e+02 1.928e+02 2.084e+02 2.340e+02 3.068e+02, threshold=4.168e+02, percent-clipped=0.0 2023-10-14 08:11:11,141 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1646988.0, ans=0.125 2023-10-14 08:11:24,154 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1647034.6666666667, ans=0.125 2023-10-14 08:11:24,446 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.29 vs. 
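
Across the train-progress lines, grad_scale climbs 8.0 → 16.0 → 32.0 (batches 10500, 11000, 11500): standard fp16 dynamic loss scaling, where the scale doubles after a run of overflow-free steps and halves on overflow. A generic PyTorch sketch; the growth_interval is illustrative, not taken from train.py:

```python
import torch

scaler = torch.cuda.amp.GradScaler(
    init_scale=8.0,      # the value logged at batch 10500
    growth_factor=2.0,   # doubling, as observed every ~500 batches here
    backoff_factor=0.5,  # halve on overflow
    growth_interval=500,
)

# Inside the training loop:
#   with torch.cuda.amp.autocast():
#       loss = compute_loss(model, batch)
#   scaler.scale(loss).backward()
#   scaler.step(optimizer)
#   scaler.update()
#   current_grad_scale = scaler.get_scale()
```
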
limit=15.0 2023-10-14 08:11:36,200 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1647081.3333333333, ans=0.125 2023-10-14 08:11:41,100 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1647081.3333333333, ans=0.0 2023-10-14 08:12:10,673 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1647221.3333333333, ans=0.2 2023-10-14 08:12:15,525 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1647221.3333333333, ans=0.1 2023-10-14 08:12:34,605 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1647314.6666666667, ans=0.125 2023-10-14 08:12:36,454 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1647314.6666666667, ans=0.125 2023-10-14 08:12:45,924 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.725e+02 1.859e+02 2.083e+02 2.852e+02, threshold=3.718e+02, percent-clipped=0.0 2023-10-14 08:12:48,050 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1647361.3333333333, ans=0.0 2023-10-14 08:13:01,950 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1647408.0, ans=0.125 2023-10-14 08:13:07,327 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1647454.6666666667, ans=0.2 2023-10-14 08:13:09,104 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1647454.6666666667, ans=0.0 2023-10-14 08:13:30,128 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_na.min_abs, batch_count=1647548.0, ans=0.02 2023-10-14 08:13:43,285 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.49 vs. limit=15.0 2023-10-14 08:14:05,731 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.70 vs. 
limit=6.0 2023-10-14 08:14:25,702 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1647781.3333333333, ans=0.0 2023-10-14 08:14:30,843 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.516e+02 1.793e+02 1.928e+02 2.096e+02 2.709e+02, threshold=3.857e+02, percent-clipped=0.0 2023-10-14 08:15:17,190 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 08:15:20,016 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1647968.0, ans=0.0 2023-10-14 08:15:29,079 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1648014.6666666667, ans=0.1 2023-10-14 08:15:30,197 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1648014.6666666667, ans=0.2 2023-10-14 08:15:52,386 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.38 vs. limit=22.5 2023-10-14 08:16:02,771 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.31 vs. limit=6.0 2023-10-14 08:16:08,710 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1648154.6666666667, ans=0.1 2023-10-14 08:16:11,784 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1648201.3333333333, ans=0.125 2023-10-14 08:16:15,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1648201.3333333333, ans=0.0 2023-10-14 08:16:23,921 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1648248.0, ans=0.0 2023-10-14 08:16:28,250 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=1648248.0, ans=10.0 2023-10-14 08:16:36,315 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.62 vs. limit=22.5 2023-10-14 08:16:39,273 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.405e+02 1.781e+02 1.932e+02 2.171e+02 3.419e+02, threshold=3.864e+02, percent-clipped=0.0 2023-10-14 08:17:15,262 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1648434.6666666667, ans=0.125 2023-10-14 08:17:24,865 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1648481.3333333333, ans=0.125 2023-10-14 08:17:31,657 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.83 vs. 
limit=15.0 2023-10-14 08:17:45,226 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1648528.0, ans=0.0 2023-10-14 08:17:54,343 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1648574.6666666667, ans=0.0 2023-10-14 08:17:55,146 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1648574.6666666667, ans=0.1 2023-10-14 08:18:01,063 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1648621.3333333333, ans=0.0 2023-10-14 08:18:14,089 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.21 vs. limit=22.5 2023-10-14 08:18:17,350 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1648668.0, ans=0.0 2023-10-14 08:18:34,756 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.468e+02 1.804e+02 1.962e+02 2.291e+02 3.258e+02, threshold=3.924e+02, percent-clipped=0.0 2023-10-14 08:18:50,811 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.57 vs. limit=15.0 2023-10-14 08:18:56,173 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.25 vs. limit=15.0 2023-10-14 08:19:08,244 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1648901.3333333333, ans=0.0 2023-10-14 08:19:13,343 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.62 vs. limit=15.0 2023-10-14 08:19:15,863 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 08:19:19,439 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.63 vs. limit=15.0 2023-10-14 08:19:49,768 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1649041.3333333333, ans=0.1 2023-10-14 08:19:49,781 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 08:19:52,754 INFO [train.py:1031] (0/4) Epoch 26, batch 12000, loss[loss=0.1758, simple_loss=0.2764, pruned_loss=0.03763, over 16817.00 frames. ], tot_loss[loss=0.1861, simple_loss=0.2787, pruned_loss=0.04672, over 32694150.45 frames. ], batch size: 98, lr: 1.31e-03, grad_scale: 32.0 2023-10-14 08:19:59,900 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=6.55 vs. 
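
In the train-progress lines, tot_loss is averaged "over" a fractional, slowly growing frame count (32694150.45 frames at batch 12000), which points to a decayed frame-weighted accumulation rather than a plain sum. A sketch with a guessed decay constant:

```python
class RunningLoss:
    """Sketch of frame-weighted 'tot_loss' aggregation: each batch adds
    loss * frames, and old statistics decay so the average tracks
    recent training. The decay value is a guess."""

    def __init__(self, decay: float = 0.9995):
        self.decay = decay
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, loss: float, num_frames: float) -> None:
        self.loss_sum = self.decay * self.loss_sum + loss * num_frames
        self.frames = self.decay * self.frames + num_frames

    @property
    def value(self) -> float:
        return self.loss_sum / self.frames

tot = RunningLoss()
tot.update(0.1758, 16817.0)  # per-batch figures from batch 12000 above
```
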
limit=15.0 2023-10-14 08:20:07,529 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1649134.6666666667, ans=0.1 2023-10-14 08:20:28,880 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1649228.0, ans=0.1 2023-10-14 08:20:29,689 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1649228.0, ans=0.125 2023-10-14 08:20:31,303 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.829e+02 2.016e+02 2.256e+02 2.931e+02, threshold=4.032e+02, percent-clipped=0.0 2023-10-14 08:20:39,050 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 08:20:39,131 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1649274.6666666667, ans=0.0 2023-10-14 08:20:40,540 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.34 vs. limit=22.5 2023-10-14 08:21:15,913 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1649414.6666666667, ans=0.0 2023-10-14 08:21:20,814 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1649414.6666666667, ans=0.0 2023-10-14 08:21:23,082 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1649461.3333333333, ans=0.0 2023-10-14 08:21:26,091 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.56 vs. limit=15.0 2023-10-14 08:21:31,724 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.58 vs. limit=6.0 2023-10-14 08:21:36,085 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1649508.0, ans=10.0 2023-10-14 08:21:39,582 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.39 vs. limit=15.0 2023-10-14 08:21:48,155 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1649554.6666666667, ans=0.125 2023-10-14 08:22:02,716 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 08:22:02,741 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1649601.3333333333, ans=0.125 2023-10-14 08:22:04,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1649648.0, ans=0.1 2023-10-14 08:22:11,058 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.57 vs. 
limit=10.0 2023-10-14 08:22:19,529 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 1.761e+02 1.886e+02 2.075e+02 3.034e+02, threshold=3.771e+02, percent-clipped=0.0 2023-10-14 08:22:23,249 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.37 vs. limit=10.0 2023-10-14 08:22:25,809 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1649741.3333333333, ans=0.125 2023-10-14 08:22:32,452 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1649741.3333333333, ans=0.1 2023-10-14 08:22:42,081 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1649788.0, ans=0.125 2023-10-14 08:22:52,139 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1649834.6666666667, ans=0.0 2023-10-14 08:23:05,713 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1649881.3333333333, ans=0.125 2023-10-14 08:23:20,315 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.30 vs. limit=15.0 2023-10-14 08:23:23,182 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.32 vs. limit=22.5 2023-10-14 08:23:28,159 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1650021.3333333333, ans=0.125 2023-10-14 08:23:32,488 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1650021.3333333333, ans=0.1 2023-10-14 08:23:32,753 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.57 vs. limit=12.0 2023-10-14 08:23:35,506 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1650021.3333333333, ans=0.2 2023-10-14 08:24:04,201 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1650161.3333333333, ans=0.125 2023-10-14 08:24:07,766 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.560e+02 1.875e+02 2.070e+02 2.318e+02 3.312e+02, threshold=4.141e+02, percent-clipped=0.0 2023-10-14 08:24:08,916 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1650161.3333333333, ans=0.125 2023-10-14 08:24:15,454 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1650208.0, ans=0.125 2023-10-14 08:24:30,508 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.52 vs. limit=15.0 2023-10-14 08:24:41,211 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.71 vs. 
limit=22.5 2023-10-14 08:24:41,247 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.52 vs. limit=15.0 2023-10-14 08:25:35,860 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1650534.6666666667, ans=0.05 2023-10-14 08:25:48,815 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=5.18 vs. limit=15.0 2023-10-14 08:25:52,096 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1650628.0, ans=0.07 2023-10-14 08:25:55,662 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.797e+02 1.989e+02 2.228e+02 3.382e+02, threshold=3.979e+02, percent-clipped=0.0 2023-10-14 08:26:09,380 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.85 vs. limit=15.0 2023-10-14 08:26:19,365 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1650721.3333333333, ans=0.125 2023-10-14 08:26:20,939 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1650721.3333333333, ans=0.2 2023-10-14 08:26:31,311 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1650768.0, ans=0.125 2023-10-14 08:26:44,299 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.48 vs. limit=15.0 2023-10-14 08:26:49,063 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1650861.3333333333, ans=0.125 2023-10-14 08:26:59,480 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1650861.3333333333, ans=0.125 2023-10-14 08:27:13,482 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1650954.6666666667, ans=0.035 2023-10-14 08:27:26,806 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=6.99 vs. limit=15.0 2023-10-14 08:27:33,733 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.76 vs. limit=15.0 2023-10-14 08:27:42,687 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1651048.0, ans=0.1 2023-10-14 08:27:45,488 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.50 vs. 
limit=22.5 2023-10-14 08:27:48,917 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.623e+02 1.882e+02 2.150e+02 2.420e+02 3.466e+02, threshold=4.301e+02, percent-clipped=0.0 2023-10-14 08:28:08,994 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1651188.0, ans=15.0 2023-10-14 08:28:23,468 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1651234.6666666667, ans=0.1 2023-10-14 08:28:23,551 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1651234.6666666667, ans=0.125 2023-10-14 08:28:38,314 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1651281.3333333333, ans=0.1 2023-10-14 08:29:00,231 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1651374.6666666667, ans=0.1 2023-10-14 08:29:03,862 INFO [train.py:1031] (0/4) Epoch 26, batch 12500, loss[loss=0.2012, simple_loss=0.2922, pruned_loss=0.05504, over 16193.00 frames. ], tot_loss[loss=0.1861, simple_loss=0.2785, pruned_loss=0.04685, over 32700399.69 frames. ], batch size: 43, lr: 1.31e-03, grad_scale: 32.0 2023-10-14 08:29:04,248 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1651421.3333333333, ans=0.2 2023-10-14 08:29:05,391 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1651421.3333333333, ans=0.125 2023-10-14 08:29:10,864 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1651421.3333333333, ans=0.0 2023-10-14 08:29:25,399 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.64 vs. limit=12.0 2023-10-14 08:29:39,962 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.814e+02 2.010e+02 2.210e+02 3.033e+02, threshold=4.019e+02, percent-clipped=0.0 2023-10-14 08:29:44,869 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.79 vs. 
limit=22.5 2023-10-14 08:29:56,484 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1651654.6666666667, ans=0.0 2023-10-14 08:30:00,279 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1651654.6666666667, ans=0.125 2023-10-14 08:30:05,775 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1651701.3333333333, ans=0.1 2023-10-14 08:30:17,531 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1651748.0, ans=0.125 2023-10-14 08:30:21,210 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1651748.0, ans=0.1 2023-10-14 08:30:36,913 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1651794.6666666667, ans=0.125 2023-10-14 08:30:45,655 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1651841.3333333333, ans=0.2 2023-10-14 08:30:55,057 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1651888.0, ans=0.0 2023-10-14 08:30:56,015 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1651888.0, ans=0.125 2023-10-14 08:31:13,232 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1651981.3333333333, ans=0.0 2023-10-14 08:31:27,916 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.567e+02 1.851e+02 2.021e+02 2.328e+02 3.098e+02, threshold=4.043e+02, percent-clipped=0.0 2023-10-14 08:31:49,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1652121.3333333333, ans=0.125 2023-10-14 08:32:02,677 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1652168.0, ans=0.2 2023-10-14 08:32:13,040 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1652214.6666666667, ans=0.0 2023-10-14 08:32:18,751 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1652261.3333333333, ans=0.125 2023-10-14 08:32:52,464 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.24 vs. limit=15.0 2023-10-14 08:32:54,052 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1652401.3333333333, ans=0.125 2023-10-14 08:32:56,043 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1652401.3333333333, ans=0.0 2023-10-14 08:33:02,873 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.12 vs. 
limit=22.5 2023-10-14 08:33:13,446 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.510e+02 1.796e+02 1.910e+02 2.080e+02 3.328e+02, threshold=3.819e+02, percent-clipped=0.0 2023-10-14 08:33:17,054 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1652494.6666666667, ans=0.0 2023-10-14 08:33:17,098 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1652494.6666666667, ans=0.125 2023-10-14 08:34:35,138 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1652821.3333333333, ans=0.125 2023-10-14 08:34:49,116 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1652914.6666666667, ans=0.0 2023-10-14 08:34:50,259 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1652914.6666666667, ans=0.125 2023-10-14 08:35:01,069 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1652961.3333333333, ans=0.125 2023-10-14 08:35:01,614 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.539e+02 1.823e+02 1.949e+02 2.101e+02 3.083e+02, threshold=3.899e+02, percent-clipped=0.0 2023-10-14 08:35:23,936 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1653054.6666666667, ans=0.0 2023-10-14 08:35:23,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1653054.6666666667, ans=0.09899494936611666 2023-10-14 08:35:24,048 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1653054.6666666667, ans=0.125 2023-10-14 08:35:36,005 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.27 vs. limit=15.0 2023-10-14 08:35:52,125 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1653148.0, ans=0.1 2023-10-14 08:35:52,178 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1653148.0, ans=0.2 2023-10-14 08:36:13,777 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.32 vs. limit=15.0 2023-10-14 08:36:16,238 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1653241.3333333333, ans=0.125 2023-10-14 08:36:19,862 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.97 vs. limit=15.0 2023-10-14 08:36:27,995 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.59 vs. 
limit=22.5 2023-10-14 08:36:34,395 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1653334.6666666667, ans=0.0 2023-10-14 08:36:41,066 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=1653381.3333333333, ans=0.5 2023-10-14 08:36:41,127 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1653381.3333333333, ans=0.2 2023-10-14 08:36:53,577 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.610e+02 1.831e+02 2.006e+02 2.265e+02 3.278e+02, threshold=4.012e+02, percent-clipped=0.0 2023-10-14 08:37:08,321 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.58 vs. limit=6.0 2023-10-14 08:37:10,986 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.30 vs. limit=12.0 2023-10-14 08:37:30,502 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=5.07 vs. limit=12.0 2023-10-14 08:37:44,575 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.16 vs. limit=15.0 2023-10-14 08:37:48,791 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1653661.3333333333, ans=0.125 2023-10-14 08:37:48,861 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1653661.3333333333, ans=0.0 2023-10-14 08:37:54,019 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1653708.0, ans=0.1 2023-10-14 08:37:57,819 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1653708.0, ans=0.2 2023-10-14 08:38:02,093 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.17 vs. limit=15.0 2023-10-14 08:38:03,330 INFO [train.py:1031] (0/4) Epoch 26, batch 13000, loss[loss=0.177, simple_loss=0.2626, pruned_loss=0.0457, over 16017.00 frames. ], tot_loss[loss=0.1864, simple_loss=0.2791, pruned_loss=0.04683, over 32734489.70 frames. ], batch size: 43, lr: 1.31e-03, grad_scale: 32.0 2023-10-14 08:38:03,533 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1653754.6666666667, ans=0.125 2023-10-14 08:38:45,414 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.571e+02 1.852e+02 2.021e+02 2.157e+02 2.791e+02, threshold=4.043e+02, percent-clipped=0.0 2023-10-14 08:39:27,351 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.82 vs. 
limit=10.0 2023-10-14 08:39:37,618 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1654081.3333333333, ans=0.125 2023-10-14 08:39:38,857 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1654081.3333333333, ans=0.0 2023-10-14 08:39:49,107 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1654128.0, ans=0.2 2023-10-14 08:40:03,547 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 08:40:09,702 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 08:40:32,023 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1654314.6666666667, ans=0.1 2023-10-14 08:40:33,374 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.46 vs. limit=10.0 2023-10-14 08:40:40,378 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.594e+02 1.836e+02 1.989e+02 2.242e+02 2.804e+02, threshold=3.978e+02, percent-clipped=0.0 2023-10-14 08:40:44,854 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1654361.3333333333, ans=0.125 2023-10-14 08:40:56,297 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1654454.6666666667, ans=0.125 2023-10-14 08:41:01,991 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.47 vs. 
limit=15.0 2023-10-14 08:41:04,904 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1654454.6666666667, ans=0.125 2023-10-14 08:41:08,010 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1654501.3333333333, ans=0.2 2023-10-14 08:41:14,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=1654501.3333333333, ans=0.5 2023-10-14 08:41:16,283 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1654501.3333333333, ans=0.0 2023-10-14 08:41:19,377 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1654548.0, ans=0.125 2023-10-14 08:41:28,695 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1654594.6666666667, ans=0.5 2023-10-14 08:42:05,200 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1654688.0, ans=0.0 2023-10-14 08:42:25,146 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1654781.3333333333, ans=0.0 2023-10-14 08:42:25,233 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1654781.3333333333, ans=0.125 2023-10-14 08:42:30,706 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1654828.0, ans=0.2 2023-10-14 08:42:34,037 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.386e+02 1.793e+02 1.931e+02 2.135e+02 2.785e+02, threshold=3.863e+02, percent-clipped=0.0 2023-10-14 08:42:47,507 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1654874.6666666667, ans=0.0 2023-10-14 08:43:04,742 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1654968.0, ans=0.125 2023-10-14 08:43:09,502 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1654968.0, ans=0.2 2023-10-14 08:44:13,255 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1655248.0, ans=0.2 2023-10-14 08:44:24,446 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.499e+02 1.872e+02 2.006e+02 2.303e+02 3.317e+02, threshold=4.012e+02, percent-clipped=0.0 2023-10-14 08:44:37,609 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 08:44:45,197 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1655388.0, ans=0.015 2023-10-14 08:44:56,607 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.56 vs. limit=12.0 2023-10-14 08:44:58,783 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.16 vs. 
2023-10-14 08:45:39,711 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1655621.3333333333, ans=0.0
2023-10-14 08:46:00,179 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1655714.6666666667, ans=0.125
2023-10-14 08:46:11,825 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1655761.3333333333, ans=0.2
2023-10-14 08:46:12,572 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.520e+02 1.814e+02 1.981e+02 2.199e+02 3.897e+02, threshold=3.963e+02, percent-clipped=0.0
2023-10-14 08:46:22,410 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1655808.0, ans=0.125
2023-10-14 08:46:25,108 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-14 08:47:19,773 INFO [train.py:1031] (0/4) Epoch 26, batch 13500, loss[loss=0.1875, simple_loss=0.282, pruned_loss=0.04655, over 16902.00 frames. ], tot_loss[loss=0.1857, simple_loss=0.2782, pruned_loss=0.04657, over 32752398.36 frames. ], batch size: 110, lr: 1.31e-03, grad_scale: 16.0
2023-10-14 08:47:33,840 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1656134.6666666667, ans=0.0
2023-10-14 08:48:02,251 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.474e+02 1.752e+02 1.893e+02 2.130e+02 2.733e+02, threshold=3.785e+02, percent-clipped=0.0
2023-10-14 08:48:02,607 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-14 08:48:13,416 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1656274.6666666667, ans=0.1
2023-10-14 08:48:21,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1656321.3333333333, ans=0.125
2023-10-14 08:48:45,489 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.62 vs. limit=15.0
2023-10-14 08:48:50,598 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1656461.3333333333, ans=0.2
2023-10-14 08:49:21,060 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1656601.3333333333, ans=0.125
2023-10-14 08:49:39,476 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.27 vs.
limit=15.0 2023-10-14 08:49:42,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1656694.6666666667, ans=0.1 2023-10-14 08:49:45,777 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.508e+02 1.847e+02 1.971e+02 2.185e+02 3.851e+02, threshold=3.942e+02, percent-clipped=1.0 2023-10-14 08:49:46,851 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1656694.6666666667, ans=0.1 2023-10-14 08:49:48,360 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1656741.3333333333, ans=0.125 2023-10-14 08:49:57,253 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.23 vs. limit=22.5 2023-10-14 08:50:01,579 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/epoch-26.pt 2023-10-14 08:50:28,520 INFO [train.py:1031] (0/4) Epoch 27, batch 0, loss[loss=0.1807, simple_loss=0.2701, pruned_loss=0.04561, over 16848.00 frames. ], tot_loss[loss=0.1807, simple_loss=0.2701, pruned_loss=0.04561, over 16848.00 frames. ], batch size: 72, lr: 1.28e-03, grad_scale: 32.0 2023-10-14 08:50:28,521 INFO [train.py:1054] (0/4) Computing validation loss 2023-10-14 08:50:35,194 INFO [zipformer.py:1853] (0/4) name=encoder.encoders.0.layers.1.self_attn_weights, attn_weights_entropy = tensor([4.3225, 3.8771, 4.2102, 3.8048], device='cuda:0') 2023-10-14 08:50:36,531 INFO [train.py:1063] (0/4) Epoch 27, validation: loss=0.2135, simple_loss=0.2999, pruned_loss=0.06353, over 1020973.00 frames. 2023-10-14 08:50:36,531 INFO [train.py:1064] (0/4) Maximum memory allocated so far is 17165MB 2023-10-14 08:50:46,269 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 08:51:06,145 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.14 vs. 
limit=22.5 2023-10-14 08:51:08,721 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1656904.6666666667, ans=0.0 2023-10-14 08:51:09,657 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1656904.6666666667, ans=0.125 2023-10-14 08:51:12,662 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1656951.3333333333, ans=0.125 2023-10-14 08:51:13,817 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1656951.3333333333, ans=0.0 2023-10-14 08:51:41,625 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1657044.6666666667, ans=0.0 2023-10-14 08:51:53,080 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1657091.3333333333, ans=0.125 2023-10-14 08:52:07,191 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.448e+02 1.793e+02 1.961e+02 2.210e+02 4.315e+02, threshold=3.921e+02, percent-clipped=1.0 2023-10-14 08:52:43,961 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1657324.6666666667, ans=0.125 2023-10-14 08:52:47,684 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1657371.3333333333, ans=0.0 2023-10-14 08:52:59,334 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1657418.0, ans=0.0 2023-10-14 08:53:09,465 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1657418.0, ans=0.125 2023-10-14 08:53:19,405 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1657464.6666666667, ans=0.125 2023-10-14 08:53:21,459 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-14 08:53:29,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1657511.3333333333, ans=0.1 2023-10-14 08:53:41,607 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 08:53:47,166 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1657604.6666666667, ans=0.125 2023-10-14 08:53:57,069 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.524e+02 1.812e+02 1.998e+02 2.247e+02 3.365e+02, threshold=3.995e+02, percent-clipped=0.0 2023-10-14 08:54:01,714 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=1657651.3333333333, ans=0.5 2023-10-14 08:54:07,447 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.45 vs. limit=12.0 2023-10-14 08:54:33,572 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1657838.0, ans=0.0 2023-10-14 08:54:35,661 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.95 vs. 
limit=22.5 2023-10-14 08:54:48,512 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 08:55:07,745 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1657931.3333333333, ans=0.09899494936611666 2023-10-14 08:55:28,687 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1658024.6666666667, ans=0.2 2023-10-14 08:55:44,477 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1658118.0, ans=0.0 2023-10-14 08:55:47,183 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.534e+02 1.818e+02 1.993e+02 2.271e+02 3.593e+02, threshold=3.987e+02, percent-clipped=0.0 2023-10-14 08:55:47,633 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1658118.0, ans=0.2 2023-10-14 08:55:47,916 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.63 vs. limit=15.0 2023-10-14 08:55:53,470 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 08:55:56,531 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1658164.6666666667, ans=0.125 2023-10-14 08:55:58,875 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1658164.6666666667, ans=0.125 2023-10-14 08:56:05,620 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1658211.3333333333, ans=0.2 2023-10-14 08:56:09,600 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.12 vs. 
limit=15.0 2023-10-14 08:56:26,305 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1658304.6666666667, ans=0.07 2023-10-14 08:56:34,664 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1658304.6666666667, ans=0.1 2023-10-14 08:56:38,853 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1658351.3333333333, ans=0.0 2023-10-14 08:56:51,483 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1658398.0, ans=0.05 2023-10-14 08:56:52,346 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1658398.0, ans=0.05 2023-10-14 08:57:19,508 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1658538.0, ans=0.125 2023-10-14 08:57:22,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1658538.0, ans=0.0 2023-10-14 08:57:23,005 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1658538.0, ans=0.1 2023-10-14 08:57:33,154 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.474e+02 1.869e+02 2.061e+02 2.333e+02 3.241e+02, threshold=4.121e+02, percent-clipped=0.0 2023-10-14 08:57:34,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1658584.6666666667, ans=0.125 2023-10-14 08:57:39,929 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1658631.3333333333, ans=0.1 2023-10-14 08:57:46,806 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1658631.3333333333, ans=0.125 2023-10-14 08:57:49,487 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1658678.0, ans=0.0 2023-10-14 08:57:54,247 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1658678.0, ans=10.0 2023-10-14 08:58:23,552 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1658818.0, ans=0.2 2023-10-14 08:58:25,454 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1658818.0, ans=0.125 2023-10-14 08:58:33,026 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1658864.6666666667, ans=0.125 2023-10-14 08:58:49,965 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.90 vs. 
limit=15.0 2023-10-14 08:58:52,098 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1658911.3333333333, ans=0.125 2023-10-14 08:59:10,450 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1659004.6666666667, ans=0.04949747468305833 2023-10-14 08:59:22,452 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.637e+02 1.839e+02 2.011e+02 2.290e+02 3.044e+02, threshold=4.022e+02, percent-clipped=0.0 2023-10-14 08:59:24,257 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.91 vs. limit=15.0 2023-10-14 08:59:36,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1659098.0, ans=0.1 2023-10-14 08:59:41,457 INFO [train.py:1031] (0/4) Epoch 27, batch 500, loss[loss=0.1794, simple_loss=0.268, pruned_loss=0.04542, over 16809.00 frames. ], tot_loss[loss=0.1869, simple_loss=0.2786, pruned_loss=0.04756, over 7261532.44 frames. ], batch size: 146, lr: 1.28e-03, grad_scale: 32.0 2023-10-14 08:59:41,759 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1659144.6666666667, ans=0.0 2023-10-14 08:59:48,909 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1659144.6666666667, ans=0.0 2023-10-14 09:00:04,877 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1659238.0, ans=0.2 2023-10-14 09:00:07,991 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1659238.0, ans=0.125 2023-10-14 09:00:08,041 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1659238.0, ans=0.0 2023-10-14 09:00:08,044 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=1659238.0, ans=0.95 2023-10-14 09:00:12,617 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1659238.0, ans=0.125 2023-10-14 09:00:16,344 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1659284.6666666667, ans=0.05 2023-10-14 09:00:27,356 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1659331.3333333333, ans=0.125 2023-10-14 09:00:29,467 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.20 vs. limit=15.0 2023-10-14 09:00:44,583 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1659378.0, ans=0.125 2023-10-14 09:00:56,364 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.97 vs. 
limit=12.0 2023-10-14 09:01:00,928 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 09:01:02,965 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1659471.3333333333, ans=0.0 2023-10-14 09:01:14,634 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.627e+02 1.857e+02 2.069e+02 2.294e+02 3.025e+02, threshold=4.138e+02, percent-clipped=0.0 2023-10-14 09:01:21,994 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1659564.6666666667, ans=0.125 2023-10-14 09:01:29,832 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.69 vs. limit=22.5 2023-10-14 09:01:58,152 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1659704.6666666667, ans=0.125 2023-10-14 09:02:00,121 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1659704.6666666667, ans=0.1 2023-10-14 09:02:33,161 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1659844.6666666667, ans=0.0 2023-10-14 09:02:35,621 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1659844.6666666667, ans=0.125 2023-10-14 09:02:54,674 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1659938.0, ans=0.09899494936611666 2023-10-14 09:03:03,974 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1659984.6666666667, ans=0.125 2023-10-14 09:03:05,496 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.604e+02 1.823e+02 1.971e+02 2.185e+02 2.882e+02, threshold=3.943e+02, percent-clipped=0.0 2023-10-14 09:03:26,156 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1660078.0, ans=0.0 2023-10-14 09:04:49,672 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1660404.6666666667, ans=0.125 2023-10-14 09:04:57,307 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.641e+02 1.837e+02 2.036e+02 2.228e+02 3.264e+02, threshold=4.073e+02, percent-clipped=0.0 2023-10-14 09:04:58,765 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 09:05:15,395 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.whiten.whitening_limit, batch_count=1660544.6666666667, ans=12.0 2023-10-14 09:05:23,680 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1660544.6666666667, ans=0.125 2023-10-14 09:05:43,211 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1660638.0, ans=0.125 2023-10-14 09:05:48,427 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1660638.0, ans=0.125 2023-10-14 09:05:58,481 INFO [scaling.py:199] (0/4) 
ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1660684.6666666667, ans=0.1 2023-10-14 09:06:01,740 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.19 vs. limit=22.5 2023-10-14 09:06:08,254 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1660731.3333333333, ans=0.125 2023-10-14 09:06:19,719 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1660778.0, ans=0.2 2023-10-14 09:06:22,884 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1660778.0, ans=0.1 2023-10-14 09:06:26,980 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1660824.6666666667, ans=0.125 2023-10-14 09:06:44,285 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1660871.3333333333, ans=0.125 2023-10-14 09:06:52,028 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.565e+02 1.834e+02 1.973e+02 2.183e+02 3.067e+02, threshold=3.946e+02, percent-clipped=0.0 2023-10-14 09:07:17,531 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1661058.0, ans=0.2 2023-10-14 09:07:24,293 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=3.59 vs. limit=15.0 2023-10-14 09:07:26,905 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.77 vs. limit=15.0 2023-10-14 09:07:28,905 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1661104.6666666667, ans=0.125 2023-10-14 09:07:30,862 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1661104.6666666667, ans=0.0 2023-10-14 09:07:39,006 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.78 vs. limit=10.0 2023-10-14 09:07:40,304 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1661151.3333333333, ans=0.125 2023-10-14 09:07:42,171 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1661151.3333333333, ans=0.07 2023-10-14 09:07:45,724 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1661151.3333333333, ans=0.1 2023-10-14 09:08:01,322 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1661198.0, ans=0.1 2023-10-14 09:08:15,822 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.63 vs. 
limit=15.0 2023-10-14 09:08:32,972 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1661338.0, ans=0.1 2023-10-14 09:08:41,936 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1661384.6666666667, ans=0.125 2023-10-14 09:08:43,668 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.560e+02 1.768e+02 1.895e+02 2.075e+02 2.674e+02, threshold=3.790e+02, percent-clipped=0.0 2023-10-14 09:08:53,493 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1661431.3333333333, ans=0.125 2023-10-14 09:08:59,886 INFO [train.py:1031] (0/4) Epoch 27, batch 1000, loss[loss=0.1686, simple_loss=0.2719, pruned_loss=0.03271, over 16974.00 frames. ], tot_loss[loss=0.1875, simple_loss=0.2797, pruned_loss=0.04766, over 12904317.52 frames. ], batch size: 93, lr: 1.28e-03, grad_scale: 16.0 2023-10-14 09:09:08,237 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1661478.0, ans=0.125 2023-10-14 09:09:20,363 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=1661571.3333333333, ans=0.02 2023-10-14 09:09:30,667 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1661618.0, ans=0.125 2023-10-14 09:09:37,356 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1661618.0, ans=0.0 2023-10-14 09:09:44,020 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1661664.6666666667, ans=0.125 2023-10-14 09:09:48,860 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1661664.6666666667, ans=0.1 2023-10-14 09:10:26,375 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1661851.3333333333, ans=0.2 2023-10-14 09:10:28,081 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.568e+02 1.892e+02 2.081e+02 2.315e+02 3.226e+02, threshold=4.161e+02, percent-clipped=0.0 2023-10-14 09:10:32,165 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1661898.0, ans=0.0 2023-10-14 09:10:32,223 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1661898.0, ans=0.05 2023-10-14 09:10:47,198 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1661944.6666666667, ans=0.125 2023-10-14 09:10:51,633 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten.whitening_limit, batch_count=1661944.6666666667, ans=22.5 2023-10-14 09:11:13,025 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=1662038.0, ans=0.05 2023-10-14 09:11:38,869 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.60 vs. 
limit=22.5 2023-10-14 09:12:08,166 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1662224.6666666667, ans=0.95 2023-10-14 09:12:20,042 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.77 vs. limit=10.0 2023-10-14 09:12:20,105 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.03 vs. limit=15.0 2023-10-14 09:12:21,948 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1662271.3333333333, ans=0.1 2023-10-14 09:12:24,030 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=4.80 vs. limit=12.0 2023-10-14 09:12:30,152 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.61 vs. limit=15.0 2023-10-14 09:12:30,508 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.750e+02 1.885e+02 2.174e+02 3.228e+02, threshold=3.770e+02, percent-clipped=0.0 2023-10-14 09:12:41,097 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1662364.6666666667, ans=0.125 2023-10-14 09:12:43,100 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1662364.6666666667, ans=0.125 2023-10-14 09:12:43,454 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.07 vs. limit=10.0 2023-10-14 09:12:50,816 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1662411.3333333333, ans=0.2 2023-10-14 09:12:56,433 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1662458.0, ans=0.1 2023-10-14 09:12:58,260 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1662458.0, ans=0.125 2023-10-14 09:13:00,565 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.61 vs. limit=15.0 2023-10-14 09:13:16,215 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1662504.6666666667, ans=0.1 2023-10-14 09:13:28,011 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1662598.0, ans=0.125 2023-10-14 09:13:43,456 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1662644.6666666667, ans=0.125 2023-10-14 09:13:44,639 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.29 vs. 
limit=22.5 2023-10-14 09:13:48,968 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1662644.6666666667, ans=0.1 2023-10-14 09:13:58,520 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1662691.3333333333, ans=0.0 2023-10-14 09:14:01,181 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1662738.0, ans=0.125 2023-10-14 09:14:03,713 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1662738.0, ans=0.125 2023-10-14 09:14:11,313 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1662784.6666666667, ans=0.125 2023-10-14 09:14:13,447 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.40 vs. limit=12.0 2023-10-14 09:14:15,656 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1662784.6666666667, ans=0.07 2023-10-14 09:14:16,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1662784.6666666667, ans=0.125 2023-10-14 09:14:17,304 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.479e+02 1.811e+02 2.049e+02 2.442e+02 3.152e+02, threshold=4.099e+02, percent-clipped=0.0 2023-10-14 09:14:17,886 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.90 vs. limit=22.5 2023-10-14 09:14:39,067 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.16 vs. limit=6.0 2023-10-14 09:14:52,772 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1662971.3333333333, ans=0.125 2023-10-14 09:15:19,101 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1663064.6666666667, ans=0.1 2023-10-14 09:15:22,300 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1663064.6666666667, ans=0.0 2023-10-14 09:15:23,431 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1663064.6666666667, ans=0.2 2023-10-14 09:15:24,372 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1663064.6666666667, ans=0.2 2023-10-14 09:15:27,333 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1663111.3333333333, ans=0.2 2023-10-14 09:15:28,405 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.86 vs. limit=10.0 2023-10-14 09:15:30,334 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.89 vs. limit=22.5 2023-10-14 09:15:42,551 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.02 vs. 
limit=15.0 2023-10-14 09:15:42,977 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1663158.0, ans=0.125 2023-10-14 09:15:54,413 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.15 vs. limit=15.0 2023-10-14 09:16:04,031 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.527e+02 1.799e+02 1.939e+02 2.088e+02 3.320e+02, threshold=3.879e+02, percent-clipped=0.0 2023-10-14 09:16:38,023 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1663391.3333333333, ans=0.125 2023-10-14 09:16:55,880 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.00 vs. limit=6.0 2023-10-14 09:17:04,034 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff3.min_abs, batch_count=1663484.6666666667, ans=0.2 2023-10-14 09:17:17,779 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1663578.0, ans=0.05 2023-10-14 09:17:39,704 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1663671.3333333333, ans=0.025 2023-10-14 09:17:48,532 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1663718.0, ans=0.125 2023-10-14 09:17:53,964 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.754e+02 1.959e+02 2.118e+02 3.199e+02, threshold=3.918e+02, percent-clipped=0.0 2023-10-14 09:18:13,330 INFO [train.py:1031] (0/4) Epoch 27, batch 1500, loss[loss=0.2027, simple_loss=0.2809, pruned_loss=0.06224, over 15681.00 frames. ], tot_loss[loss=0.1859, simple_loss=0.2781, pruned_loss=0.04682, over 17287568.61 frames. ], batch size: 350, lr: 1.28e-03, grad_scale: 32.0 2023-10-14 09:18:22,081 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1663858.0, ans=0.0 2023-10-14 09:18:49,492 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1663951.3333333333, ans=0.125 2023-10-14 09:18:54,167 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1663998.0, ans=0.07 2023-10-14 09:19:04,663 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1664044.6666666667, ans=0.1 2023-10-14 09:19:16,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1664091.3333333333, ans=0.035 2023-10-14 09:19:27,299 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1664091.3333333333, ans=0.1 2023-10-14 09:19:30,906 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1664138.0, ans=0.025 2023-10-14 09:19:34,675 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.55 vs. 
limit=15.0 2023-10-14 09:19:45,059 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.544e+02 1.782e+02 1.908e+02 2.102e+02 2.972e+02, threshold=3.817e+02, percent-clipped=0.0 2023-10-14 09:19:56,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1664231.3333333333, ans=0.125 2023-10-14 09:19:59,952 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1664278.0, ans=0.1 2023-10-14 09:20:02,017 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1664278.0, ans=0.125 2023-10-14 09:20:03,607 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.11 vs. limit=15.0 2023-10-14 09:20:05,530 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.15 vs. limit=15.0 2023-10-14 09:20:06,502 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.52 vs. limit=12.0 2023-10-14 09:20:11,073 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1664278.0, ans=0.2 2023-10-14 09:20:14,985 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1664324.6666666667, ans=0.125 2023-10-14 09:20:15,011 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1664324.6666666667, ans=0.5 2023-10-14 09:20:19,655 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.75 vs. limit=12.0 2023-10-14 09:20:52,934 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1664464.6666666667, ans=0.125 2023-10-14 09:21:05,316 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1664511.3333333333, ans=0.125 2023-10-14 09:21:05,395 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1664511.3333333333, ans=0.125 2023-10-14 09:21:14,580 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.64 vs. limit=12.0 2023-10-14 09:21:15,133 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1664558.0, ans=0.0 2023-10-14 09:21:16,604 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.75 vs. limit=12.0 2023-10-14 09:21:28,193 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1664604.6666666667, ans=0.125 2023-10-14 09:21:32,904 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=5.48 vs. 
limit=15.0 2023-10-14 09:21:38,416 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.775e+02 1.904e+02 2.092e+02 2.505e+02, threshold=3.809e+02, percent-clipped=0.0 2023-10-14 09:21:51,906 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.68 vs. limit=6.0 2023-10-14 09:21:55,488 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1664744.6666666667, ans=0.125 2023-10-14 09:22:17,305 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1664838.0, ans=0.125 2023-10-14 09:22:19,947 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1664838.0, ans=0.0 2023-10-14 09:22:21,645 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1664838.0, ans=0.125 2023-10-14 09:22:25,998 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1664884.6666666667, ans=0.0 2023-10-14 09:22:27,153 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1664884.6666666667, ans=0.125 2023-10-14 09:22:28,043 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1664884.6666666667, ans=0.2 2023-10-14 09:22:42,643 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1664931.3333333333, ans=0.125 2023-10-14 09:22:43,425 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1664931.3333333333, ans=0.125 2023-10-14 09:23:27,740 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.563e+02 1.840e+02 2.087e+02 2.357e+02 2.992e+02, threshold=4.173e+02, percent-clipped=0.0 2023-10-14 09:23:51,783 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1665211.3333333333, ans=0.1 2023-10-14 09:23:56,870 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1665258.0, ans=0.2 2023-10-14 09:24:09,972 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1665304.6666666667, ans=0.0 2023-10-14 09:24:44,605 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1665444.6666666667, ans=0.1 2023-10-14 09:24:47,144 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.85 vs. 
limit=15.0 2023-10-14 09:24:48,016 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1665444.6666666667, ans=0.125 2023-10-14 09:25:19,442 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.555e+02 1.835e+02 1.988e+02 2.209e+02 3.411e+02, threshold=3.977e+02, percent-clipped=0.0 2023-10-14 09:25:38,960 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1665678.0, ans=0.2 2023-10-14 09:26:08,578 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.77 vs. limit=15.0 2023-10-14 09:26:09,167 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1665818.0, ans=0.2 2023-10-14 09:26:25,899 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.86 vs. limit=15.0 2023-10-14 09:26:44,497 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1665911.3333333333, ans=0.2 2023-10-14 09:26:55,622 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1665958.0, ans=0.125 2023-10-14 09:27:04,452 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1666004.6666666667, ans=0.125 2023-10-14 09:27:11,106 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1666004.6666666667, ans=0.125 2023-10-14 09:27:18,657 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1666051.3333333333, ans=0.1 2023-10-14 09:27:21,748 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.823e+02 1.929e+02 2.193e+02 4.546e+02, threshold=3.858e+02, percent-clipped=1.0 2023-10-14 09:27:22,090 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1666051.3333333333, ans=0.125 2023-10-14 09:27:38,925 INFO [train.py:1031] (0/4) Epoch 27, batch 2000, loss[loss=0.2105, simple_loss=0.2719, pruned_loss=0.07457, over 12488.00 frames. ], tot_loss[loss=0.1863, simple_loss=0.2788, pruned_loss=0.04686, over 20747082.56 frames. ], batch size: 440, lr: 1.28e-03, grad_scale: 32.0 2023-10-14 09:27:44,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1666144.6666666667, ans=0.0 2023-10-14 09:27:52,919 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1666191.3333333333, ans=0.0 2023-10-14 09:28:03,829 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 09:28:10,623 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1666238.0, ans=0.1 2023-10-14 09:28:19,517 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.64 vs. 
limit=22.5 2023-10-14 09:28:27,916 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1666284.6666666667, ans=0.125 2023-10-14 09:28:37,910 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1666331.3333333333, ans=0.0 2023-10-14 09:28:56,863 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1666424.6666666667, ans=0.125 2023-10-14 09:29:12,061 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1666471.3333333333, ans=0.0 2023-10-14 09:29:15,962 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1666471.3333333333, ans=0.2 2023-10-14 09:29:17,598 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1666471.3333333333, ans=0.125 2023-10-14 09:29:23,247 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1666518.0, ans=0.2 2023-10-14 09:29:26,975 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.544e+02 1.819e+02 2.043e+02 2.265e+02 3.390e+02, threshold=4.087e+02, percent-clipped=0.0 2023-10-14 09:29:54,731 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1666611.3333333333, ans=0.1 2023-10-14 09:30:00,879 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.55 vs. limit=6.0 2023-10-14 09:30:06,087 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.04 vs. limit=15.0 2023-10-14 09:30:31,606 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1666704.6666666667, ans=0.0 2023-10-14 09:30:42,337 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.70 vs. 
limit=10.0 2023-10-14 09:31:01,528 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1666798.0, ans=0.07 2023-10-14 09:31:03,893 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1666844.6666666667, ans=0.125 2023-10-14 09:31:21,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1666891.3333333333, ans=0.125 2023-10-14 09:31:21,156 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1666891.3333333333, ans=0.0 2023-10-14 09:31:47,192 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.834e+02 2.017e+02 2.244e+02 2.925e+02, threshold=4.034e+02, percent-clipped=0.0 2023-10-14 09:31:53,215 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1667031.3333333333, ans=0.1 2023-10-14 09:32:35,746 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1667218.0, ans=0.0 2023-10-14 09:32:37,630 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1667218.0, ans=0.0 2023-10-14 09:32:46,326 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1667264.6666666667, ans=0.0 2023-10-14 09:32:49,157 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.04 vs. limit=15.0 2023-10-14 09:32:59,172 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1667311.3333333333, ans=0.0 2023-10-14 09:33:01,071 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.781e-02 2023-10-14 09:33:13,199 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1667358.0, ans=0.125 2023-10-14 09:33:16,557 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1667404.6666666667, ans=0.125 2023-10-14 09:33:21,490 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1667404.6666666667, ans=0.025 2023-10-14 09:33:34,740 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1667451.3333333333, ans=0.125 2023-10-14 09:33:36,302 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.545e+02 1.867e+02 2.008e+02 2.224e+02 2.702e+02, threshold=4.015e+02, percent-clipped=0.0 2023-10-14 09:34:12,560 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.76 vs. 
limit=15.0 2023-10-14 09:34:17,735 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1667638.0, ans=0.2 2023-10-14 09:34:24,841 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 09:34:25,133 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.72 vs. limit=15.0 2023-10-14 09:34:32,891 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1667731.3333333333, ans=0.0 2023-10-14 09:34:40,501 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.19 vs. limit=6.0 2023-10-14 09:34:49,946 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1667778.0, ans=0.05 2023-10-14 09:35:01,033 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1667824.6666666667, ans=0.2 2023-10-14 09:35:12,112 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1667871.3333333333, ans=0.09899494936611666 2023-10-14 09:35:14,631 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1667871.3333333333, ans=0.125 2023-10-14 09:35:16,780 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.35 vs. limit=10.0 2023-10-14 09:35:20,749 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 09:35:21,991 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.20 vs. limit=6.0 2023-10-14 09:35:26,660 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.619e+02 1.936e+02 2.103e+02 2.370e+02 2.947e+02, threshold=4.207e+02, percent-clipped=0.0 2023-10-14 09:35:44,163 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1668011.3333333333, ans=0.0 2023-10-14 09:35:57,624 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1668058.0, ans=0.0 2023-10-14 09:36:00,492 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1668104.6666666667, ans=0.125 2023-10-14 09:37:09,446 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1668384.6666666667, ans=0.0 2023-10-14 09:37:12,110 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.599e+02 1.855e+02 2.058e+02 2.310e+02 2.908e+02, threshold=4.116e+02, percent-clipped=0.0 2023-10-14 09:37:17,814 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1668431.3333333333, ans=0.0 2023-10-14 09:37:18,276 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.90 vs. 
limit=22.5 2023-10-14 09:37:21,665 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1668431.3333333333, ans=0.0 2023-10-14 09:37:22,049 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.41 vs. limit=10.0 2023-10-14 09:37:24,188 INFO [train.py:1031] (0/4) Epoch 27, batch 2500, loss[loss=0.1781, simple_loss=0.2766, pruned_loss=0.03982, over 16882.00 frames. ], tot_loss[loss=0.1864, simple_loss=0.2789, pruned_loss=0.04697, over 23397917.90 frames. ], batch size: 104, lr: 1.28e-03, grad_scale: 16.0 2023-10-14 09:37:25,688 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.60 vs. limit=6.0 2023-10-14 09:37:34,246 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_ff3.min_abs, batch_count=1668524.6666666667, ans=0.2 2023-10-14 09:37:39,370 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1668524.6666666667, ans=0.125 2023-10-14 09:37:43,908 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1668571.3333333333, ans=0.125 2023-10-14 09:37:55,332 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1668618.0, ans=0.125 2023-10-14 09:38:30,675 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.00 vs. limit=15.0 2023-10-14 09:38:44,989 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 09:38:46,752 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1668804.6666666667, ans=0.0 2023-10-14 09:38:58,451 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.615e+02 1.897e+02 2.070e+02 2.304e+02 5.805e+02, threshold=4.141e+02, percent-clipped=1.0 2023-10-14 09:39:12,188 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1668944.6666666667, ans=0.125 2023-10-14 09:40:11,374 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1669178.0, ans=0.125 2023-10-14 09:40:14,153 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.21 vs. 
limit=15.0 2023-10-14 09:40:19,505 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1669224.6666666667, ans=0.1 2023-10-14 09:40:21,274 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1669224.6666666667, ans=0.125 2023-10-14 09:40:23,147 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 09:40:31,830 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1669271.3333333333, ans=0.125 2023-10-14 09:40:34,730 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1669271.3333333333, ans=0.0 2023-10-14 09:40:36,788 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.68 vs. limit=22.5 2023-10-14 09:40:37,319 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1669318.0, ans=0.125 2023-10-14 09:40:40,549 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.66 vs. limit=12.0 2023-10-14 09:40:44,682 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.549e+02 1.821e+02 1.993e+02 2.187e+02 3.390e+02, threshold=3.986e+02, percent-clipped=0.0 2023-10-14 09:41:14,597 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1669458.0, ans=0.125 2023-10-14 09:41:37,311 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1669551.3333333333, ans=0.2 2023-10-14 09:41:39,870 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=1669551.3333333333, ans=0.025 2023-10-14 09:42:32,486 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.96 vs. limit=15.0 2023-10-14 09:42:45,369 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.502e+02 1.786e+02 1.964e+02 2.192e+02 3.131e+02, threshold=3.928e+02, percent-clipped=0.0 2023-10-14 09:42:53,820 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1669831.3333333333, ans=0.125 2023-10-14 09:43:05,141 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1669878.0, ans=0.125 2023-10-14 09:43:23,186 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.68 vs. 
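The scaling.py:1069 WithLoss lines attach an auxiliary loss to the named self-attention weights and report its sum (0.000e+00 here, i.e. nothing is currently being penalized). The mechanism can be pictured as an identity op whose backward pass also injects the gradient of an auxiliary penalty; the quadratic penalty below is only a placeholder for whatever the real module measures.

    # Sketch of an "identity with attached auxiliary loss" autograd op, in the
    # spirit of the WithLoss lines above. The quadratic penalty is a
    # placeholder; the real module's auxiliary objective may differ.
    import torch

    class WithAuxLoss(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x: torch.Tensor, scale: float) -> torch.Tensor:
            ctx.save_for_backward(x)
            ctx.scale = scale
            return x  # identity in the forward pass

        @staticmethod
        def backward(ctx, grad_out: torch.Tensor):
            (x,) = ctx.saved_tensors
            aux = ctx.scale * x.pow(2).sum()   # placeholder penalty
            print(f"WithLoss: loss-sum={aux.item():.3e}")
            return grad_out + 2.0 * ctx.scale * x, None  # add d(aux)/dx

    x = torch.randn(4, 8, requires_grad=True)
    y = WithAuxLoss.apply(x, 0.0)  # scale 0.0 -> loss-sum=0.000e+00, as logged
    y.sum().backward()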
limit=6.0 2023-10-14 09:43:42,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1670018.0, ans=0.0 2023-10-14 09:43:46,420 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1670018.0, ans=0.0 2023-10-14 09:43:53,793 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=1670064.6666666667, ans=0.5 2023-10-14 09:44:06,896 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1670111.3333333333, ans=0.0 2023-10-14 09:44:35,727 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1670251.3333333333, ans=0.1 2023-10-14 09:44:46,096 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.66 vs. limit=6.0 2023-10-14 09:44:46,386 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.545e+02 1.820e+02 2.018e+02 2.273e+02 2.903e+02, threshold=4.036e+02, percent-clipped=0.0 2023-10-14 09:44:49,087 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.59 vs. limit=10.0 2023-10-14 09:45:16,252 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.00 vs. limit=15.0 2023-10-14 09:45:26,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1670438.0, ans=0.0 2023-10-14 09:45:39,743 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.07 vs. limit=15.0 2023-10-14 09:45:40,528 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1670484.6666666667, ans=0.2 2023-10-14 09:45:53,260 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.49 vs. limit=15.0 2023-10-14 09:46:12,834 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1670624.6666666667, ans=0.125 2023-10-14 09:46:18,035 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1670624.6666666667, ans=0.0 2023-10-14 09:46:24,968 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1670671.3333333333, ans=0.0 2023-10-14 09:46:37,043 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1670718.0, ans=0.1 2023-10-14 09:46:41,109 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.576e+02 1.811e+02 1.990e+02 2.206e+02 2.848e+02, threshold=3.981e+02, percent-clipped=0.0 2023-10-14 09:46:49,955 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1670764.6666666667, ans=0.0 2023-10-14 09:46:51,493 INFO [train.py:1031] (0/4) Epoch 27, batch 3000, loss[loss=0.1804, simple_loss=0.2452, pruned_loss=0.05782, over 12650.00 frames. 
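Each Whitening line compares a measured metric against a (sometimes scheduled) limit; the metric captures how far an activation's channel covariance is from isotropic ("white"), and the module only pushes back, via gradients, while metric > limit. One plausible such metric, the ratio of the covariance eigenvalues' second moment to their squared mean, is sketched below; it equals 1.0 for perfectly white features and grows with anisotropy. The exact formula in scaling.py may differ in details.

    # Illustrative whitening metric in the spirit of the log lines above.
    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
        # x: (num_frames, num_channels); channels split into num_groups
        n, c = x.shape
        cg = c // num_groups
        x = x.reshape(n, num_groups, cg).transpose(0, 1)
        x = x - x.mean(dim=1, keepdim=True)
        cov = torch.matmul(x.transpose(1, 2), x) / n  # per-group covariance
        num = cg * torch.diagonal(cov @ cov, dim1=1, dim2=2).sum(-1)
        den = torch.diagonal(cov, dim1=1, dim2=2).sum(-1) ** 2
        return (num / den).mean().item()

    white = torch.randn(1000, 192)
    print(whitening_metric(white))   # close to 1.0
    skewed = white * torch.linspace(0.1, 3.0, 192)
    print(whitening_metric(skewed))  # well above 1.0
    # Training would penalize the activation only while metric > limit.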
], tot_loss[loss=0.1859, simple_loss=0.2781, pruned_loss=0.04685, over 25483458.26 frames. ], batch size: 440, lr: 1.28e-03, grad_scale: 8.0 2023-10-14 09:46:53,291 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1670811.3333333333, ans=0.0 2023-10-14 09:46:59,480 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1670811.3333333333, ans=0.0 2023-10-14 09:47:15,307 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=1670904.6666666667, ans=0.05 2023-10-14 09:47:16,252 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1670904.6666666667, ans=0.125 2023-10-14 09:47:43,091 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.49 vs. limit=15.0 2023-10-14 09:47:52,234 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1671044.6666666667, ans=0.125 2023-10-14 09:47:53,349 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1671044.6666666667, ans=0.125 2023-10-14 09:47:55,486 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1671044.6666666667, ans=0.2 2023-10-14 09:48:00,282 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1671091.3333333333, ans=0.125 2023-10-14 09:48:29,018 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1671184.6666666667, ans=0.125 2023-10-14 09:48:30,663 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.789e+02 1.948e+02 2.134e+02 2.677e+02, threshold=3.896e+02, percent-clipped=0.0 2023-10-14 09:48:33,662 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1671231.3333333333, ans=0.2 2023-10-14 09:48:33,936 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.99 vs. limit=22.5 2023-10-14 09:49:01,315 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.69 vs. limit=10.0 2023-10-14 09:49:07,007 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1671324.6666666667, ans=0.04949747468305833 2023-10-14 09:49:07,017 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1671324.6666666667, ans=0.125 2023-10-14 09:49:17,720 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1671371.3333333333, ans=0.125 2023-10-14 09:49:26,839 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.15 vs. 
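The optim.py:471 lines summarize recent parameter-gradient norms as five quantiles (min/25%/median/75%/max) plus a clipping threshold and the fraction of updates clipped. The threshold appears to be Clipping_scale times the median: in the first such line of this stretch, 2.0 x 2.103e+02 = 4.206e+02, matching threshold=4.207e+02 up to rounding. A sketch under that reading, with the window size assumed:

    # Sketch of median-based gradient clipping consistent with the optim.py
    # summaries above: threshold = clipping_scale * median of recent norms.
    # The window size is an assumption.
    from collections import deque
    import numpy as np

    class GradNormClipper:
        def __init__(self, clipping_scale: float = 2.0, window: int = 500):
            self.scale = clipping_scale
            self.norms = deque(maxlen=window)
            self.seen = 0
            self.clipped = 0

        def clip_factor(self, grad_norm: float) -> float:
            self.norms.append(grad_norm)
            threshold = self.scale * float(np.median(self.norms))
            self.seen += 1
            if grad_norm > threshold:
                self.clipped += 1
                return threshold / grad_norm  # multiply the gradient by this
            return 1.0

        def summary(self) -> str:
            q = np.percentile(self.norms, [0, 25, 50, 75, 100])
            pct = 100.0 * self.clipped / max(self.seen, 1)
            return ("Clipping_scale=%.1f, grad-norm quartiles %s, "
                    "threshold=%.3e, percent-clipped=%.1f"
                    % (self.scale, " ".join("%.3e" % v for v in q),
                       self.scale * q[2], pct))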
limit=10.0 2023-10-14 09:49:45,666 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1671511.3333333333, ans=0.1 2023-10-14 09:49:49,343 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1671511.3333333333, ans=0.125 2023-10-14 09:50:00,550 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.58 vs. limit=6.0 2023-10-14 09:50:14,855 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1671604.6666666667, ans=0.125 2023-10-14 09:50:21,505 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.65 vs. limit=15.0 2023-10-14 09:50:23,199 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1671651.3333333333, ans=0.125 2023-10-14 09:50:29,331 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=5.88 vs. limit=15.0 2023-10-14 09:50:29,693 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.479e+02 1.787e+02 1.946e+02 2.089e+02 3.241e+02, threshold=3.891e+02, percent-clipped=0.0 2023-10-14 09:50:48,825 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.58 vs. limit=15.0 2023-10-14 09:51:16,165 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1671838.0, ans=0.0 2023-10-14 09:52:07,020 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1672024.6666666667, ans=0.125 2023-10-14 09:52:09,160 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1672024.6666666667, ans=0.125 2023-10-14 09:52:13,405 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1672071.3333333333, ans=0.2 2023-10-14 09:52:15,451 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1672071.3333333333, ans=0.1 2023-10-14 09:52:19,991 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1672071.3333333333, ans=0.125 2023-10-14 09:52:23,469 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1672071.3333333333, ans=0.125 2023-10-14 09:52:34,003 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1672118.0, ans=0.1 2023-10-14 09:52:37,722 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.558e+02 1.815e+02 1.957e+02 2.178e+02 3.010e+02, threshold=3.915e+02, percent-clipped=0.0 2023-10-14 09:53:16,075 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1672304.6666666667, ans=0.0 2023-10-14 09:53:21,707 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, 
batch_count=1672304.6666666667, ans=0.125 2023-10-14 09:53:30,100 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.87 vs. limit=15.0 2023-10-14 09:53:34,578 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1672351.3333333333, ans=0.1 2023-10-14 09:53:41,549 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1672398.0, ans=0.2 2023-10-14 09:53:48,824 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1672444.6666666667, ans=0.125 2023-10-14 09:54:02,520 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1672491.3333333333, ans=0.125 2023-10-14 09:54:03,502 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1672491.3333333333, ans=0.1 2023-10-14 09:54:37,606 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1672584.6666666667, ans=0.125 2023-10-14 09:54:41,635 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.506e+02 1.891e+02 2.068e+02 2.259e+02 3.172e+02, threshold=4.137e+02, percent-clipped=0.0 2023-10-14 09:54:50,586 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.10 vs. limit=15.0 2023-10-14 09:55:05,943 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.68 vs. limit=15.0 2023-10-14 09:55:13,074 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=17.32 vs. limit=22.5 2023-10-14 09:55:29,759 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1672818.0, ans=0.125 2023-10-14 09:55:41,313 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1672864.6666666667, ans=0.04949747468305833 2023-10-14 09:56:05,932 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.min_positive, batch_count=1672958.0, ans=0.025 2023-10-14 09:56:25,015 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1673004.6666666667, ans=0.0 2023-10-14 09:56:35,698 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1673051.3333333333, ans=0.0 2023-10-14 09:56:39,458 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.662e+02 1.855e+02 2.002e+02 2.190e+02 2.978e+02, threshold=4.004e+02, percent-clipped=0.0 2023-10-14 09:56:48,012 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1673098.0, ans=0.125 2023-10-14 09:56:53,390 INFO [train.py:1031] (0/4) Epoch 27, batch 3500, loss[loss=0.185, simple_loss=0.2738, pruned_loss=0.04807, over 15634.00 frames. ], tot_loss[loss=0.186, simple_loss=0.278, pruned_loss=0.04696, over 27104157.79 frames. 
], batch size: 35, lr: 1.28e-03, grad_scale: 16.0 2023-10-14 09:56:56,731 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1673144.6666666667, ans=0.2 2023-10-14 09:57:09,102 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.84 vs. limit=15.0 2023-10-14 09:57:45,231 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1673331.3333333333, ans=0.0 2023-10-14 09:58:04,281 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1673378.0, ans=0.2 2023-10-14 09:58:52,852 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.966e+02 2.149e+02 2.459e+02 4.602e+02, threshold=4.298e+02, percent-clipped=1.0 2023-10-14 09:58:54,215 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1673564.6666666667, ans=0.125 2023-10-14 09:59:12,679 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1673611.3333333333, ans=0.0 2023-10-14 09:59:35,254 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.69 vs. limit=22.5 2023-10-14 09:59:41,249 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1673704.6666666667, ans=0.0 2023-10-14 09:59:46,244 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2.whitening_limit, batch_count=1673704.6666666667, ans=15.0 2023-10-14 09:59:52,697 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_na.min_abs, batch_count=1673751.3333333333, ans=0.02 2023-10-14 10:00:40,302 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1673938.0, ans=0.125 2023-10-14 10:00:53,004 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1673984.6666666667, ans=0.0 2023-10-14 10:00:54,006 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1673984.6666666667, ans=0.125 2023-10-14 10:01:00,975 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.520e+02 1.806e+02 1.973e+02 2.278e+02 3.650e+02, threshold=3.945e+02, percent-clipped=0.0 2023-10-14 10:01:01,624 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1674031.3333333333, ans=0.07 2023-10-14 10:01:11,674 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1674031.3333333333, ans=0.125 2023-10-14 10:01:20,135 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.40 vs. 
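The loss triples in these train lines are internally consistent with a pruned-transducer objective of the form loss = 0.5 * simple_loss + pruned_loss: for the batch-3500 totals just above, 0.5 x 0.278 + 0.04696 = 0.186, the logged tot loss. The 0.5 weight and unit pruned weight are inferred from the logged numbers, not read from the configuration:

    # The logged triples satisfy loss ~= 0.5 * simple_loss + pruned_loss,
    # matching a pruned-transducer objective with a simple-loss weight of 0.5
    # (inferred from the numbers, treated here as an assumption).
    def combined_loss(simple_loss: float, pruned_loss: float,
                      simple_scale: float = 0.5,
                      pruned_scale: float = 1.0) -> float:
        return simple_scale * simple_loss + pruned_scale * pruned_loss

    # tot_loss values from the "Epoch 27, batch 3500" line above:
    print(combined_loss(0.278, 0.04696))  # ~0.186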
limit=15.0 2023-10-14 10:01:22,900 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1674078.0, ans=0.0 2023-10-14 10:01:26,622 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1674124.6666666667, ans=0.0 2023-10-14 10:01:50,434 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=3.75 vs. limit=15.0 2023-10-14 10:01:53,904 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1674218.0, ans=0.125 2023-10-14 10:02:08,210 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=1674264.6666666667, ans=15.0 2023-10-14 10:02:25,771 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1674311.3333333333, ans=0.1 2023-10-14 10:02:47,172 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_ff3.min_abs, batch_count=1674404.6666666667, ans=0.2 2023-10-14 10:03:06,645 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.532e+02 1.774e+02 1.998e+02 2.172e+02 2.795e+02, threshold=3.995e+02, percent-clipped=0.0 2023-10-14 10:03:12,780 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1674498.0, ans=0.2 2023-10-14 10:03:14,714 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1674498.0, ans=0.025 2023-10-14 10:03:31,684 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.99 vs. limit=15.0 2023-10-14 10:03:52,539 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1674638.0, ans=0.0 2023-10-14 10:03:52,765 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1674638.0, ans=0.1 2023-10-14 10:03:56,072 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1674684.6666666667, ans=0.125 2023-10-14 10:04:05,723 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1674684.6666666667, ans=0.125 2023-10-14 10:04:24,784 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1674778.0, ans=0.015 2023-10-14 10:04:26,102 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1674778.0, ans=0.125 2023-10-14 10:04:33,880 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1674824.6666666667, ans=0.1 2023-10-14 10:04:39,531 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.21 vs. 
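The balancer entries (min_positive, min_abs, prob, ...) describe per-channel activation constraints: a balancer nudges channels whose statistics fall outside a target range, for example a fraction of positive values below min_positive or a mean absolute value below min_abs, and it only runs on a given batch with probability prob. A toy illustration of the statistics being checked; the real module enforces them through gradient terms, which are omitted here:

    # Toy measurement of the per-channel statistics that the balancer entries
    # above constrain. The limits below are examples in the style of the log.
    import torch

    def channel_stats(x: torch.Tensor):
        # x: (num_frames, num_channels)
        frac_positive = (x > 0).float().mean(dim=0)
        mean_abs = x.abs().mean(dim=0)
        return frac_positive, mean_abs

    x = torch.randn(1000, 8) - 0.5        # shifted, so fewer positives
    frac_pos, mean_abs = channel_stats(x)
    min_positive, min_abs = 0.05, 0.2     # example limits
    bad = (frac_pos < min_positive) | (mean_abs < min_abs)
    print("channels needing correction:", bad.nonzero().flatten().tolist())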
limit=6.0 2023-10-14 10:04:45,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1674824.6666666667, ans=0.0 2023-10-14 10:04:59,893 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1674871.3333333333, ans=0.125 2023-10-14 10:05:08,150 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1674918.0, ans=0.07 2023-10-14 10:05:14,462 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.432e+02 1.746e+02 1.875e+02 2.230e+02 2.957e+02, threshold=3.750e+02, percent-clipped=0.0 2023-10-14 10:05:16,559 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1674964.6666666667, ans=0.0 2023-10-14 10:05:20,126 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1674964.6666666667, ans=0.1 2023-10-14 10:05:49,999 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.25 vs. limit=22.5 2023-10-14 10:06:07,763 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1675151.3333333333, ans=0.1 2023-10-14 10:06:17,908 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1675198.0, ans=0.125 2023-10-14 10:06:22,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1675244.6666666667, ans=0.025 2023-10-14 10:06:23,500 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.50 vs. limit=10.0 2023-10-14 10:06:32,073 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1675244.6666666667, ans=0.2 2023-10-14 10:06:44,464 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1675291.3333333333, ans=0.125 2023-10-14 10:06:45,770 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1675291.3333333333, ans=0.125 2023-10-14 10:06:54,114 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1675338.0, ans=0.125 2023-10-14 10:06:54,139 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1675338.0, ans=0.2 2023-10-14 10:07:12,197 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.477e+02 1.762e+02 1.935e+02 2.171e+02 2.801e+02, threshold=3.871e+02, percent-clipped=0.0 2023-10-14 10:07:22,795 INFO [train.py:1031] (0/4) Epoch 27, batch 4000, loss[loss=0.1929, simple_loss=0.2905, pruned_loss=0.04762, over 16859.00 frames. ], tot_loss[loss=0.1859, simple_loss=0.2776, pruned_loss=0.04707, over 28348964.64 frames. 
], batch size: 110, lr: 1.27e-03, grad_scale: 32.0 2023-10-14 10:07:46,949 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1675524.6666666667, ans=0.125 2023-10-14 10:07:49,056 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1675571.3333333333, ans=0.125 2023-10-14 10:07:49,923 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1675571.3333333333, ans=0.125 2023-10-14 10:07:50,058 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1675571.3333333333, ans=0.025 2023-10-14 10:08:10,139 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1675618.0, ans=0.125 2023-10-14 10:09:00,188 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.65 vs. limit=15.0 2023-10-14 10:09:07,129 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1675851.3333333333, ans=0.2 2023-10-14 10:09:08,139 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1675851.3333333333, ans=0.0 2023-10-14 10:09:09,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1675851.3333333333, ans=0.0 2023-10-14 10:09:10,029 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1675851.3333333333, ans=0.125 2023-10-14 10:09:16,562 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1675898.0, ans=0.0 2023-10-14 10:09:17,278 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.544e+02 1.869e+02 2.046e+02 2.247e+02 3.139e+02, threshold=4.092e+02, percent-clipped=0.0 2023-10-14 10:09:20,330 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1675898.0, ans=0.1 2023-10-14 10:09:22,038 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1675898.0, ans=0.125 2023-10-14 10:09:38,962 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1675991.3333333333, ans=0.0 2023-10-14 10:09:41,979 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1675991.3333333333, ans=0.1 2023-10-14 10:09:41,993 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1675991.3333333333, ans=0.1 2023-10-14 10:09:46,771 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1675991.3333333333, ans=0.0 2023-10-14 10:09:52,064 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1676038.0, ans=0.125 2023-10-14 10:10:11,668 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1676084.6666666667, ans=0.0 2023-10-14 10:10:18,715 INFO 
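The learning rate printed with each batch line (lr: 1.27e-03 here, down from 1.28e-03 earlier) decays slowly with both batch index and epoch, in the manner of icefall's Eden-style schedule. A sketch of that form is below; the base_lr, lr_batches, and lr_epochs constants are illustrative assumptions, and the example step count is chosen near the checkpoint-360000 index saved later in this log.

    # Sketch of an Eden-style learning-rate schedule consistent with the slow
    # decay of the lr values in the train lines. Constants are assumptions.
    def eden_lr(base_lr: float, batch: int, epoch: float,
                lr_batches: float = 7500.0, lr_epochs: float = 1.0) -> float:
        batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
        epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
        return base_lr * batch_factor * epoch_factor

    # Late in training both factors are small, giving lr on the order of 1e-3:
    print(f"{eden_lr(0.045, batch=360000, epoch=27):.2e}")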
[scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1676131.3333333333, ans=0.0 2023-10-14 10:10:36,182 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1676178.0, ans=0.0 2023-10-14 10:10:42,839 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1676224.6666666667, ans=0.0 2023-10-14 10:10:44,738 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.31 vs. limit=15.0 2023-10-14 10:11:02,096 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1676271.3333333333, ans=0.125 2023-10-14 10:11:03,554 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1676271.3333333333, ans=0.0 2023-10-14 10:11:03,571 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1676271.3333333333, ans=0.0 2023-10-14 10:11:21,181 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.54 vs. limit=12.0 2023-10-14 10:11:24,786 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.29 vs. limit=15.0 2023-10-14 10:11:31,352 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.564e+02 1.822e+02 1.949e+02 2.140e+02 3.283e+02, threshold=3.899e+02, percent-clipped=0.0 2023-10-14 10:11:33,296 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1676364.6666666667, ans=0.125 2023-10-14 10:12:02,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1676458.0, ans=0.125 2023-10-14 10:12:02,526 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1676458.0, ans=0.125 2023-10-14 10:12:12,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1676504.6666666667, ans=0.125 2023-10-14 10:12:22,806 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1676551.3333333333, ans=0.1 2023-10-14 10:12:32,608 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1676598.0, ans=0.0 2023-10-14 10:12:52,501 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1676644.6666666667, ans=0.125 2023-10-14 10:12:52,536 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1676644.6666666667, ans=0.1 2023-10-14 10:12:53,612 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1676644.6666666667, ans=0.125 2023-10-14 10:13:10,232 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1676738.0, ans=0.125 2023-10-14 10:13:23,671 INFO [scaling.py:199] (0/4) 
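The various *_skip_rate values (attention_skip_rate, conv_skip_rate, ff2_skip_rate, ff3_skip_rate, ...) are scheduled probabilities of dropping an entire sub-module's contribution for a batch, a stochastic-depth-style regularizer; by this point in training most of them have decayed to ans=0.0, so the sub-modules nearly always run. A sketch of that behavior, with names illustrative:

    # Sketch of sub-module skipping driven by the *_skip_rate values above:
    # with probability skip_rate the sub-module's residual contribution is
    # dropped for the whole batch; otherwise it is added as usual.
    import torch

    def maybe_apply(module, x: torch.Tensor, skip_rate: float,
                    training: bool) -> torch.Tensor:
        if training and torch.rand(()) < skip_rate:
            return x                  # skip this sub-module for this batch
        return x + module(x)          # normal residual contribution

    ff = torch.nn.Linear(16, 16)
    x = torch.randn(4, 16)
    y = maybe_apply(ff, x, skip_rate=0.0, training=True)  # always applied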
ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1676784.6666666667, ans=0.2 2023-10-14 10:13:28,622 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1676784.6666666667, ans=0.2 2023-10-14 10:13:36,272 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.596e+02 1.827e+02 1.988e+02 2.160e+02 2.795e+02, threshold=3.976e+02, percent-clipped=0.0 2023-10-14 10:13:42,752 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1676831.3333333333, ans=0.125 2023-10-14 10:13:59,841 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.64 vs. limit=10.0 2023-10-14 10:14:04,434 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.61 vs. limit=6.0 2023-10-14 10:14:14,373 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1676971.3333333333, ans=0.125 2023-10-14 10:14:20,285 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1677018.0, ans=0.125 2023-10-14 10:14:23,430 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1677018.0, ans=0.0 2023-10-14 10:14:33,356 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1677064.6666666667, ans=0.125 2023-10-14 10:14:37,408 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1677064.6666666667, ans=0.125 2023-10-14 10:14:47,055 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1677111.3333333333, ans=0.1 2023-10-14 10:14:53,895 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1677111.3333333333, ans=0.0 2023-10-14 10:15:38,564 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.631e+02 1.862e+02 1.982e+02 2.124e+02 2.987e+02, threshold=3.964e+02, percent-clipped=0.0 2023-10-14 10:15:44,011 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.32 vs. limit=15.0 2023-10-14 10:15:44,081 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.46 vs. limit=10.0 2023-10-14 10:15:46,079 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.24 vs. limit=12.0 2023-10-14 10:15:58,025 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=10.71 vs. 
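The bypass.scale_min and bypass_mid.scale_min entries come from the bypass connections around each layer: roughly, the layer output is blended with the layer input through a learned per-channel scale that is clamped to at least scale_min (ans=0.2 at this point in the schedule). A sketch under that reading; the exact blend form and clamping range are assumptions:

    # Sketch of a bypass blend consistent with the bypass.scale_min entries:
    #   y = x + scale * (layer_out - x), scale clamped to [scale_min, 1.0].
    import torch

    class Bypass(torch.nn.Module):
        def __init__(self, num_channels: int):
            super().__init__()
            self.scale = torch.nn.Parameter(torch.full((num_channels,), 0.5))

        def forward(self, x, layer_out, scale_min: float = 0.2):
            s = self.scale.clamp(min=scale_min, max=1.0)
            return x + s * (layer_out - x)

    bp = Bypass(8)
    x, layer_out = torch.randn(4, 8), torch.randn(4, 8)
    out = bp(x, layer_out)  # lies between x (s -> 0) and layer_out (s = 1)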
limit=12.0 2023-10-14 10:16:19,959 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1677438.0, ans=0.125 2023-10-14 10:16:20,077 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1677438.0, ans=0.125 2023-10-14 10:16:20,120 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1677438.0, ans=0.0 2023-10-14 10:16:27,590 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1677484.6666666667, ans=0.05 2023-10-14 10:16:32,285 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.38 vs. limit=10.0 2023-10-14 10:16:33,327 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1677484.6666666667, ans=0.1 2023-10-14 10:16:33,426 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1677484.6666666667, ans=0.125 2023-10-14 10:16:36,198 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1677484.6666666667, ans=0.5 2023-10-14 10:17:04,178 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1677578.0, ans=0.0 2023-10-14 10:17:12,282 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1677624.6666666667, ans=0.125 2023-10-14 10:17:30,040 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1677671.3333333333, ans=0.125 2023-10-14 10:17:50,587 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1677718.0, ans=0.125 2023-10-14 10:17:54,720 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.827e+02 2.002e+02 2.289e+02 4.093e+02, threshold=4.005e+02, percent-clipped=1.0 2023-10-14 10:18:04,800 INFO [train.py:1031] (0/4) Epoch 27, batch 4500, loss[loss=0.1806, simple_loss=0.2712, pruned_loss=0.045, over 15680.00 frames. ], tot_loss[loss=0.1859, simple_loss=0.2779, pruned_loss=0.04689, over 29359377.35 frames. ], batch size: 35, lr: 1.27e-03, grad_scale: 32.0 2023-10-14 10:18:35,283 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.94 vs. 
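The grad_scale field in the train lines is the dynamic loss-scaling factor for fp16 training, and its movement in this log (8.0 at batch 3000, 16.0 at 3500, 32.0 at 4000 and 4500) matches the usual rule: grow after a run of finite gradients, cut back on overflow. A minimal sketch of such an update rule, with the growth interval and backoff as typical assumed defaults:

    # Minimal sketch of dynamic fp16 loss scaling consistent with the
    # grad_scale values above. Interval and backoff are assumed defaults.
    class LossScale:
        def __init__(self, init_scale=8.0, growth=2.0, backoff=0.5,
                     growth_interval=2000):
            self.scale = init_scale
            self.growth = growth
            self.backoff = backoff
            self.growth_interval = growth_interval
            self._good_steps = 0

        def update(self, found_inf: bool) -> None:
            if found_inf:
                self.scale *= self.backoff  # overflow: shrink and restart
                self._good_steps = 0
            else:
                self._good_steps += 1
                if self._good_steps % self.growth_interval == 0:
                    self.scale *= self.growth  # e.g. 16.0 -> 32.0

    scaler = LossScale()
    for step in range(4000):
        scaler.update(found_inf=False)
    print(scaler.scale)  # 32.0 after two growth intervals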
limit=10.0 2023-10-14 10:18:36,128 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1677904.6666666667, ans=0.125 2023-10-14 10:18:42,480 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1677951.3333333333, ans=0.0 2023-10-14 10:18:57,132 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1677998.0, ans=0.0 2023-10-14 10:19:07,853 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1678044.6666666667, ans=0.125 2023-10-14 10:19:21,541 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 10:19:49,101 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.770e+02 1.916e+02 2.101e+02 2.914e+02, threshold=3.832e+02, percent-clipped=0.0 2023-10-14 10:20:15,050 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1678324.6666666667, ans=0.0 2023-10-14 10:20:17,280 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.69 vs. limit=15.0 2023-10-14 10:20:23,728 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1678371.3333333333, ans=0.125 2023-10-14 10:21:00,463 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1678511.3333333333, ans=0.0 2023-10-14 10:21:08,548 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1678558.0, ans=0.05 2023-10-14 10:21:11,070 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1678558.0, ans=0.1 2023-10-14 10:21:12,460 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.89 vs. limit=15.0 2023-10-14 10:21:35,632 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.88 vs. limit=15.0 2023-10-14 10:21:41,059 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1678651.3333333333, ans=0.125 2023-10-14 10:21:45,154 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.420e+02 1.886e+02 2.008e+02 2.244e+02 3.092e+02, threshold=4.015e+02, percent-clipped=0.0 2023-10-14 10:21:51,858 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1678698.0, ans=0.0 2023-10-14 10:23:05,926 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.37 vs. limit=22.5 2023-10-14 10:23:06,887 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.50 vs. 
limit=6.0 2023-10-14 10:23:11,914 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1679024.6666666667, ans=0.1 2023-10-14 10:23:18,763 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1679071.3333333333, ans=0.125 2023-10-14 10:23:23,702 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1679071.3333333333, ans=0.2 2023-10-14 10:23:29,614 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1679118.0, ans=0.1 2023-10-14 10:23:53,836 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.779e+02 1.925e+02 2.109e+02 2.801e+02, threshold=3.849e+02, percent-clipped=0.0 2023-10-14 10:23:58,179 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.48 vs. limit=15.0 2023-10-14 10:24:07,575 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1679211.3333333333, ans=0.1 2023-10-14 10:24:10,547 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1679211.3333333333, ans=0.0 2023-10-14 10:24:52,612 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1679351.3333333333, ans=0.2 2023-10-14 10:25:24,674 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.67 vs. limit=15.0 2023-10-14 10:25:33,335 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1679538.0, ans=0.125 2023-10-14 10:25:37,307 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1679538.0, ans=0.125 2023-10-14 10:25:37,586 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.34 vs. limit=22.5 2023-10-14 10:25:39,755 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.85 vs. limit=12.0 2023-10-14 10:26:00,320 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.440e+02 1.909e+02 2.031e+02 2.243e+02 3.076e+02, threshold=4.063e+02, percent-clipped=0.0 2023-10-14 10:26:19,274 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1679678.0, ans=0.1 2023-10-14 10:26:22,459 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1679724.6666666667, ans=0.125 2023-10-14 10:26:27,028 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1679724.6666666667, ans=0.0 2023-10-14 10:26:41,328 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.32 vs. 
limit=15.0 2023-10-14 10:26:49,192 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1679818.0, ans=0.1 2023-10-14 10:27:36,536 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.07 vs. limit=10.0 2023-10-14 10:27:42,864 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-360000.pt 2023-10-14 10:28:05,960 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1680051.3333333333, ans=0.125 2023-10-14 10:28:06,102 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1680051.3333333333, ans=0.125 2023-10-14 10:28:23,846 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.545e+02 1.833e+02 1.976e+02 2.157e+02 3.051e+02, threshold=3.952e+02, percent-clipped=0.0 2023-10-14 10:28:26,666 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1680098.0, ans=0.1 2023-10-14 10:28:32,660 INFO [train.py:1031] (0/4) Epoch 27, batch 5000, loss[loss=0.1843, simple_loss=0.281, pruned_loss=0.04379, over 16831.00 frames. ], tot_loss[loss=0.1856, simple_loss=0.2776, pruned_loss=0.04685, over 30115366.05 frames. ], batch size: 98, lr: 1.27e-03, grad_scale: 32.0 2023-10-14 10:28:36,523 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.56 vs. limit=15.0 2023-10-14 10:28:58,657 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1680191.3333333333, ans=0.0 2023-10-14 10:28:58,935 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.37 vs. limit=15.0 2023-10-14 10:29:01,773 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1680238.0, ans=0.0 2023-10-14 10:29:01,831 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1680238.0, ans=0.125 2023-10-14 10:29:03,454 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1680238.0, ans=0.0 2023-10-14 10:29:14,443 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1680284.6666666667, ans=0.125 2023-10-14 10:29:32,702 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.44 vs. limit=12.0 2023-10-14 10:29:46,631 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.49 vs. 
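The checkpoint.py:75 line above shows a batch-indexed checkpoint (checkpoint-360000.pt) being written mid-epoch, separate from per-epoch checkpoints; such files are emitted every save_every_n training batches. A sketch of that trigger; the helper name is illustrative, and the 8000-batch interval is an assumption consistent with the 360000 index:

    # Sketch of batch-indexed checkpointing behind the checkpoint.py line
    # above: every save_every_n batches, write model/optimizer state under
    # the experiment directory. Names and the interval are assumptions.
    from pathlib import Path
    import torch

    def maybe_save_batch_checkpoint(model, optimizer, batch_idx_train: int,
                                    exp_dir: Path, save_every_n: int = 8000):
        if batch_idx_train == 0 or batch_idx_train % save_every_n != 0:
            return
        out = exp_dir / f"checkpoint-{batch_idx_train}.pt"
        torch.save(
            {
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "batch_idx_train": batch_idx_train,
            },
            out,
        )
        print(f"Saving checkpoint to {out}")

In a DDP run like this one, only rank 0 would perform the save, which is why the message appears once from process (0/4).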
limit=15.0 2023-10-14 10:30:02,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1680471.3333333333, ans=0.0 2023-10-14 10:30:30,207 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 1.833e+02 1.992e+02 2.209e+02 2.946e+02, threshold=3.984e+02, percent-clipped=0.0 2023-10-14 10:31:02,357 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1680658.0, ans=0.2 2023-10-14 10:31:11,889 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=1680704.6666666667, ans=0.05 2023-10-14 10:31:19,026 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1680751.3333333333, ans=0.125 2023-10-14 10:31:51,173 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1680844.6666666667, ans=0.0 2023-10-14 10:31:54,344 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1680844.6666666667, ans=0.125 2023-10-14 10:32:16,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1680938.0, ans=0.125 2023-10-14 10:32:43,398 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.09 vs. limit=12.0 2023-10-14 10:32:47,448 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1681031.3333333333, ans=0.0 2023-10-14 10:32:49,795 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.627e+02 1.819e+02 2.030e+02 2.220e+02 3.325e+02, threshold=4.060e+02, percent-clipped=0.0 2023-10-14 10:32:51,356 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1681031.3333333333, ans=0.125 2023-10-14 10:32:52,761 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1681031.3333333333, ans=0.1 2023-10-14 10:33:15,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1681124.6666666667, ans=0.0 2023-10-14 10:33:27,836 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1681124.6666666667, ans=0.125 2023-10-14 10:33:40,379 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1681171.3333333333, ans=0.125 2023-10-14 10:34:06,520 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1681311.3333333333, ans=0.125 2023-10-14 10:34:24,283 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.88 vs. 
limit=15.0 2023-10-14 10:34:37,792 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1681404.6666666667, ans=0.125 2023-10-14 10:34:50,356 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1681451.3333333333, ans=0.125 2023-10-14 10:34:55,239 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=1681451.3333333333, ans=0.5 2023-10-14 10:35:03,852 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 10:35:10,568 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.866e+02 2.075e+02 2.287e+02 3.292e+02, threshold=4.150e+02, percent-clipped=0.0 2023-10-14 10:35:19,737 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1681544.6666666667, ans=0.125 2023-10-14 10:35:23,832 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1681544.6666666667, ans=0.125 2023-10-14 10:35:40,519 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1681591.3333333333, ans=0.125 2023-10-14 10:36:06,516 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1681684.6666666667, ans=0.125 2023-10-14 10:36:16,308 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1681731.3333333333, ans=0.125 2023-10-14 10:36:29,022 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1681778.0, ans=0.2 2023-10-14 10:36:31,358 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1681778.0, ans=0.2 2023-10-14 10:36:32,411 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1681778.0, ans=0.2 2023-10-14 10:36:40,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1681824.6666666667, ans=0.2 2023-10-14 10:36:54,311 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.65 vs. limit=15.0 2023-10-14 10:37:16,336 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1681918.0, ans=0.0 2023-10-14 10:37:26,263 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.516e+02 1.763e+02 1.890e+02 2.133e+02 2.998e+02, threshold=3.780e+02, percent-clipped=0.0 2023-10-14 10:37:40,269 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.73 vs. limit=6.0 2023-10-14 10:38:39,867 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.26 vs. 
limit=15.0 2023-10-14 10:38:55,220 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1682244.6666666667, ans=0.125 2023-10-14 10:39:01,838 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1682244.6666666667, ans=0.2 2023-10-14 10:39:51,812 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.813e+02 1.997e+02 2.209e+02 2.946e+02, threshold=3.994e+02, percent-clipped=0.0 2023-10-14 10:39:53,276 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1682431.3333333333, ans=0.125 2023-10-14 10:39:57,049 INFO [train.py:1031] (0/4) Epoch 27, batch 5500, loss[loss=0.1927, simple_loss=0.285, pruned_loss=0.05014, over 16536.00 frames. ], tot_loss[loss=0.1855, simple_loss=0.2775, pruned_loss=0.04679, over 30710210.59 frames. ], batch size: 266, lr: 1.27e-03, grad_scale: 16.0 2023-10-14 10:39:59,312 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1682478.0, ans=0.125 2023-10-14 10:40:06,504 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1682478.0, ans=0.0 2023-10-14 10:40:12,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1682524.6666666667, ans=0.125 2023-10-14 10:40:24,783 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.00 vs. limit=15.0 2023-10-14 10:40:27,342 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1682571.3333333333, ans=0.07 2023-10-14 10:40:53,115 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0 2023-10-14 10:41:06,710 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1682758.0, ans=0.1 2023-10-14 10:41:18,205 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.69 vs. 
limit=15.0 2023-10-14 10:41:40,411 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1682851.3333333333, ans=0.125 2023-10-14 10:41:40,581 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1682851.3333333333, ans=0.2 2023-10-14 10:41:51,508 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.451e+02 1.740e+02 1.902e+02 2.192e+02 2.995e+02, threshold=3.805e+02, percent-clipped=0.0 2023-10-14 10:42:31,819 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1683038.0, ans=0.1 2023-10-14 10:42:46,581 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1683084.6666666667, ans=0.125 2023-10-14 10:42:49,682 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1683084.6666666667, ans=0.05 2023-10-14 10:42:49,916 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1683084.6666666667, ans=0.125 2023-10-14 10:43:16,955 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.97 vs. limit=15.0 2023-10-14 10:43:31,966 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1683224.6666666667, ans=0.125 2023-10-14 10:43:34,811 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 10:43:41,053 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1683271.3333333333, ans=0.125 2023-10-14 10:43:53,810 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1683318.0, ans=0.0 2023-10-14 10:44:04,050 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1683318.0, ans=0.125 2023-10-14 10:44:14,983 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.508e+02 1.833e+02 2.004e+02 2.154e+02 3.075e+02, threshold=4.007e+02, percent-clipped=0.0 2023-10-14 10:44:19,002 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.45 vs. limit=15.0 2023-10-14 10:44:43,460 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.76 vs. 
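The optim.py:471 lines give the min/25%/median/75%/max of recent per-batch gradient norms, and in every instance the logged threshold equals Clipping_scale times the median (here 2.0 x 1.902e+02 ~ 3.805e+02); percent-clipped is the share of batches whose norm exceeded that threshold. A sketch of the bookkeeping, with the history window and the exact way the clip factor is applied assumed for illustration:

import torch

def grad_norm_stats(grad_norms: torch.Tensor, clipping_scale: float = 2.0):
    # grad_norms: norms from the most recent batches (window size assumed)
    q = torch.quantile(grad_norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * q[2].item()        # Clipping_scale * median
    pct = 100.0 * (grad_norms > threshold).float().mean().item()
    print("grad-norm quartiles "
          + " ".join(f"{v:.3e}" for v in q.tolist())
          + f", threshold={threshold:.3e}, percent-clipped={pct:.1f}")
    # a batch whose norm is g gets its gradient multiplied by min(1, threshold/g)
    return (threshold / grad_norms).clamp(max=1.0)

grad_norm_stats(torch.tensor([145.1, 174.0, 190.2, 219.2, 299.5]))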
limit=10.0 2023-10-14 10:45:02,341 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1683551.3333333333, ans=0.1 2023-10-14 10:45:26,489 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1683644.6666666667, ans=0.05 2023-10-14 10:45:28,517 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1683644.6666666667, ans=0.125 2023-10-14 10:46:11,887 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1683784.6666666667, ans=0.1 2023-10-14 10:46:16,653 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1683831.3333333333, ans=0.0 2023-10-14 10:46:22,363 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.604e+02 1.807e+02 1.967e+02 2.206e+02 2.804e+02, threshold=3.935e+02, percent-clipped=0.0 2023-10-14 10:46:23,516 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1683831.3333333333, ans=0.0 2023-10-14 10:46:27,735 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1683878.0, ans=0.0 2023-10-14 10:46:34,951 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1683878.0, ans=0.05 2023-10-14 10:47:07,534 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1683971.3333333333, ans=0.125 2023-10-14 10:47:09,752 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 10:47:15,506 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1684018.0, ans=0.1 2023-10-14 10:47:58,870 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1684111.3333333333, ans=0.125 2023-10-14 10:48:24,173 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1684204.6666666667, ans=0.125 2023-10-14 10:48:50,001 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1684298.0, ans=0.09899494936611666 2023-10-14 10:48:55,813 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.742e+02 1.925e+02 2.130e+02 2.850e+02, threshold=3.849e+02, percent-clipped=0.0 2023-10-14 10:49:04,498 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.93 vs. 
limit=15.0 2023-10-14 10:49:28,222 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1684438.0, ans=0.2 2023-10-14 10:49:29,409 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1684438.0, ans=0.1 2023-10-14 10:49:31,706 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1684438.0, ans=0.1 2023-10-14 10:49:48,753 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1684484.6666666667, ans=0.0 2023-10-14 10:49:49,927 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1684484.6666666667, ans=0.2 2023-10-14 10:49:54,590 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1684484.6666666667, ans=0.0 2023-10-14 10:50:00,634 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1684531.3333333333, ans=0.1 2023-10-14 10:50:21,294 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.19 vs. limit=15.0 2023-10-14 10:50:26,146 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.58 vs. limit=15.0 2023-10-14 10:50:31,647 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1684624.6666666667, ans=0.0 2023-10-14 10:51:05,779 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.384e+02 1.837e+02 1.998e+02 2.184e+02 2.794e+02, threshold=3.997e+02, percent-clipped=0.0 2023-10-14 10:51:07,143 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1684764.6666666667, ans=0.125 2023-10-14 10:51:11,221 INFO [train.py:1031] (0/4) Epoch 27, batch 6000, loss[loss=0.1888, simple_loss=0.2577, pruned_loss=0.05988, over 12349.00 frames. ], tot_loss[loss=0.1858, simple_loss=0.2778, pruned_loss=0.04693, over 31193899.80 frames. ], batch size: 440, lr: 1.27e-03, grad_scale: 32.0 2023-10-14 10:51:11,701 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1684811.3333333333, ans=0.0 2023-10-14 10:51:15,876 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.30 vs. limit=15.0 2023-10-14 10:51:24,696 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1684858.0, ans=0.125 2023-10-14 10:51:56,055 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1684951.3333333333, ans=0.0 2023-10-14 10:52:46,356 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.66 vs. 
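The Whitening lines compare a per-module statistic of the activations against a scheduled limit. The metric is a normalized measure of how far the channel covariance departs from a multiple of the identity: exactly 1.0 for perfectly "white" activations and larger when channels are correlated or unevenly scaled, which is consistent with the logged values falling mostly between 2 and 20 against limits of 6-22.5. The normalization below is an assumption chosen so that an identity covariance scores 1.0; treat it as a sketch of the idea rather than the exact scaling.py computation:

import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    # x: (num_frames, num_channels); channels split into num_groups groups
    n, c = x.shape
    g = c // num_groups
    x = x.reshape(n, num_groups, g).transpose(0, 1)       # (groups, n, g)
    x = x - x.mean(dim=1, keepdim=True)
    cov = x.transpose(1, 2) @ x / n                       # (groups, g, g)
    diag_mean = cov.diagonal(dim1=1, dim2=2).mean(dim=1)  # (groups,)
    # == 1.0 when cov is a multiple of the identity, > 1.0 otherwise
    metric = (cov ** 2).sum(dim=(1, 2)) / (diag_mean ** 2 * g)
    return metric.mean().item()

print(whitening_metric(torch.randn(10000, 384)))          # ~1.0 for i.i.d. channels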
limit=22.5 2023-10-14 10:52:51,335 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1685138.0, ans=0.125 2023-10-14 10:52:51,574 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1685138.0, ans=0.125 2023-10-14 10:52:56,955 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.85 vs. limit=15.0 2023-10-14 10:53:03,933 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1685184.6666666667, ans=0.2 2023-10-14 10:53:17,417 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.559e+02 1.808e+02 1.970e+02 2.237e+02 2.992e+02, threshold=3.939e+02, percent-clipped=0.0 2023-10-14 10:53:17,866 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 10:53:35,290 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.37 vs. limit=15.0 2023-10-14 10:53:56,312 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.30 vs. limit=15.0 2023-10-14 10:54:06,038 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.41 vs. limit=12.0 2023-10-14 10:54:22,827 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1685464.6666666667, ans=0.125 2023-10-14 10:54:28,039 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.58 vs. limit=10.0 2023-10-14 10:54:49,493 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1685511.3333333333, ans=0.125 2023-10-14 10:54:52,813 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1685558.0, ans=0.2 2023-10-14 10:55:16,577 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1685604.6666666667, ans=0.0 2023-10-14 10:55:19,946 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.58 vs. limit=15.0 2023-10-14 10:55:40,035 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.446e+02 1.886e+02 2.058e+02 2.244e+02 3.385e+02, threshold=4.117e+02, percent-clipped=0.0 2023-10-14 10:56:34,198 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1685884.6666666667, ans=0.125 2023-10-14 10:56:38,696 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1685931.3333333333, ans=0.125 2023-10-14 10:56:41,156 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.80 vs. limit=10.0 2023-10-14 10:56:44,606 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.86 vs. 
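tot_loss[...] in the batch summaries is not the last batch's loss but a frame-weighted running average, and the fractional, slowly growing frame counts (30,710,210.59 at batch 5500, 31,193,899.80 at batch 6000) suggest the accumulated statistics are also decayed rather than summed indefinitely. The decay constant below is an assumed value; the weighting by frame count is what the log format implies:

class RunningLoss:
    """Frame-weighted, exponentially decayed loss average, sketching how a
    'tot_loss[..., over N frames]' summary can be maintained."""

    def __init__(self, decay: float = 0.999):          # decay value assumed
        self.decay, self.loss_sum, self.frames = decay, 0.0, 0.0

    def update(self, batch_loss: float, batch_frames: float) -> float:
        self.loss_sum = self.decay * self.loss_sum + batch_loss * batch_frames
        self.frames = self.decay * self.frames + batch_frames
        return self.loss_sum / self.frames             # the reported tot_loss

tot = RunningLoss()
for loss, frames in [(0.1927, 16536.0), (0.1888, 12349.0)]:
    print(f"tot_loss={tot.update(loss, frames):.4f}, over {tot.frames:.2f} frames")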
limit=22.5 2023-10-14 10:56:46,479 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1685931.3333333333, ans=0.0 2023-10-14 10:56:53,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1685978.0, ans=0.0 2023-10-14 10:57:05,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1686024.6666666667, ans=0.125 2023-10-14 10:57:20,224 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1686071.3333333333, ans=0.0 2023-10-14 10:57:24,027 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1686071.3333333333, ans=0.0 2023-10-14 10:57:29,537 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1686071.3333333333, ans=0.0 2023-10-14 10:57:52,145 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.617e+02 1.852e+02 1.962e+02 2.185e+02 3.142e+02, threshold=3.924e+02, percent-clipped=0.0 2023-10-14 10:57:52,676 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1686164.6666666667, ans=0.125 2023-10-14 10:59:21,434 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1686444.6666666667, ans=0.125 2023-10-14 11:00:14,625 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1686584.6666666667, ans=0.2 2023-10-14 11:00:25,823 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.827e+02 1.947e+02 2.089e+02 2.974e+02, threshold=3.893e+02, percent-clipped=0.0 2023-10-14 11:00:28,852 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.84 vs. limit=15.0 2023-10-14 11:00:28,865 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.24 vs. limit=15.0 2023-10-14 11:00:35,792 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.30 vs. limit=6.0 2023-10-14 11:00:37,526 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1686678.0, ans=0.0 2023-10-14 11:00:37,842 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.97 vs. limit=15.0 2023-10-14 11:01:14,314 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.81 vs. 
limit=22.5 2023-10-14 11:01:19,651 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1686818.0, ans=0.0 2023-10-14 11:01:28,039 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1686864.6666666667, ans=0.125 2023-10-14 11:01:31,530 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1686864.6666666667, ans=0.125 2023-10-14 11:01:32,553 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1686864.6666666667, ans=0.2 2023-10-14 11:01:38,774 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1686911.3333333333, ans=0.0 2023-10-14 11:01:44,214 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.24 vs. limit=15.0 2023-10-14 11:01:48,366 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1686911.3333333333, ans=0.0 2023-10-14 11:01:56,653 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.49 vs. limit=15.0 2023-10-14 11:01:58,793 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1686958.0, ans=0.1 2023-10-14 11:02:08,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1687004.6666666667, ans=0.0 2023-10-14 11:02:27,021 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.56 vs. limit=15.0 2023-10-14 11:02:31,050 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.60 vs. limit=15.0 2023-10-14 11:02:31,631 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1687051.3333333333, ans=0.07 2023-10-14 11:02:41,487 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.569e+02 1.773e+02 2.026e+02 2.262e+02 3.217e+02, threshold=4.051e+02, percent-clipped=0.0 2023-10-14 11:02:47,176 INFO [train.py:1031] (0/4) Epoch 27, batch 6500, loss[loss=0.1896, simple_loss=0.2922, pruned_loss=0.04352, over 16860.00 frames. ], tot_loss[loss=0.1863, simple_loss=0.2784, pruned_loss=0.0471, over 31551660.43 frames. ], batch size: 146, lr: 1.27e-03, grad_scale: 32.0 2023-10-14 11:03:02,068 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.22 vs. 
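The learning rate holds at 1.27e-03 across batches 5500-8000 of epoch 27, as expected from a schedule that decays as a power law in both batch index and epoch and is nearly flat this deep into training. icefall's zipformer recipes use an Eden-style schedule of roughly the form below; the exponents follow Eden as I recall it from icefall's optim.py, and the constants are placeholders rather than this run's hyperparameters:

def eden_lr(base_lr: float, batch: int, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 3.5) -> float:
    # Power-law decay in both batch and epoch; flattens once both are large.
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.5
    return base_lr * batch_factor * epoch_factor

# 14k batches apart, the lr changes by well under 1%, so the summaries all
# round to the same value:
print(eden_lr(0.045, 1_682_000, 27.0))
print(eden_lr(0.045, 1_696_000, 27.0))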
limit=15.0 2023-10-14 11:03:10,944 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1687191.3333333333, ans=0.0 2023-10-14 11:03:20,849 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1687238.0, ans=0.125 2023-10-14 11:04:37,142 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1687424.6666666667, ans=0.0 2023-10-14 11:05:06,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1687518.0, ans=0.0 2023-10-14 11:05:10,970 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1687518.0, ans=0.0 2023-10-14 11:05:18,018 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1687518.0, ans=0.125 2023-10-14 11:05:20,991 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1687518.0, ans=0.015 2023-10-14 11:05:21,064 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1687518.0, ans=0.125 2023-10-14 11:05:37,060 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.613e+02 1.931e+02 2.079e+02 2.326e+02 2.972e+02, threshold=4.158e+02, percent-clipped=0.0 2023-10-14 11:05:47,154 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1687611.3333333333, ans=0.1 2023-10-14 11:06:13,867 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.34 vs. 
limit=12.0 2023-10-14 11:06:19,006 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1687704.6666666667, ans=0.0 2023-10-14 11:06:22,082 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1687704.6666666667, ans=0.1 2023-10-14 11:06:28,358 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1687704.6666666667, ans=0.1 2023-10-14 11:07:00,965 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1687844.6666666667, ans=0.125 2023-10-14 11:07:02,043 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1687844.6666666667, ans=0.0 2023-10-14 11:07:17,977 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1687891.3333333333, ans=0.0 2023-10-14 11:07:26,307 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1687938.0, ans=10.0 2023-10-14 11:07:38,723 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1687984.6666666667, ans=0.1 2023-10-14 11:07:55,631 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1688031.3333333333, ans=0.1 2023-10-14 11:07:58,042 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.486e+02 1.830e+02 1.997e+02 2.217e+02 3.260e+02, threshold=3.993e+02, percent-clipped=0.0 2023-10-14 11:08:08,932 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1688078.0, ans=0.125 2023-10-14 11:08:12,261 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.07 vs. limit=15.0 2023-10-14 11:08:15,604 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1688124.6666666667, ans=0.125 2023-10-14 11:08:37,297 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 11:08:39,108 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1688218.0, ans=0.0 2023-10-14 11:08:40,829 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1688218.0, ans=0.0 2023-10-14 11:08:58,389 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1688264.6666666667, ans=0.0 2023-10-14 11:09:15,307 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1688311.3333333333, ans=0.125 2023-10-14 11:09:25,779 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.31 vs. 
limit=15.0 2023-10-14 11:09:41,511 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1688404.6666666667, ans=0.0 2023-10-14 11:09:45,016 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1688404.6666666667, ans=0.125 2023-10-14 11:10:18,049 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.521e+02 1.742e+02 1.976e+02 2.183e+02 4.121e+02, threshold=3.952e+02, percent-clipped=1.0 2023-10-14 11:10:25,637 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1688544.6666666667, ans=0.125 2023-10-14 11:11:05,198 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.92 vs. limit=22.5 2023-10-14 11:11:09,411 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1688638.0, ans=0.0 2023-10-14 11:11:18,163 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 11:11:20,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1688684.6666666667, ans=0.0 2023-10-14 11:11:28,060 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1688684.6666666667, ans=0.0 2023-10-14 11:11:32,592 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.12 vs. limit=10.0 2023-10-14 11:12:01,651 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.76 vs. 
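The WithLoss lines track an auxiliary penalty attached to the attention-weight tensors of individual layers; loss-sum=0.000e+00 throughout this stretch means the attached penalty is contributing nothing at this stage of training. The wrapper below sketches only the logging pattern; the actual penalty and the way it reaches the backward pass are not visible in the log, so both the quadratic term and the scale are placeholders:

import torch
import torch.nn as nn

class WithLossSketch(nn.Module):
    """Attaches a (placeholder) penalty to an intermediate tensor and logs a
    running sum, mimicking 'WithLoss: name=..., loss-sum=...'."""

    def __init__(self, name: str, scale: float = 0.0):
        super().__init__()
        self.name, self.scale = name, scale
        self.loss_sum, self.aux_loss = 0.0, None

    def forward(self, attn_weights: torch.Tensor) -> torch.Tensor:
        self.aux_loss = None
        if self.training and self.scale > 0.0:
            self.aux_loss = self.scale * attn_weights.pow(2).mean()  # placeholder
            self.loss_sum += float(self.aux_loss.detach())
        # the trainer would add self.aux_loss (if set) to the total loss
        print(f"WithLoss: name={self.name}, loss-sum={self.loss_sum:.3e}")
        return attn_weights

w = WithLossSketch("encoder.encoders.1.encoder.layers.1.self_attn_weights")
w(torch.randn(4, 8, 16, 16))      # scale=0.0 -> prints loss-sum=0.000e+00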
limit=6.0 2023-10-14 11:12:18,920 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1688824.6666666667, ans=0.125 2023-10-14 11:12:28,181 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1688871.3333333333, ans=0.0 2023-10-14 11:12:58,150 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1688964.6666666667, ans=0.0 2023-10-14 11:13:02,032 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.798e+02 1.919e+02 2.145e+02 3.215e+02, threshold=3.838e+02, percent-clipped=0.0 2023-10-14 11:13:03,613 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1688964.6666666667, ans=0.0 2023-10-14 11:13:03,644 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1688964.6666666667, ans=0.0 2023-10-14 11:13:04,579 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1689011.3333333333, ans=0.125 2023-10-14 11:13:15,523 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1689011.3333333333, ans=0.0 2023-10-14 11:13:22,754 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1689058.0, ans=0.0 2023-10-14 11:13:45,085 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1689151.3333333333, ans=0.04949747468305833 2023-10-14 11:13:45,213 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.47 vs. limit=15.0 2023-10-14 11:13:51,607 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. limit=6.0 2023-10-14 11:14:11,848 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1689244.6666666667, ans=0.125 2023-10-14 11:14:27,272 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.50 vs. 
limit=22.5 2023-10-14 11:14:29,299 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1689291.3333333333, ans=0.0 2023-10-14 11:14:41,294 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1689338.0, ans=0.0 2023-10-14 11:14:47,093 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1689338.0, ans=0.0 2023-10-14 11:15:15,173 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.596e+02 1.885e+02 2.002e+02 2.279e+02 3.290e+02, threshold=4.004e+02, percent-clipped=0.0 2023-10-14 11:15:16,516 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1689431.3333333333, ans=0.2 2023-10-14 11:15:17,827 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1689478.0, ans=0.125 2023-10-14 11:15:18,602 INFO [train.py:1031] (0/4) Epoch 27, batch 7000, loss[loss=0.1902, simple_loss=0.2797, pruned_loss=0.05034, over 16824.00 frames. ], tot_loss[loss=0.1866, simple_loss=0.279, pruned_loss=0.04712, over 31843533.24 frames. ], batch size: 175, lr: 1.27e-03, grad_scale: 32.0 2023-10-14 11:15:24,643 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1689478.0, ans=0.125 2023-10-14 11:15:26,724 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1689478.0, ans=0.2 2023-10-14 11:15:26,814 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1689478.0, ans=0.125 2023-10-14 11:15:42,351 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1689524.6666666667, ans=0.2 2023-10-14 11:15:49,266 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 11:15:49,635 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1.whitening_limit, batch_count=1689571.3333333333, ans=10.0 2023-10-14 11:15:59,059 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.32 vs. limit=15.0 2023-10-14 11:16:06,957 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0 2023-10-14 11:16:15,864 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1689618.0, ans=0.125 2023-10-14 11:16:29,968 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1689664.6666666667, ans=0.0 2023-10-14 11:16:30,248 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.73 vs. 
limit=15.0 2023-10-14 11:16:35,163 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1689664.6666666667, ans=0.0 2023-10-14 11:17:01,697 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.74 vs. limit=15.0 2023-10-14 11:17:22,416 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1689804.6666666667, ans=0.125 2023-10-14 11:17:29,461 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.52 vs. limit=15.0 2023-10-14 11:17:40,001 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 11:17:58,918 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.542e+02 1.895e+02 2.043e+02 2.334e+02 3.118e+02, threshold=4.085e+02, percent-clipped=0.0 2023-10-14 11:18:17,041 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.81 vs. limit=15.0 2023-10-14 11:18:29,500 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1689991.3333333333, ans=0.125 2023-10-14 11:18:51,121 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1690084.6666666667, ans=0.1 2023-10-14 11:19:28,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1690178.0, ans=0.1 2023-10-14 11:19:32,980 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1690224.6666666667, ans=0.1 2023-10-14 11:19:49,754 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1690271.3333333333, ans=0.125 2023-10-14 11:19:54,221 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1690271.3333333333, ans=0.05 2023-10-14 11:20:15,446 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1690318.0, ans=0.125 2023-10-14 11:20:27,584 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.577e+02 1.810e+02 2.052e+02 2.313e+02 3.310e+02, threshold=4.104e+02, percent-clipped=0.0 2023-10-14 11:20:27,908 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1690364.6666666667, ans=0.125 2023-10-14 11:20:47,642 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1690411.3333333333, ans=0.1 2023-10-14 11:21:05,407 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.08 vs. 
limit=15.0 2023-10-14 11:21:12,424 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1690458.0, ans=0.2 2023-10-14 11:21:29,372 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1690504.6666666667, ans=0.2 2023-10-14 11:21:37,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1690551.3333333333, ans=0.0 2023-10-14 11:21:41,898 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1690551.3333333333, ans=0.0 2023-10-14 11:22:00,342 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1690598.0, ans=0.125 2023-10-14 11:22:01,314 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1690598.0, ans=0.125 2023-10-14 11:22:07,549 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1690598.0, ans=0.125 2023-10-14 11:23:30,684 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1690784.6666666667, ans=0.0 2023-10-14 11:23:48,653 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.523e+02 1.845e+02 2.084e+02 2.364e+02 3.269e+02, threshold=4.168e+02, percent-clipped=0.0 2023-10-14 11:23:58,830 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1690878.0, ans=0.0 2023-10-14 11:24:12,678 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1690924.6666666667, ans=0.0 2023-10-14 11:24:12,816 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1690924.6666666667, ans=0.0 2023-10-14 11:24:23,218 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 11:24:35,182 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1690971.3333333333, ans=0.125 2023-10-14 11:25:15,441 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1691064.6666666667, ans=0.125 2023-10-14 11:25:27,617 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.90 vs. 
limit=15.0 2023-10-14 11:25:42,325 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1691158.0, ans=0.125 2023-10-14 11:25:53,372 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1691158.0, ans=0.125 2023-10-14 11:25:53,475 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1691158.0, ans=0.07 2023-10-14 11:26:01,680 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1691204.6666666667, ans=0.125 2023-10-14 11:26:04,612 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1691204.6666666667, ans=0.0 2023-10-14 11:26:15,106 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1691204.6666666667, ans=0.0 2023-10-14 11:26:19,402 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1691204.6666666667, ans=0.1 2023-10-14 11:27:00,749 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.814e+02 2.058e+02 2.236e+02 3.070e+02, threshold=4.115e+02, percent-clipped=0.0 2023-10-14 11:27:01,305 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1691298.0, ans=0.0 2023-10-14 11:27:18,812 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.19 vs. limit=22.5 2023-10-14 11:27:33,347 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.60 vs. limit=12.0 2023-10-14 11:27:37,530 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1691391.3333333333, ans=0.125 2023-10-14 11:27:50,893 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 11:27:58,309 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 11:29:59,431 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1691718.0, ans=0.0 2023-10-14 11:30:00,724 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1691718.0, ans=0.125 2023-10-14 11:30:20,942 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.574e+02 1.863e+02 2.070e+02 2.325e+02 3.365e+02, threshold=4.140e+02, percent-clipped=0.0 2023-10-14 11:30:21,523 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1691764.6666666667, ans=0.2 2023-10-14 11:30:22,511 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1691764.6666666667, ans=0.0 2023-10-14 11:30:22,653 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1691764.6666666667, ans=0.0 2023-10-14 11:30:24,427 INFO [train.py:1031] (0/4) Epoch 27, batch 7500, loss[loss=0.1814, simple_loss=0.2704, pruned_loss=0.04618, over 15896.00 frames. 
], tot_loss[loss=0.1866, simple_loss=0.2788, pruned_loss=0.04714, over 32055622.06 frames. ], batch size: 43, lr: 1.27e-03, grad_scale: 32.0 2023-10-14 11:31:46,024 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1691998.0, ans=0.125 2023-10-14 11:31:53,409 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 11:31:57,861 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1692044.6666666667, ans=0.125 2023-10-14 11:31:59,435 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.26 vs. limit=15.0 2023-10-14 11:32:13,322 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1692044.6666666667, ans=0.125 2023-10-14 11:32:14,372 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1692091.3333333333, ans=0.125 2023-10-14 11:32:21,309 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1692091.3333333333, ans=0.0 2023-10-14 11:32:37,372 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.61 vs. limit=15.0 2023-10-14 11:33:07,814 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1692184.6666666667, ans=0.07 2023-10-14 11:33:15,933 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.11 vs. limit=22.5 2023-10-14 11:33:21,845 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.622e+02 1.902e+02 2.110e+02 2.353e+02 3.247e+02, threshold=4.221e+02, percent-clipped=0.0 2023-10-14 11:34:03,234 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1692371.3333333333, ans=0.125 2023-10-14 11:34:13,683 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.73 vs. limit=22.5 2023-10-14 11:34:16,067 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1692371.3333333333, ans=0.2 2023-10-14 11:34:38,038 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1692418.0, ans=0.2 2023-10-14 11:34:47,659 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1692464.6666666667, ans=6.0 2023-10-14 11:34:49,118 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.11 vs. limit=15.0 2023-10-14 11:35:09,994 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1692511.3333333333, ans=0.0 2023-10-14 11:35:24,908 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.70 vs. 
limit=15.0 2023-10-14 11:35:38,057 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1692558.0, ans=0.125 2023-10-14 11:35:44,432 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 11:35:44,548 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1692558.0, ans=0.0 2023-10-14 11:36:31,736 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.43 vs. limit=15.0 2023-10-14 11:36:53,900 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.830e+02 1.946e+02 2.162e+02 2.979e+02, threshold=3.891e+02, percent-clipped=0.0 2023-10-14 11:37:21,614 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1692791.3333333333, ans=0.125 2023-10-14 11:37:39,601 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1692838.0, ans=0.125 2023-10-14 11:37:46,579 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1692838.0, ans=0.125 2023-10-14 11:38:35,349 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1692931.3333333333, ans=0.2 2023-10-14 11:39:27,115 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1693071.3333333333, ans=6.0 2023-10-14 11:39:58,802 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1693118.0, ans=0.0 2023-10-14 11:40:09,745 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1693164.6666666667, ans=0.1 2023-10-14 11:40:17,297 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.536e+02 1.849e+02 1.982e+02 2.118e+02 2.687e+02, threshold=3.964e+02, percent-clipped=0.0 2023-10-14 11:40:19,611 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 11:40:24,598 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1693211.3333333333, ans=0.2 2023-10-14 11:40:27,645 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1693211.3333333333, ans=0.2 2023-10-14 11:40:30,933 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.35 vs. limit=22.5 2023-10-14 11:40:48,685 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1693258.0, ans=0.0 2023-10-14 11:41:06,156 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1693304.6666666667, ans=0.04949747468305833 2023-10-14 11:41:08,598 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.95 vs. 
limit=12.0 2023-10-14 11:41:12,504 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1693304.6666666667, ans=0.125 2023-10-14 11:42:06,917 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1693398.0, ans=0.0 2023-10-14 11:42:09,508 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_ff3.min_abs, batch_count=1693444.6666666667, ans=0.2 2023-10-14 11:42:09,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1693444.6666666667, ans=0.2 2023-10-14 11:43:07,575 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1693538.0, ans=0.1 2023-10-14 11:43:37,773 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=6.87 vs. limit=15.0 2023-10-14 11:44:14,133 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1693631.3333333333, ans=0.125 2023-10-14 11:44:15,508 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.563e+02 1.789e+02 1.962e+02 2.292e+02 3.407e+02, threshold=3.925e+02, percent-clipped=0.0 2023-10-14 11:44:38,419 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1693678.0, ans=0.125 2023-10-14 11:45:12,236 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1693724.6666666667, ans=0.125 2023-10-14 11:45:18,604 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1693771.3333333333, ans=0.1 2023-10-14 11:45:37,431 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.28 vs. limit=6.0 2023-10-14 11:45:56,177 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.72 vs. limit=15.0 2023-10-14 11:46:03,749 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.63 vs. limit=22.5 2023-10-14 11:46:19,841 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1693864.6666666667, ans=0.2 2023-10-14 11:47:16,400 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1693958.0, ans=0.125 2023-10-14 11:47:19,935 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1693958.0, ans=0.0 2023-10-14 11:47:47,491 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1694051.3333333333, ans=0.1 2023-10-14 11:47:57,763 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.80 vs. 
limit=15.0 2023-10-14 11:48:06,223 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1694051.3333333333, ans=0.125 2023-10-14 11:48:11,073 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1694051.3333333333, ans=0.125 2023-10-14 11:48:26,363 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.481e+02 1.726e+02 1.863e+02 2.132e+02 2.948e+02, threshold=3.725e+02, percent-clipped=0.0 2023-10-14 11:48:30,866 INFO [train.py:1031] (0/4) Epoch 27, batch 8000, loss[loss=0.2127, simple_loss=0.2868, pruned_loss=0.06925, over 15702.00 frames. ], tot_loss[loss=0.1859, simple_loss=0.2783, pruned_loss=0.04673, over 32230509.95 frames. ], batch size: 350, lr: 1.27e-03, grad_scale: 32.0 2023-10-14 11:48:54,446 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1694191.3333333333, ans=0.125 2023-10-14 11:49:16,651 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1694284.6666666667, ans=0.015 2023-10-14 11:49:27,375 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-10-14 11:49:44,413 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1694378.0, ans=0.2 2023-10-14 11:49:47,232 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1694378.0, ans=0.1 2023-10-14 11:49:48,157 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1694378.0, ans=0.125 2023-10-14 11:50:10,250 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1694471.3333333333, ans=0.0 2023-10-14 11:50:12,659 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.01 vs. limit=15.0 2023-10-14 11:50:15,446 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1694518.0, ans=0.2 2023-10-14 11:50:23,680 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.48 vs. limit=15.0 2023-10-14 11:50:29,164 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.68 vs. 
limit=15.0 2023-10-14 11:50:30,680 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1694564.6666666667, ans=0.125 2023-10-14 11:50:33,277 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1694564.6666666667, ans=0.0 2023-10-14 11:50:33,383 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1694564.6666666667, ans=0.1 2023-10-14 11:50:36,931 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.355e+02 1.732e+02 1.871e+02 2.069e+02 2.971e+02, threshold=3.742e+02, percent-clipped=0.0 2023-10-14 11:50:39,297 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1694611.3333333333, ans=0.1 2023-10-14 11:50:51,527 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1694658.0, ans=0.1 2023-10-14 11:51:25,123 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1694751.3333333333, ans=0.125 2023-10-14 11:52:04,971 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1694891.3333333333, ans=0.125 2023-10-14 11:52:36,609 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1694938.0, ans=0.2 2023-10-14 11:52:44,574 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.05 vs. limit=10.0 2023-10-14 11:52:47,360 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1694984.6666666667, ans=0.125 2023-10-14 11:52:59,507 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1695031.3333333333, ans=0.0 2023-10-14 11:53:07,253 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1695078.0, ans=0.125 2023-10-14 11:53:08,654 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.579e+02 1.798e+02 2.003e+02 2.195e+02 2.903e+02, threshold=4.007e+02, percent-clipped=0.0 2023-10-14 11:53:09,184 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1695078.0, ans=0.125 2023-10-14 11:53:09,263 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=22.5 2023-10-14 11:53:39,650 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1695171.3333333333, ans=0.2 2023-10-14 11:53:45,141 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1695218.0, ans=0.1 2023-10-14 11:53:48,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1695218.0, ans=0.0 2023-10-14 11:53:56,499 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.34 vs. 
limit=15.0 2023-10-14 11:53:57,324 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1695264.6666666667, ans=0.2 2023-10-14 11:54:07,671 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1695264.6666666667, ans=0.125 2023-10-14 11:54:21,035 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1695311.3333333333, ans=0.0 2023-10-14 11:54:42,860 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 11:55:00,737 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1695498.0, ans=0.09899494936611666 2023-10-14 11:55:13,781 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1695498.0, ans=0.1 2023-10-14 11:55:16,315 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.451e+02 1.763e+02 1.934e+02 2.175e+02 3.068e+02, threshold=3.868e+02, percent-clipped=0.0 2023-10-14 11:55:24,259 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1695544.6666666667, ans=0.125 2023-10-14 11:55:29,139 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1695591.3333333333, ans=0.0 2023-10-14 11:55:37,220 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1695591.3333333333, ans=0.04949747468305833 2023-10-14 11:55:43,441 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1695638.0, ans=0.0 2023-10-14 11:55:59,634 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1695684.6666666667, ans=0.125 2023-10-14 11:56:46,509 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.11 vs. 
limit=12.0 2023-10-14 11:56:49,904 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1695871.3333333333, ans=0.0 2023-10-14 11:56:49,904 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 11:57:05,226 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1695918.0, ans=0.1 2023-10-14 11:57:24,180 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1695964.6666666667, ans=0.0 2023-10-14 11:57:26,310 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.543e+02 1.845e+02 1.970e+02 2.193e+02 2.917e+02, threshold=3.940e+02, percent-clipped=0.0 2023-10-14 11:57:34,906 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1696011.3333333333, ans=0.07 2023-10-14 11:57:43,352 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1696058.0, ans=0.125 2023-10-14 11:57:46,242 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1696058.0, ans=0.125 2023-10-14 11:57:50,437 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1696058.0, ans=0.2 2023-10-14 11:58:14,232 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1696151.3333333333, ans=0.07 2023-10-14 11:58:27,719 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1696198.0, ans=0.2 2023-10-14 11:58:32,078 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1696198.0, ans=0.0 2023-10-14 11:58:47,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=1696244.6666666667, ans=10.0 2023-10-14 11:58:47,742 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.51 vs. limit=15.0 2023-10-14 11:59:03,224 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1696338.0, ans=0.0 2023-10-14 11:59:11,641 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.63 vs. limit=15.0 2023-10-14 11:59:42,445 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1696431.3333333333, ans=0.125 2023-10-14 11:59:46,007 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 11:59:47,896 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.583e+02 1.842e+02 2.027e+02 2.240e+02 4.078e+02, threshold=4.054e+02, percent-clipped=1.0 2023-10-14 11:59:47,928 INFO [train.py:1031] (0/4) Epoch 27, batch 8500, loss[loss=0.1925, simple_loss=0.2816, pruned_loss=0.05168, over 15226.00 frames. ], tot_loss[loss=0.1857, simple_loss=0.2785, pruned_loss=0.04649, over 32369365.53 frames. 
], batch size: 35, lr: 1.27e-03, grad_scale: 16.0 2023-10-14 12:00:06,521 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.93 vs. limit=15.0 2023-10-14 12:00:08,886 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.55 vs. limit=15.0 2023-10-14 12:00:25,831 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1696571.3333333333, ans=0.125 2023-10-14 12:00:34,378 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.64 vs. limit=15.0 2023-10-14 12:00:53,233 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1696711.3333333333, ans=0.125 2023-10-14 12:01:17,749 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.55 vs. limit=12.0 2023-10-14 12:01:31,325 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.41 vs. limit=15.0 2023-10-14 12:01:34,090 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1696851.3333333333, ans=0.0 2023-10-14 12:01:37,182 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1696851.3333333333, ans=0.2 2023-10-14 12:01:42,441 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1696851.3333333333, ans=0.0 2023-10-14 12:01:47,548 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1696898.0, ans=0.125 2023-10-14 12:01:49,222 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1696898.0, ans=0.0 2023-10-14 12:01:51,425 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1696898.0, ans=0.125 2023-10-14 12:02:03,985 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.543e+02 1.895e+02 2.088e+02 2.354e+02 3.023e+02, threshold=4.177e+02, percent-clipped=0.0 2023-10-14 12:02:18,443 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1696991.3333333333, ans=0.2 2023-10-14 12:02:24,452 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1696991.3333333333, ans=0.0 2023-10-14 12:02:24,573 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1696991.3333333333, ans=0.1 2023-10-14 12:02:25,791 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1696991.3333333333, ans=0.0 2023-10-14 12:02:29,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1696991.3333333333, ans=0.125 2023-10-14 12:02:49,022 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1697084.6666666667, ans=0.04949747468305833 2023-10-14 12:03:19,179 INFO 
[scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.53 vs. limit=8.0 2023-10-14 12:03:35,204 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.63 vs. limit=22.5 2023-10-14 12:04:21,437 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1697364.6666666667, ans=0.125 2023-10-14 12:04:35,365 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.455e+02 1.754e+02 1.904e+02 2.182e+02 3.245e+02, threshold=3.809e+02, percent-clipped=0.0 2023-10-14 12:04:48,072 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1697411.3333333333, ans=0.125 2023-10-14 12:04:50,402 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1697458.0, ans=0.125 2023-10-14 12:05:29,775 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1697551.3333333333, ans=0.0 2023-10-14 12:05:29,831 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1697551.3333333333, ans=0.125 2023-10-14 12:05:38,824 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.55 vs. limit=12.0 2023-10-14 12:06:26,595 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.59 vs. limit=22.5 2023-10-14 12:06:27,995 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1697691.3333333333, ans=0.1 2023-10-14 12:06:29,271 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1697691.3333333333, ans=0.09899494936611666 2023-10-14 12:06:30,639 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.41 vs. limit=12.0 2023-10-14 12:06:55,528 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=1697784.6666666667, ans=0.025 2023-10-14 12:07:08,254 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1697831.3333333333, ans=0.0 2023-10-14 12:07:21,985 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.474e+02 1.726e+02 1.883e+02 2.126e+02 2.624e+02, threshold=3.766e+02, percent-clipped=0.0 2023-10-14 12:07:27,652 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.04 vs. 
limit=15.0 2023-10-14 12:07:42,985 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1697924.6666666667, ans=0.2 2023-10-14 12:07:45,583 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1697924.6666666667, ans=0.2 2023-10-14 12:08:26,672 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1698064.6666666667, ans=0.09899494936611666 2023-10-14 12:08:46,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1698111.3333333333, ans=0.0 2023-10-14 12:08:47,272 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1698111.3333333333, ans=0.125 2023-10-14 12:09:02,287 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1698158.0, ans=0.035 2023-10-14 12:09:10,740 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1698158.0, ans=0.2 2023-10-14 12:09:15,220 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1698204.6666666667, ans=0.1 2023-10-14 12:09:21,965 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1698204.6666666667, ans=0.2 2023-10-14 12:09:27,353 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1698204.6666666667, ans=0.0 2023-10-14 12:09:35,809 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1698251.3333333333, ans=0.0 2023-10-14 12:09:36,877 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1698251.3333333333, ans=0.2 2023-10-14 12:09:36,889 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1698251.3333333333, ans=0.0 2023-10-14 12:09:49,827 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1698298.0, ans=0.125 2023-10-14 12:10:06,776 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.526e+02 1.793e+02 1.960e+02 2.156e+02 3.979e+02, threshold=3.920e+02, percent-clipped=1.0 2023-10-14 12:10:08,511 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1698344.6666666667, ans=0.125 2023-10-14 12:10:13,811 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=1698344.6666666667, ans=0.05 2023-10-14 12:10:24,157 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.40 vs. 
limit=15.0 2023-10-14 12:11:13,658 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1698531.3333333333, ans=0.0 2023-10-14 12:11:51,122 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1698624.6666666667, ans=0.0 2023-10-14 12:12:26,140 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1698718.0, ans=0.125 2023-10-14 12:12:42,627 INFO [train.py:1031] (0/4) Epoch 27, batch 9000, loss[loss=0.182, simple_loss=0.2748, pruned_loss=0.04457, over 16540.00 frames. ], tot_loss[loss=0.1852, simple_loss=0.2778, pruned_loss=0.04626, over 32482701.03 frames. ], batch size: 66, lr: 1.27e-03, grad_scale: 16.0 2023-10-14 12:12:43,700 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.594e+02 1.788e+02 1.991e+02 2.198e+02 3.303e+02, threshold=3.982e+02, percent-clipped=0.0 2023-10-14 12:13:01,919 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1698858.0, ans=0.0 2023-10-14 12:13:10,645 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1698858.0, ans=0.125 2023-10-14 12:13:17,641 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1698904.6666666667, ans=6.0 2023-10-14 12:13:20,044 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1698904.6666666667, ans=0.0 2023-10-14 12:13:38,156 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1698951.3333333333, ans=0.125 2023-10-14 12:14:18,294 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1699091.3333333333, ans=0.0 2023-10-14 12:14:23,312 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1699091.3333333333, ans=0.2 2023-10-14 12:14:44,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1699138.0, ans=0.125 2023-10-14 12:14:49,001 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1699184.6666666667, ans=0.125 2023-10-14 12:15:05,226 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.58 vs. limit=22.5 2023-10-14 12:15:17,899 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.510e+02 1.788e+02 1.911e+02 2.190e+02 2.862e+02, threshold=3.823e+02, percent-clipped=0.0 2023-10-14 12:15:18,186 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1699278.0, ans=0.04949747468305833 2023-10-14 12:15:43,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1699324.6666666667, ans=0.125 2023-10-14 12:16:09,682 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.99 vs. 
limit=15.0 2023-10-14 12:16:27,301 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1699464.6666666667, ans=0.0 2023-10-14 12:16:34,978 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.93 vs. limit=15.0 2023-10-14 12:16:41,321 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=1699511.3333333333, ans=0.5 2023-10-14 12:17:22,492 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1699604.6666666667, ans=0.0 2023-10-14 12:17:45,801 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1699651.3333333333, ans=0.125 2023-10-14 12:17:59,826 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.59 vs. limit=15.0 2023-10-14 12:18:14,495 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.520e+02 1.855e+02 2.006e+02 2.321e+02 3.044e+02, threshold=4.012e+02, percent-clipped=0.0 2023-10-14 12:18:38,054 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1699791.3333333333, ans=0.1 2023-10-14 12:18:48,331 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1699838.0, ans=0.0 2023-10-14 12:18:53,359 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1699838.0, ans=0.0 2023-10-14 12:19:00,777 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1699884.6666666667, ans=0.0 2023-10-14 12:19:08,263 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-10-14 12:19:20,271 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.01 vs. limit=12.0 2023-10-14 12:19:22,139 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.24 vs. limit=22.5 2023-10-14 12:19:28,841 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1699931.3333333333, ans=10.0 2023-10-14 12:19:53,254 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.22 vs. limit=15.0 2023-10-14 12:19:54,311 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1700024.6666666667, ans=0.2 2023-10-14 12:20:11,838 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.99 vs. 
limit=22.5 2023-10-14 12:20:16,796 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1700071.3333333333, ans=0.1 2023-10-14 12:20:18,048 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1700071.3333333333, ans=0.0 2023-10-14 12:20:30,321 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.57 vs. limit=5.0 2023-10-14 12:21:03,228 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.867e+02 2.001e+02 2.359e+02 3.015e+02, threshold=4.002e+02, percent-clipped=0.0 2023-10-14 12:21:13,326 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1700211.3333333333, ans=0.125 2023-10-14 12:21:25,250 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1700258.0, ans=0.09899494936611666 2023-10-14 12:21:26,565 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1700258.0, ans=0.2 2023-10-14 12:21:39,419 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.87 vs. limit=12.0 2023-10-14 12:22:11,219 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.51 vs. limit=15.0 2023-10-14 12:22:43,728 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.54 vs. limit=22.5 2023-10-14 12:22:58,698 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.61 vs. limit=15.0 2023-10-14 12:23:18,035 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1700538.0, ans=0.125 2023-10-14 12:23:30,412 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1700538.0, ans=0.1 2023-10-14 12:23:44,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1700584.6666666667, ans=0.05 2023-10-14 12:24:00,642 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.38 vs. 
limit=10.0 2023-10-14 12:24:19,250 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.522e+02 1.862e+02 2.114e+02 2.421e+02 3.565e+02, threshold=4.228e+02, percent-clipped=0.0 2023-10-14 12:24:28,974 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1700678.0, ans=0.125 2023-10-14 12:24:36,542 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1700724.6666666667, ans=0.1 2023-10-14 12:24:57,874 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=1700771.3333333333, ans=0.05 2023-10-14 12:25:22,466 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1700818.0, ans=0.1 2023-10-14 12:25:25,238 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1700818.0, ans=0.0 2023-10-14 12:25:29,350 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1700818.0, ans=0.125 2023-10-14 12:25:35,592 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1700864.6666666667, ans=0.125 2023-10-14 12:25:40,473 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1700864.6666666667, ans=0.1 2023-10-14 12:25:56,770 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1700911.3333333333, ans=0.0 2023-10-14 12:26:43,062 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1701004.6666666667, ans=0.0 2023-10-14 12:27:05,330 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1701098.0, ans=0.2 2023-10-14 12:27:20,219 INFO [train.py:1031] (0/4) Epoch 27, batch 9500, loss[loss=0.1703, simple_loss=0.2689, pruned_loss=0.03582, over 16906.00 frames. ], tot_loss[loss=0.1858, simple_loss=0.2785, pruned_loss=0.0466, over 32552303.26 frames. 
], batch size: 77, lr: 1.27e-03, grad_scale: 16.0 2023-10-14 12:27:25,269 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.553e+02 1.860e+02 2.035e+02 2.255e+02 3.674e+02, threshold=4.071e+02, percent-clipped=0.0 2023-10-14 12:27:51,215 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1701191.3333333333, ans=0.0 2023-10-14 12:27:51,291 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1701191.3333333333, ans=0.1 2023-10-14 12:28:11,843 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1701238.0, ans=0.0 2023-10-14 12:28:31,212 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1701284.6666666667, ans=0.1 2023-10-14 12:29:08,971 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1701378.0, ans=0.125 2023-10-14 12:29:48,911 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1701424.6666666667, ans=0.1 2023-10-14 12:30:21,451 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1701518.0, ans=0.95 2023-10-14 12:30:37,739 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1701564.6666666667, ans=0.125 2023-10-14 12:30:37,817 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1701564.6666666667, ans=0.0 2023-10-14 12:31:04,138 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.606e+02 1.878e+02 2.031e+02 2.227e+02 3.316e+02, threshold=4.061e+02, percent-clipped=0.0 2023-10-14 12:31:18,894 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1701611.3333333333, ans=0.125 2023-10-14 12:31:43,460 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 12:32:20,421 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1701751.3333333333, ans=0.1 2023-10-14 12:32:26,398 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1701751.3333333333, ans=0.1 2023-10-14 12:32:26,470 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1701751.3333333333, ans=0.1 2023-10-14 12:32:31,419 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1701751.3333333333, ans=0.0 2023-10-14 12:33:43,472 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.51 vs. limit=15.0 2023-10-14 12:34:10,249 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.89 vs. 
limit=15.0 2023-10-14 12:34:10,364 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.97 vs. limit=22.5 2023-10-14 12:34:12,676 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1701938.0, ans=0.125 2023-10-14 12:34:43,767 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1701984.6666666667, ans=0.125 2023-10-14 12:34:51,989 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1702031.3333333333, ans=0.0 2023-10-14 12:34:57,235 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1702031.3333333333, ans=0.125 2023-10-14 12:35:13,307 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.817e+02 1.988e+02 2.380e+02 4.107e+02, threshold=3.977e+02, percent-clipped=1.0 2023-10-14 12:35:30,506 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.60 vs. limit=6.0 2023-10-14 12:35:32,286 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1702124.6666666667, ans=0.2 2023-10-14 12:35:36,160 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1702124.6666666667, ans=0.125 2023-10-14 12:35:40,807 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1702171.3333333333, ans=0.125 2023-10-14 12:35:43,128 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 12:36:02,272 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1702218.0, ans=0.1 2023-10-14 12:36:17,360 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten.whitening_limit, batch_count=1702264.6666666667, ans=15.0 2023-10-14 12:36:22,318 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.85 vs. limit=15.0 2023-10-14 12:36:24,275 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.19 vs. 
limit=12.0 2023-10-14 12:36:31,493 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1702311.3333333333, ans=0.0 2023-10-14 12:36:32,733 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1702358.0, ans=0.0 2023-10-14 12:36:38,993 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1702358.0, ans=0.0 2023-10-14 12:36:45,224 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1702404.6666666667, ans=0.0 2023-10-14 12:36:53,962 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1702404.6666666667, ans=0.125 2023-10-14 12:37:12,822 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.96 vs. limit=15.0 2023-10-14 12:37:26,352 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.593e+02 1.860e+02 2.027e+02 2.365e+02 3.468e+02, threshold=4.054e+02, percent-clipped=0.0 2023-10-14 12:37:29,605 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.89 vs. limit=22.5 2023-10-14 12:37:31,440 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1702544.6666666667, ans=0.125 2023-10-14 12:37:49,256 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1702638.0, ans=0.015 2023-10-14 12:38:02,904 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1702684.6666666667, ans=0.0 2023-10-14 12:38:03,954 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1702684.6666666667, ans=0.125 2023-10-14 12:38:08,079 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1702684.6666666667, ans=0.0 2023-10-14 12:38:18,727 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1702731.3333333333, ans=0.125 2023-10-14 12:38:46,758 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1702871.3333333333, ans=0.125 2023-10-14 12:39:03,555 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1702918.0, ans=0.125 2023-10-14 12:39:26,096 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1703011.3333333333, ans=0.1 2023-10-14 12:39:26,782 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.500e+02 1.758e+02 1.934e+02 2.156e+02 2.765e+02, threshold=3.868e+02, percent-clipped=0.0 2023-10-14 12:39:31,203 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1703011.3333333333, ans=0.0 2023-10-14 12:39:53,064 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1703104.6666666667, ans=0.125 2023-10-14 12:39:55,980 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1703104.6666666667, ans=0.025 2023-10-14 12:39:59,211 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1703104.6666666667, ans=0.125 2023-10-14 12:40:00,412 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.19 vs. limit=15.0 2023-10-14 12:40:04,005 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1703151.3333333333, ans=0.125 2023-10-14 12:40:16,389 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.76 vs. limit=15.0 2023-10-14 12:40:21,512 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1703198.0, ans=0.0 2023-10-14 12:40:23,568 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1703244.6666666667, ans=0.07 2023-10-14 12:40:30,964 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1703244.6666666667, ans=0.125 2023-10-14 12:40:47,587 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1703338.0, ans=0.125 2023-10-14 12:41:18,674 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1703431.3333333333, ans=0.125 2023-10-14 12:41:20,751 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.61 vs. limit=15.0 2023-10-14 12:41:21,821 INFO [train.py:1031] (0/4) Epoch 27, batch 10000, loss[loss=0.2177, simple_loss=0.2895, pruned_loss=0.073, over 15602.00 frames. ], tot_loss[loss=0.1852, simple_loss=0.2776, pruned_loss=0.04635, over 32590359.52 frames. ], batch size: 350, lr: 1.26e-03, grad_scale: 32.0 2023-10-14 12:41:23,692 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.464e+02 1.778e+02 1.981e+02 2.187e+02 3.000e+02, threshold=3.961e+02, percent-clipped=0.0 2023-10-14 12:41:26,539 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1703478.0, ans=0.125 2023-10-14 12:41:27,614 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1703478.0, ans=0.0 2023-10-14 12:41:44,875 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.77 vs. 
limit=15.0 2023-10-14 12:41:46,459 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1703571.3333333333, ans=0.125 2023-10-14 12:41:54,177 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.571e-03 2023-10-14 12:42:08,560 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1703664.6666666667, ans=0.125 2023-10-14 12:42:25,238 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1703711.3333333333, ans=0.1 2023-10-14 12:42:28,233 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1703758.0, ans=0.1 2023-10-14 12:42:35,368 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1703758.0, ans=0.0 2023-10-14 12:42:59,747 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1703851.3333333333, ans=0.0 2023-10-14 12:43:00,037 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.62 vs. limit=15.0 2023-10-14 12:43:00,677 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1703851.3333333333, ans=0.125 2023-10-14 12:43:24,558 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.534e+02 1.864e+02 2.090e+02 2.266e+02 2.870e+02, threshold=4.181e+02, percent-clipped=0.0 2023-10-14 12:43:29,624 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1703944.6666666667, ans=0.0 2023-10-14 12:43:43,546 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.28 vs. limit=10.0 2023-10-14 12:43:48,493 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1704038.0, ans=0.125 2023-10-14 12:44:24,475 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1704178.0, ans=0.125 2023-10-14 12:44:48,651 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1704271.3333333333, ans=0.125 2023-10-14 12:45:11,046 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1704364.6666666667, ans=0.1 2023-10-14 12:45:19,318 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 12:45:22,231 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1704411.3333333333, ans=0.125 2023-10-14 12:45:25,145 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.558e+02 1.816e+02 2.032e+02 2.324e+02 3.198e+02, threshold=4.064e+02, percent-clipped=0.0 2023-10-14 12:45:26,448 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.71 vs. 
limit=10.0 2023-10-14 12:45:45,750 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1704458.0, ans=0.2 2023-10-14 12:45:49,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1704504.6666666667, ans=0.125 2023-10-14 12:45:55,891 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 12:47:04,007 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.61 vs. limit=22.5 2023-10-14 12:47:21,522 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1704831.3333333333, ans=0.05 2023-10-14 12:47:29,628 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1704878.0, ans=0.125 2023-10-14 12:47:33,539 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.567e+02 1.857e+02 2.010e+02 2.194e+02 3.364e+02, threshold=4.019e+02, percent-clipped=0.0 2023-10-14 12:47:35,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1704878.0, ans=0.0 2023-10-14 12:47:36,872 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1704878.0, ans=0.0 2023-10-14 12:47:57,714 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1704971.3333333333, ans=0.125 2023-10-14 12:48:02,009 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1704971.3333333333, ans=0.1 2023-10-14 12:48:35,246 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=4.02 vs. limit=12.0 2023-10-14 12:48:36,725 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.53 vs. 
limit=22.5 2023-10-14 12:48:57,273 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1705204.6666666667, ans=0.125 2023-10-14 12:48:57,492 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1705204.6666666667, ans=0.0 2023-10-14 12:49:07,134 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1705204.6666666667, ans=0.125 2023-10-14 12:49:23,738 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1705298.0, ans=0.04949747468305833 2023-10-14 12:49:39,857 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1705344.6666666667, ans=0.125 2023-10-14 12:49:42,247 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.507e+02 1.761e+02 1.907e+02 2.084e+02 2.890e+02, threshold=3.813e+02, percent-clipped=0.0 2023-10-14 12:49:43,751 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1705344.6666666667, ans=0.125 2023-10-14 12:50:03,361 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-14 12:50:10,407 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1705438.0, ans=0.2 2023-10-14 12:50:44,423 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1705578.0, ans=0.0 2023-10-14 12:50:45,557 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1705578.0, ans=0.0 2023-10-14 12:50:45,692 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1705578.0, ans=0.2 2023-10-14 12:50:46,522 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1705578.0, ans=0.125 2023-10-14 12:50:52,109 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.87 vs. limit=10.0 2023-10-14 12:50:54,155 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 12:50:55,049 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1705624.6666666667, ans=0.125 2023-10-14 12:51:02,108 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.80 vs. 
limit=22.5 2023-10-14 12:51:21,546 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1705718.0, ans=0.125 2023-10-14 12:51:21,579 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1705718.0, ans=0.125 2023-10-14 12:51:34,738 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1705764.6666666667, ans=0.1 2023-10-14 12:51:37,203 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1705764.6666666667, ans=0.0 2023-10-14 12:51:44,448 INFO [train.py:1031] (0/4) Epoch 27, batch 10500, loss[loss=0.204, simple_loss=0.2911, pruned_loss=0.05847, over 16490.00 frames. ], tot_loss[loss=0.1854, simple_loss=0.278, pruned_loss=0.04637, over 32641228.24 frames. ], batch size: 266, lr: 1.26e-03, grad_scale: 32.0 2023-10-14 12:51:47,164 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.559e+02 1.918e+02 2.127e+02 2.391e+02 3.501e+02, threshold=4.254e+02, percent-clipped=0.0 2023-10-14 12:52:05,784 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1705904.6666666667, ans=0.1 2023-10-14 12:52:10,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1705904.6666666667, ans=0.1 2023-10-14 12:52:16,504 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.25 vs. limit=10.0 2023-10-14 12:52:19,110 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.21 vs. limit=15.0 2023-10-14 12:52:25,264 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1705951.3333333333, ans=0.0 2023-10-14 12:52:30,622 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=5.46 vs. limit=12.0 2023-10-14 12:52:34,233 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.03 vs. limit=15.0 2023-10-14 12:52:36,085 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1705998.0, ans=0.5 2023-10-14 12:52:37,982 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1706044.6666666667, ans=0.125 2023-10-14 12:52:44,702 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1706044.6666666667, ans=0.125 2023-10-14 12:52:44,922 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.60 vs. 
limit=22.5 2023-10-14 12:53:54,112 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.581e+02 1.841e+02 1.993e+02 2.231e+02 3.025e+02, threshold=3.987e+02, percent-clipped=0.0 2023-10-14 12:53:57,604 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1706278.0, ans=0.125 2023-10-14 12:54:00,382 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1706278.0, ans=0.2 2023-10-14 12:54:06,905 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1706324.6666666667, ans=0.1 2023-10-14 12:54:11,209 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1706324.6666666667, ans=0.125 2023-10-14 12:54:23,634 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1706371.3333333333, ans=0.2 2023-10-14 12:54:27,871 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1706418.0, ans=0.2 2023-10-14 12:54:34,314 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.83 vs. limit=15.0 2023-10-14 12:54:36,054 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1706418.0, ans=0.2 2023-10-14 12:54:45,461 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1706464.6666666667, ans=0.125 2023-10-14 12:54:46,904 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.60 vs. limit=15.0 2023-10-14 12:54:49,989 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1706464.6666666667, ans=0.0 2023-10-14 12:55:10,377 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1706558.0, ans=0.125 2023-10-14 12:55:15,711 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1706558.0, ans=0.0 2023-10-14 12:55:21,487 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1706604.6666666667, ans=0.125 2023-10-14 12:55:21,673 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.12 vs. 
limit=22.5 2023-10-14 12:55:25,040 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1706604.6666666667, ans=0.0 2023-10-14 12:55:26,810 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1706604.6666666667, ans=0.2 2023-10-14 12:55:28,474 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1706604.6666666667, ans=0.125 2023-10-14 12:55:52,092 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1706698.0, ans=0.125 2023-10-14 12:56:01,690 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1706744.6666666667, ans=0.125 2023-10-14 12:56:01,926 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.63 vs. limit=15.0 2023-10-14 12:56:03,307 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.579e+02 1.840e+02 1.999e+02 2.244e+02 3.326e+02, threshold=3.999e+02, percent-clipped=0.0 2023-10-14 12:56:03,664 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1706744.6666666667, ans=0.0 2023-10-14 12:56:30,682 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1706838.0, ans=0.2 2023-10-14 12:57:00,443 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1706978.0, ans=0.1 2023-10-14 12:57:41,764 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1707118.0, ans=0.2 2023-10-14 12:58:09,120 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.565e+02 1.960e+02 2.216e+02 2.530e+02 3.662e+02, threshold=4.433e+02, percent-clipped=0.0 2023-10-14 12:58:24,478 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1707258.0, ans=0.125 2023-10-14 12:58:39,089 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1707351.3333333333, ans=0.09899494936611666 2023-10-14 12:58:42,089 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.84 vs. limit=15.0 2023-10-14 12:58:59,630 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1707398.0, ans=0.0 2023-10-14 12:59:17,287 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1707491.3333333333, ans=0.125 2023-10-14 12:59:23,614 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.57 vs. 
limit=6.0 2023-10-14 12:59:33,128 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1707538.0, ans=0.125 2023-10-14 13:00:02,793 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1707631.3333333333, ans=0.125 2023-10-14 13:00:05,495 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1707678.0, ans=0.0 2023-10-14 13:00:09,348 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.486e+02 1.765e+02 1.913e+02 2.138e+02 2.974e+02, threshold=3.827e+02, percent-clipped=0.0 2023-10-14 13:00:15,671 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 13:00:17,597 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1707724.6666666667, ans=0.125 2023-10-14 13:00:23,337 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1707724.6666666667, ans=0.125 2023-10-14 13:01:04,639 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=11.81 vs. limit=12.0 2023-10-14 13:01:13,175 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.48 vs. limit=22.5 2023-10-14 13:01:19,140 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.09 vs. limit=15.0 2023-10-14 13:01:20,607 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1707958.0, ans=0.1 2023-10-14 13:01:27,540 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1708004.6666666667, ans=0.125 2023-10-14 13:01:55,788 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1708098.0, ans=0.2 2023-10-14 13:01:58,940 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1708098.0, ans=0.1 2023-10-14 13:02:04,510 INFO [train.py:1031] (0/4) Epoch 27, batch 11000, loss[loss=0.2008, simple_loss=0.2956, pruned_loss=0.05297, over 16599.00 frames. ], tot_loss[loss=0.1856, simple_loss=0.2781, pruned_loss=0.04654, over 32662744.56 frames. ], batch size: 219, lr: 1.26e-03, grad_scale: 16.0 2023-10-14 13:02:10,314 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.607e+02 1.880e+02 1.989e+02 2.187e+02 3.195e+02, threshold=3.977e+02, percent-clipped=0.0 2023-10-14 13:02:14,883 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.87 vs. 
limit=15.0 2023-10-14 13:02:40,546 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1708284.6666666667, ans=0.0 2023-10-14 13:02:43,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1708284.6666666667, ans=0.125 2023-10-14 13:02:53,194 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1708331.3333333333, ans=0.025 2023-10-14 13:02:53,222 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1708331.3333333333, ans=0.125 2023-10-14 13:02:54,204 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 13:03:12,681 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1708378.0, ans=0.0 2023-10-14 13:03:27,001 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1708424.6666666667, ans=0.125 2023-10-14 13:04:00,625 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=12.44 vs. limit=15.0 2023-10-14 13:04:13,159 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1708611.3333333333, ans=0.2 2023-10-14 13:04:13,531 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.66 vs. limit=15.0 2023-10-14 13:04:18,424 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.573e+02 1.899e+02 2.032e+02 2.234e+02 3.272e+02, threshold=4.064e+02, percent-clipped=0.0 2023-10-14 13:04:36,581 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1708658.0, ans=0.0 2023-10-14 13:04:36,866 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.58 vs. limit=15.0 2023-10-14 13:05:08,148 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1708798.0, ans=0.0 2023-10-14 13:05:15,806 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1708798.0, ans=0.125 2023-10-14 13:05:41,615 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1708891.3333333333, ans=0.125 2023-10-14 13:05:44,663 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1708891.3333333333, ans=0.0 2023-10-14 13:06:02,048 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.29 vs. 
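The ScheduledFloat entries that dominate this stretch of the log record hyperparameters (skip rates, dropout probabilities, balancer probs) whose value is annealed as a function of batch_count; `ans` is the value in effect at that batch. A minimal sketch of such a piecewise-linear schedule follows; the class name and the breakpoints are illustrative assumptions, not the exact scaling.py implementation:

    import bisect

    class PiecewiseLinearSchedule:
        """Anneal a float hyperparameter between (batch_count, value)
        breakpoints; held constant outside the first/last breakpoint.
        (Illustrative sketch, not icefall's ScheduledFloat.)"""

        def __init__(self, *points):
            # points: (batch_count, value) pairs, sorted by batch_count.
            self.xs = [float(p[0]) for p in points]
            self.ys = [float(p[1]) for p in points]

        def value(self, batch_count):
            if batch_count <= self.xs[0]:
                return self.ys[0]
            if batch_count >= self.xs[-1]:
                return self.ys[-1]
            i = bisect.bisect_right(self.xs, batch_count)
            x0, x1 = self.xs[i - 1], self.xs[i]
            y0, y1 = self.ys[i - 1], self.ys[i]
            return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)

    # A skip rate that decays from 0.2 to 0.0 over the first 4000 batches
    # (assumed breakpoints) would log ans=0.0 at batch_count ~1.7e6, as the
    # ff3_skip_rate entries above do:
    skip_rate = PiecewiseLinearSchedule((0, 0.2), (4000, 0.0))
    assert skip_rate.value(1706604.67) == 0.0
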
limit=6.0 2023-10-14 13:06:17,030 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1709031.3333333333, ans=0.1 2023-10-14 13:06:19,482 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1709031.3333333333, ans=0.1 2023-10-14 13:06:25,739 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1709031.3333333333, ans=0.125 2023-10-14 13:06:30,258 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1709078.0, ans=0.125 2023-10-14 13:06:35,936 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 1.805e+02 1.949e+02 2.191e+02 3.007e+02, threshold=3.897e+02, percent-clipped=0.0 2023-10-14 13:06:47,723 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1709124.6666666667, ans=0.0 2023-10-14 13:06:54,880 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1709171.3333333333, ans=0.1 2023-10-14 13:06:59,839 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1709171.3333333333, ans=0.125 2023-10-14 13:07:01,199 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.18 vs. limit=15.0 2023-10-14 13:07:02,948 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.91 vs. 
limit=15.0 2023-10-14 13:07:05,397 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1709171.3333333333, ans=0.125 2023-10-14 13:07:34,181 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1709311.3333333333, ans=0.0 2023-10-14 13:07:36,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1709311.3333333333, ans=0.0 2023-10-14 13:08:18,055 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1709451.3333333333, ans=0.125 2023-10-14 13:08:25,867 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1709498.0, ans=0.2 2023-10-14 13:08:45,570 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.464e+02 1.784e+02 1.933e+02 2.117e+02 2.895e+02, threshold=3.865e+02, percent-clipped=0.0 2023-10-14 13:08:56,290 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1709591.3333333333, ans=0.0 2023-10-14 13:08:59,224 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1709591.3333333333, ans=0.0 2023-10-14 13:09:15,004 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1709638.0, ans=0.125 2023-10-14 13:09:19,136 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1709684.6666666667, ans=0.0 2023-10-14 13:09:30,135 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1709684.6666666667, ans=0.0 2023-10-14 13:09:31,317 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1709731.3333333333, ans=0.1 2023-10-14 13:09:32,702 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1709731.3333333333, ans=0.0 2023-10-14 13:09:39,923 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1709731.3333333333, ans=0.125 2023-10-14 13:09:51,023 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=5.98 vs. 
limit=15.0 2023-10-14 13:09:54,411 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1709778.0, ans=0.2 2023-10-14 13:10:24,737 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1709871.3333333333, ans=0.2 2023-10-14 13:10:34,025 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1709918.0, ans=0.0 2023-10-14 13:10:34,985 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1709918.0, ans=0.1 2023-10-14 13:10:38,929 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1709918.0, ans=10.0 2023-10-14 13:10:47,651 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.31 vs. limit=15.0 2023-10-14 13:11:07,247 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.586e+02 1.811e+02 1.931e+02 2.184e+02 3.157e+02, threshold=3.863e+02, percent-clipped=0.0 2023-10-14 13:11:08,167 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.69 vs. limit=15.0 2023-10-14 13:11:14,729 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1710058.0, ans=0.0 2023-10-14 13:11:53,876 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1710198.0, ans=0.125 2023-10-14 13:12:26,617 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1710291.3333333333, ans=0.0 2023-10-14 13:12:29,168 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.88 vs. limit=10.0 2023-10-14 13:12:32,118 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.13 vs. limit=22.5 2023-10-14 13:12:51,660 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff2.min_abs, batch_count=1710384.6666666667, ans=0.1 2023-10-14 13:13:04,574 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1710431.3333333333, ans=0.125 2023-10-14 13:13:11,229 INFO [train.py:1031] (0/4) Epoch 27, batch 11500, loss[loss=0.2086, simple_loss=0.2991, pruned_loss=0.05904, over 16916.00 frames. ], tot_loss[loss=0.1854, simple_loss=0.278, pruned_loss=0.04639, over 32726718.44 frames. ], batch size: 165, lr: 1.26e-03, grad_scale: 32.0 2023-10-14 13:13:17,229 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.540e+02 1.912e+02 2.093e+02 2.248e+02 3.057e+02, threshold=4.185e+02, percent-clipped=0.0 2023-10-14 13:13:31,375 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.20 vs. 
limit=15.0 2023-10-14 13:13:44,255 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1710571.3333333333, ans=0.125 2023-10-14 13:13:53,053 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1710618.0, ans=0.125 2023-10-14 13:14:05,373 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.38 vs. limit=22.5 2023-10-14 13:14:09,059 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1710664.6666666667, ans=0.0 2023-10-14 13:14:11,255 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.61 vs. limit=6.0 2023-10-14 13:14:19,762 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.05 vs. limit=15.0 2023-10-14 13:14:22,400 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1710711.3333333333, ans=0.0 2023-10-14 13:14:27,101 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.08 vs. limit=12.0 2023-10-14 13:14:42,410 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.67 vs. limit=6.0 2023-10-14 13:14:51,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1710804.6666666667, ans=0.125 2023-10-14 13:15:01,979 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1710851.3333333333, ans=0.0 2023-10-14 13:15:38,050 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.569e+02 1.871e+02 2.003e+02 2.202e+02 2.942e+02, threshold=4.006e+02, percent-clipped=0.0 2023-10-14 13:15:40,591 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1710944.6666666667, ans=0.125 2023-10-14 13:16:15,509 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.30 vs. limit=15.0 2023-10-14 13:16:49,211 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1711131.3333333333, ans=0.125 2023-10-14 13:17:08,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1711224.6666666667, ans=15.0 2023-10-14 13:17:10,993 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1711224.6666666667, ans=0.0 2023-10-14 13:17:19,159 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.78 vs. 
limit=15.0 2023-10-14 13:17:36,687 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1711318.0, ans=0.125 2023-10-14 13:17:39,517 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1711318.0, ans=0.125 2023-10-14 13:17:50,450 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1711364.6666666667, ans=0.125 2023-10-14 13:18:00,820 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1711411.3333333333, ans=0.0 2023-10-14 13:18:06,993 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1711411.3333333333, ans=0.0 2023-10-14 13:18:07,079 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1711411.3333333333, ans=0.125 2023-10-14 13:18:07,618 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.769e+02 1.935e+02 2.107e+02 2.518e+02, threshold=3.869e+02, percent-clipped=0.0 2023-10-14 13:18:32,776 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.77 vs. limit=6.0 2023-10-14 13:18:37,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1711551.3333333333, ans=0.125 2023-10-14 13:19:43,259 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.65 vs. limit=10.0 2023-10-14 13:19:56,033 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1711738.0, ans=0.1 2023-10-14 13:20:11,104 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1711784.6666666667, ans=0.0 2023-10-14 13:20:27,761 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1711831.3333333333, ans=0.125 2023-10-14 13:20:38,366 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1711878.0, ans=0.125 2023-10-14 13:20:39,286 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1711878.0, ans=0.5 2023-10-14 13:20:41,685 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.529e+02 1.835e+02 2.034e+02 2.337e+02 3.433e+02, threshold=4.069e+02, percent-clipped=0.0 2023-10-14 13:20:54,062 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1711924.6666666667, ans=0.1 2023-10-14 13:21:16,472 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1711971.3333333333, ans=0.0 2023-10-14 13:21:22,862 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.26 vs. 
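In the optim.py lines, the five grad-norm values appear to be a five-point summary (min, lower quartile, median, upper quartile, max) of recent gradient norms, and across every entry in this section the threshold equals Clipping_scale times the middle value (e.g. 2.0 x 1.935e+02 = 3.870e+02 against the logged threshold=3.869e+02 just above). A hypothetical re-creation of that behaviour, with the window size an assumed value:

    import torch

    def clip_like_logged(params, norm_history, clipping_scale=2.0, window=128):
        """Track recent gradient norms, set threshold = clipping_scale *
        median, and rescale gradients whose global norm exceeds it.
        (Sketch reconstructed from the log, not the optim.py source.)"""
        grads = [p.grad for p in params if p.grad is not None]
        norm = torch.norm(torch.stack([g.norm() for g in grads])).item()
        norm_history.append(norm)
        hist = torch.tensor(norm_history[-window:])
        quartiles = torch.quantile(
            hist, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = clipping_scale * quartiles[2].item()  # 2.0 x median
        clipped = norm > threshold
        if clipped:
            for g in grads:
                g.mul_(threshold / norm)
        return quartiles.tolist(), threshold, clipped

percent-clipped would then report how often the rescaling branch fired within the logging interval; it stays at 0.0 here because no batch's norm exceeded twice the median.
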
limit=22.5 2023-10-14 13:21:43,490 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1712064.6666666667, ans=0.0 2023-10-14 13:21:44,673 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 13:22:31,075 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.82 vs. limit=22.5 2023-10-14 13:22:32,763 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1712204.6666666667, ans=0.035 2023-10-14 13:22:49,991 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1712251.3333333333, ans=0.1 2023-10-14 13:23:05,894 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1712298.0, ans=0.125 2023-10-14 13:23:06,211 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.19 vs. limit=15.0 2023-10-14 13:23:11,428 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1712344.6666666667, ans=0.125 2023-10-14 13:23:16,715 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1712344.6666666667, ans=0.125 2023-10-14 13:23:20,970 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.31 vs. limit=15.0 2023-10-14 13:23:21,230 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.508e+02 1.898e+02 2.062e+02 2.343e+02 3.588e+02, threshold=4.124e+02, percent-clipped=0.0 2023-10-14 13:23:30,872 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1712391.3333333333, ans=0.125 2023-10-14 13:23:49,819 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1712438.0, ans=0.0 2023-10-14 13:24:27,936 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1712578.0, ans=0.0 2023-10-14 13:24:36,362 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1712624.6666666667, ans=0.0 2023-10-14 13:24:39,763 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.35 vs. 
limit=15.0 2023-10-14 13:24:44,601 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1712624.6666666667, ans=0.125 2023-10-14 13:24:48,894 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1712671.3333333333, ans=0.1 2023-10-14 13:24:50,063 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 13:24:52,111 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1712671.3333333333, ans=10.0 2023-10-14 13:24:56,234 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1712671.3333333333, ans=0.125 2023-10-14 13:25:06,591 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1712718.0, ans=0.125 2023-10-14 13:25:11,246 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1712718.0, ans=0.1 2023-10-14 13:25:27,550 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.20 vs. limit=10.0 2023-10-14 13:25:28,067 INFO [train.py:1031] (0/4) Epoch 27, batch 12000, loss[loss=0.1927, simple_loss=0.2905, pruned_loss=0.04745, over 16871.00 frames. ], tot_loss[loss=0.1852, simple_loss=0.278, pruned_loss=0.04616, over 32750038.07 frames. ], batch size: 146, lr: 1.26e-03, grad_scale: 32.0 2023-10-14 13:25:36,352 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.793e+02 1.927e+02 2.044e+02 3.032e+02, threshold=3.855e+02, percent-clipped=0.0 2023-10-14 13:26:00,752 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1712904.6666666667, ans=0.125 2023-10-14 13:26:42,957 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1713044.6666666667, ans=0.125 2023-10-14 13:26:46,885 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.22 vs. limit=15.0 2023-10-14 13:26:49,102 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1713091.3333333333, ans=0.125 2023-10-14 13:27:05,278 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff2.min_abs, batch_count=1713138.0, ans=0.1 2023-10-14 13:27:06,371 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1713138.0, ans=0.125 2023-10-14 13:27:06,393 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1713138.0, ans=0.04949747468305833 2023-10-14 13:27:14,077 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1713138.0, ans=0.125 2023-10-14 13:27:24,854 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.11 vs. 
limit=12.0 2023-10-14 13:27:27,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1713184.6666666667, ans=0.2 2023-10-14 13:27:43,674 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1713278.0, ans=0.125 2023-10-14 13:27:54,191 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.698e+02 1.900e+02 2.173e+02 2.726e+02, threshold=3.801e+02, percent-clipped=0.0 2023-10-14 13:28:24,057 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1713371.3333333333, ans=0.0 2023-10-14 13:28:35,689 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.00 vs. limit=15.0 2023-10-14 13:28:46,135 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1713464.6666666667, ans=0.125 2023-10-14 13:28:51,830 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1713511.3333333333, ans=0.0 2023-10-14 13:29:02,038 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 13:29:29,829 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.06 vs. limit=15.0 2023-10-14 13:29:45,590 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1713698.0, ans=0.1 2023-10-14 13:30:05,817 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.507e+02 1.852e+02 1.978e+02 2.177e+02 3.141e+02, threshold=3.955e+02, percent-clipped=0.0 2023-10-14 13:30:14,518 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1713791.3333333333, ans=10.0 2023-10-14 13:30:14,593 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1713791.3333333333, ans=0.125 2023-10-14 13:30:15,616 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1713791.3333333333, ans=0.07 2023-10-14 13:30:32,697 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.70 vs. 
limit=15.0 2023-10-14 13:30:40,136 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1713884.6666666667, ans=0.125 2023-10-14 13:31:05,908 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1713978.0, ans=0.0 2023-10-14 13:31:11,036 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1713978.0, ans=0.125 2023-10-14 13:31:16,990 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1714024.6666666667, ans=0.0 2023-10-14 13:31:23,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1714024.6666666667, ans=0.05 2023-10-14 13:31:25,661 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1714024.6666666667, ans=0.015 2023-10-14 13:31:38,679 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1714071.3333333333, ans=0.0 2023-10-14 13:31:58,975 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.49 vs. limit=10.0 2023-10-14 13:32:12,979 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.526e+02 1.924e+02 2.192e+02 2.512e+02 3.029e+02, threshold=4.384e+02, percent-clipped=0.0 2023-10-14 13:32:14,171 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1714211.3333333333, ans=0.0 2023-10-14 13:32:35,179 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1714304.6666666667, ans=0.0 2023-10-14 13:32:35,564 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.29 vs. limit=12.0 2023-10-14 13:33:18,999 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1714491.3333333333, ans=0.1 2023-10-14 13:33:42,739 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1714538.0, ans=0.125 2023-10-14 13:33:52,684 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1714584.6666666667, ans=0.0 2023-10-14 13:34:12,636 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1714631.3333333333, ans=0.0 2023-10-14 13:34:32,100 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.806e+02 1.954e+02 2.145e+02 2.735e+02, threshold=3.908e+02, percent-clipped=0.0 2023-10-14 13:35:08,519 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.41 vs. 
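The scaling.py Whitening lines fire when a module's feature covariance is checked against its whitening limit: a metric near 1 means the channels are already decorrelated with even variances, and larger values mean an uneven eigenvalue spectrum. A rough stand-in for such a metric, under the assumption that it is the normalized second moment of the covariance eigenvalues (computable from traces, no eigendecomposition; not icefall's exact code):

    import torch

    def whitening_metric(x, num_groups=1):
        """d * E[eig^2] / (E[eig])^2 of the per-group covariance of x
        (n frames x d channels): 1.0 for a perfectly white covariance,
        growing as the spectrum becomes uneven. Illustrative assumption."""
        n, d = x.shape
        assert d % num_groups == 0
        xg = x.reshape(n, num_groups, d // num_groups)
        metrics = []
        for g in range(num_groups):
            cov = xg[:, g, :].t() @ xg[:, g, :] / n
            dg = cov.shape[0]
            metrics.append(dg * torch.trace(cov @ cov) / torch.trace(cov) ** 2)
        return torch.stack(metrics).mean()

    x = torch.randn(2000, 288)                    # near-white features
    print(whitening_metric(x))                    # close to 1
    print(whitening_metric(x * torch.linspace(0.1, 3.0, 288)))  # larger

Read this way, an entry like "metric=3.41 vs. limit=10.0" says the layer's output is within its allowed anisotropy, while metrics above the limit are the ones that trigger the corrective whitening penalty.
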
limit=10.0 2023-10-14 13:35:15,950 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1714864.6666666667, ans=0.0 2023-10-14 13:35:28,006 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1714864.6666666667, ans=0.0 2023-10-14 13:35:42,394 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1714911.3333333333, ans=0.07 2023-10-14 13:36:02,036 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1715004.6666666667, ans=0.125 2023-10-14 13:36:26,112 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1715051.3333333333, ans=0.125 2023-10-14 13:36:30,955 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=6.07 vs. limit=15.0 2023-10-14 13:36:45,483 INFO [train.py:1031] (0/4) Epoch 27, batch 12500, loss[loss=0.1789, simple_loss=0.2767, pruned_loss=0.04058, over 16906.00 frames. ], tot_loss[loss=0.1849, simple_loss=0.2776, pruned_loss=0.04614, over 32752234.11 frames. ], batch size: 93, lr: 1.26e-03, grad_scale: 32.0 2023-10-14 13:36:46,900 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1715144.6666666667, ans=0.125 2023-10-14 13:36:53,521 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1715144.6666666667, ans=0.125 2023-10-14 13:36:56,165 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.832e+02 2.046e+02 2.257e+02 2.755e+02, threshold=4.091e+02, percent-clipped=0.0 2023-10-14 13:37:00,282 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1715191.3333333333, ans=0.04949747468305833 2023-10-14 13:37:15,516 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1715238.0, ans=0.2 2023-10-14 13:37:20,744 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1715238.0, ans=0.125 2023-10-14 13:37:21,699 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1715238.0, ans=0.125 2023-10-14 13:37:21,855 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1715238.0, ans=10.0 2023-10-14 13:37:48,931 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1715331.3333333333, ans=0.1 2023-10-14 13:37:52,736 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1715331.3333333333, ans=0.125 2023-10-14 13:38:00,565 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1715378.0, ans=0.1 2023-10-14 13:38:24,447 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1715471.3333333333, ans=0.125 2023-10-14 13:38:42,629 INFO [scaling.py:1069] (0/4) WithLoss: 
name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 13:39:00,316 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1715564.6666666667, ans=0.125 2023-10-14 13:39:16,490 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.610e+02 1.834e+02 2.092e+02 2.359e+02 3.771e+02, threshold=4.183e+02, percent-clipped=0.0 2023-10-14 13:40:01,374 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1715798.0, ans=0.125 2023-10-14 13:40:05,190 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1715798.0, ans=0.125 2023-10-14 13:40:10,257 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.61 vs. limit=22.5 2023-10-14 13:40:24,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1715844.6666666667, ans=0.125 2023-10-14 13:40:55,329 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1715984.6666666667, ans=0.2 2023-10-14 13:40:58,485 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1715984.6666666667, ans=0.0 2023-10-14 13:41:10,712 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1716031.3333333333, ans=0.125 2023-10-14 13:41:28,235 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.522e+02 1.817e+02 1.972e+02 2.139e+02 2.933e+02, threshold=3.944e+02, percent-clipped=0.0 2023-10-14 13:41:58,135 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1716171.3333333333, ans=0.0 2023-10-14 13:42:01,635 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1716218.0, ans=0.2 2023-10-14 13:42:10,004 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1716218.0, ans=0.0 2023-10-14 13:42:10,753 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1716218.0, ans=0.2 2023-10-14 13:42:11,032 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.46 vs. limit=22.5 2023-10-14 13:42:23,183 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.64 vs. 
limit=5.0 2023-10-14 13:42:48,689 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1716358.0, ans=0.1 2023-10-14 13:43:19,058 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1716498.0, ans=0.125 2023-10-14 13:43:25,895 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1716544.6666666667, ans=0.125 2023-10-14 13:43:31,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1716544.6666666667, ans=0.0 2023-10-14 13:43:32,213 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1716544.6666666667, ans=0.0 2023-10-14 13:43:33,804 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.586e+02 1.932e+02 2.068e+02 2.298e+02 2.933e+02, threshold=4.136e+02, percent-clipped=0.0 2023-10-14 13:43:35,794 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1716544.6666666667, ans=0.0 2023-10-14 13:43:39,499 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1716591.3333333333, ans=0.125 2023-10-14 13:44:02,292 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1716684.6666666667, ans=0.1 2023-10-14 13:44:09,461 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1716731.3333333333, ans=0.125 2023-10-14 13:44:09,881 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.40 vs. limit=15.0 2023-10-14 13:44:36,418 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1716824.6666666667, ans=0.0 2023-10-14 13:45:02,864 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1716918.0, ans=0.125 2023-10-14 13:45:17,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1716964.6666666667, ans=0.125 2023-10-14 13:45:29,244 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1717011.3333333333, ans=0.1 2023-10-14 13:45:29,903 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.805e+02 1.907e+02 2.151e+02 2.759e+02, threshold=3.814e+02, percent-clipped=0.0 2023-10-14 13:45:34,982 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1717058.0, ans=0.0 2023-10-14 13:45:43,377 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1717104.6666666667, ans=0.125 2023-10-14 13:45:47,723 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.02 vs. 
limit=15.0 2023-10-14 13:46:11,587 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1717198.0, ans=0.07 2023-10-14 13:46:13,334 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1717198.0, ans=0.125 2023-10-14 13:46:13,833 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.36 vs. limit=8.0 2023-10-14 13:46:19,916 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1717244.6666666667, ans=0.125 2023-10-14 13:46:20,075 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1717244.6666666667, ans=0.125 2023-10-14 13:46:24,428 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1717244.6666666667, ans=0.125 2023-10-14 13:46:36,930 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-368000.pt 2023-10-14 13:46:46,214 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1717338.0, ans=0.0 2023-10-14 13:46:58,480 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-14 13:47:14,629 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1717478.0, ans=0.125 2023-10-14 13:47:15,324 INFO [train.py:1031] (0/4) Epoch 27, batch 13000, loss[loss=0.195, simple_loss=0.2834, pruned_loss=0.05333, over 16449.00 frames. ], tot_loss[loss=0.1855, simple_loss=0.2784, pruned_loss=0.04631, over 32768982.10 frames. ], batch size: 266, lr: 1.26e-03, grad_scale: 16.0 2023-10-14 13:47:25,220 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.521e+02 1.888e+02 2.006e+02 2.250e+02 2.756e+02, threshold=4.011e+02, percent-clipped=0.0 2023-10-14 13:47:32,647 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.06 vs. limit=15.0 2023-10-14 13:47:38,845 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.01 vs. limit=15.0 2023-10-14 13:47:49,985 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1717571.3333333333, ans=0.125 2023-10-14 13:47:57,767 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1717618.0, ans=0.125 2023-10-14 13:48:01,581 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1717618.0, ans=0.125 2023-10-14 13:48:05,004 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.94 vs. limit=15.0 2023-10-14 13:48:26,239 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1717711.3333333333, ans=0.125 2023-10-14 13:48:27,922 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.93 vs. 
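The train.py progress records above report two losses: loss[...] for the current batch, normalized over that batch's frames, and tot_loss[...] as a running frame-weighted aggregate. The fractional running frame count (e.g. 32768982.10) suggests older batches are exponentially decayed rather than summed forever; a small sketch of that bookkeeping, with the decay factor an assumed illustrative value:

    class RunningLoss:
        """Frame-weighted running loss with exponential forgetting.
        (Sketch; the decay constant and field names are assumptions.)"""

        def __init__(self, decay=0.999):
            self.decay = decay
            self.loss_sum = 0.0
            self.frames = 0.0

        def update(self, batch_loss_sum, batch_frames):
            # batch_loss_sum is the summed (not averaged) loss over the batch.
            self.loss_sum = self.decay * self.loss_sum + batch_loss_sum
            self.frames = self.decay * self.frames + batch_frames

        @property
        def value(self):
            return self.loss_sum / max(self.frames, 1.0)

    tot = RunningLoss()
    tot.update(batch_loss_sum=0.195 * 16449.0, batch_frames=16449.0)
    print(f"tot_loss[loss={tot.value:.4g}, over {tot.frames:.2f} frames]")

The periodic checkpoint.py saves interleave with these records, so a checkpoint name like checkpoint-368000.pt pins the aggregate statistics to a specific batch_idx_train.
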
limit=22.5 2023-10-14 13:48:40,266 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1717758.0, ans=0.125 2023-10-14 13:49:06,295 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1717851.3333333333, ans=0.125 2023-10-14 13:49:12,891 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.70 vs. limit=15.0 2023-10-14 13:49:24,455 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1717898.0, ans=0.125 2023-10-14 13:49:27,188 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.30 vs. limit=10.0 2023-10-14 13:49:29,433 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1717944.6666666667, ans=0.125 2023-10-14 13:49:30,567 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.72 vs. limit=6.0 2023-10-14 13:49:35,889 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.534e+02 1.817e+02 2.001e+02 2.337e+02 3.316e+02, threshold=4.002e+02, percent-clipped=0.0 2023-10-14 13:49:50,599 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.42 vs. limit=12.0 2023-10-14 13:49:55,419 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1718038.0, ans=0.0 2023-10-14 13:50:05,711 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.71 vs. limit=10.0 2023-10-14 13:50:21,704 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1718131.3333333333, ans=0.0 2023-10-14 13:50:24,810 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1718131.3333333333, ans=0.0 2023-10-14 13:50:36,271 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.47 vs. 
limit=15.0 2023-10-14 13:50:38,119 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1718224.6666666667, ans=0.0 2023-10-14 13:50:59,698 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1718271.3333333333, ans=0.05 2023-10-14 13:51:01,956 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1718271.3333333333, ans=0.125 2023-10-14 13:51:17,053 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1718318.0, ans=0.125 2023-10-14 13:51:30,867 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1718411.3333333333, ans=0.0 2023-10-14 13:51:43,335 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.520e+02 1.793e+02 1.940e+02 2.098e+02 2.749e+02, threshold=3.880e+02, percent-clipped=0.0 2023-10-14 13:52:05,128 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1718504.6666666667, ans=0.0 2023-10-14 13:52:09,845 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1718551.3333333333, ans=0.07 2023-10-14 13:52:19,090 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1718598.0, ans=0.0 2023-10-14 13:52:33,049 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.03 vs. limit=15.0 2023-10-14 13:52:40,365 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1718644.6666666667, ans=0.125 2023-10-14 13:52:50,190 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1718691.3333333333, ans=0.0 2023-10-14 13:53:41,107 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.513e+02 1.834e+02 1.964e+02 2.143e+02 6.427e+02, threshold=3.927e+02, percent-clipped=1.0 2023-10-14 13:53:59,206 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1718971.3333333333, ans=0.125 2023-10-14 13:54:12,764 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.38 vs. limit=22.5 2023-10-14 13:55:26,031 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1719298.0, ans=0.125 2023-10-14 13:55:42,339 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.582e+02 1.868e+02 1.981e+02 2.153e+02 2.879e+02, threshold=3.961e+02, percent-clipped=0.0 2023-10-14 13:55:51,751 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.05 vs. 
limit=12.0 2023-10-14 13:55:53,330 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1719391.3333333333, ans=0.2 2023-10-14 13:56:12,558 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1719484.6666666667, ans=0.125 2023-10-14 13:56:36,402 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1719578.0, ans=0.035 2023-10-14 13:56:47,125 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.91 vs. limit=22.5 2023-10-14 13:56:57,077 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.91 vs. limit=15.0 2023-10-14 13:57:06,600 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1719718.0, ans=0.125 2023-10-14 13:57:11,138 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1719718.0, ans=0.125 2023-10-14 13:57:25,834 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1719811.3333333333, ans=0.1 2023-10-14 13:57:26,538 INFO [train.py:1031] (0/4) Epoch 27, batch 13500, loss[loss=0.1682, simple_loss=0.2594, pruned_loss=0.03852, over 16048.00 frames. ], tot_loss[loss=0.1851, simple_loss=0.2777, pruned_loss=0.04628, over 32736099.26 frames. ], batch size: 43, lr: 1.26e-03, grad_scale: 16.0 2023-10-14 13:57:28,600 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1719811.3333333333, ans=0.125 2023-10-14 13:57:34,555 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1719811.3333333333, ans=0.1 2023-10-14 13:57:36,979 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.399e+02 1.776e+02 1.926e+02 2.122e+02 2.807e+02, threshold=3.852e+02, percent-clipped=0.0 2023-10-14 13:57:49,762 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1719904.6666666667, ans=0.0 2023-10-14 13:57:55,146 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1719904.6666666667, ans=0.125 2023-10-14 13:58:09,793 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1719951.3333333333, ans=0.125 2023-10-14 13:58:39,076 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1720091.3333333333, ans=0.0 2023-10-14 13:58:48,610 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.33 vs. 
limit=15.0 2023-10-14 13:58:54,755 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1720138.0, ans=0.0 2023-10-14 13:58:55,796 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1720138.0, ans=0.125 2023-10-14 13:58:55,928 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1720138.0, ans=0.125 2023-10-14 13:59:22,373 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1720231.3333333333, ans=0.125 2023-10-14 13:59:23,248 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1720231.3333333333, ans=0.0 2023-10-14 13:59:26,376 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1720278.0, ans=0.0 2023-10-14 13:59:34,802 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.610e+02 1.883e+02 2.023e+02 2.213e+02 2.839e+02, threshold=4.045e+02, percent-clipped=0.0 2023-10-14 13:59:49,804 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.46 vs. limit=15.0 2023-10-14 13:59:52,716 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1720371.3333333333, ans=0.1 2023-10-14 14:00:11,618 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.86 vs. limit=10.0 2023-10-14 14:00:25,422 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/epoch-27.pt 2023-10-14 14:00:58,594 INFO [train.py:1031] (0/4) Epoch 28, batch 0, loss[loss=0.1659, simple_loss=0.2648, pruned_loss=0.03351, over 16939.00 frames. ], tot_loss[loss=0.1659, simple_loss=0.2648, pruned_loss=0.03351, over 16939.00 frames. ], batch size: 87, lr: 1.23e-03, grad_scale: 32.0 2023-10-14 14:00:58,596 INFO [train.py:1054] (0/4) Computing validation loss 2023-10-14 14:01:08,805 INFO [train.py:1063] (0/4) Epoch 28, validation: loss=0.2128, simple_loss=0.2998, pruned_loss=0.06294, over 1020973.00 frames. 2023-10-14 14:01:08,806 INFO [train.py:1064] (0/4) Maximum memory allocated so far is 17165MB 2023-10-14 14:01:29,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1720581.3333333333, ans=0.0 2023-10-14 14:01:37,292 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1720628.0, ans=0.0 2023-10-14 14:01:37,318 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1720628.0, ans=0.1 2023-10-14 14:01:59,377 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1720721.3333333333, ans=0.0 2023-10-14 14:01:59,869 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.95 vs. 
limit=10.0 2023-10-14 14:02:14,519 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1720768.0, ans=0.1 2023-10-14 14:02:17,927 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.828e+02 2.010e+02 2.241e+02 3.487e+02, threshold=4.019e+02, percent-clipped=0.0 2023-10-14 14:02:18,281 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1720768.0, ans=0.125 2023-10-14 14:02:38,791 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1720861.3333333333, ans=0.125 2023-10-14 14:02:38,803 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1720861.3333333333, ans=0.125 2023-10-14 14:02:51,307 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.22 vs. limit=22.5 2023-10-14 14:02:54,445 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.19 vs. limit=10.0 2023-10-14 14:02:58,353 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1720954.6666666667, ans=0.125 2023-10-14 14:03:09,996 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1721001.3333333333, ans=0.2 2023-10-14 14:03:10,958 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1721001.3333333333, ans=0.1 2023-10-14 14:03:30,745 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1721048.0, ans=0.0 2023-10-14 14:03:53,619 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1721141.3333333333, ans=0.125 2023-10-14 14:03:56,994 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.33 vs. 
limit=10.0 2023-10-14 14:04:01,819 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, batch_count=1721188.0, ans=0.05 2023-10-14 14:04:19,837 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.794e+02 1.918e+02 2.187e+02 3.134e+02, threshold=3.836e+02, percent-clipped=0.0 2023-10-14 14:04:20,215 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1721234.6666666667, ans=0.125 2023-10-14 14:05:11,033 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1721468.0, ans=0.0 2023-10-14 14:05:21,920 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1721514.6666666667, ans=0.125 2023-10-14 14:05:54,844 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1721654.6666666667, ans=0.1 2023-10-14 14:06:18,517 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.476e+02 1.826e+02 1.960e+02 2.196e+02 4.557e+02, threshold=3.920e+02, percent-clipped=1.0 2023-10-14 14:06:28,320 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1721748.0, ans=0.125 2023-10-14 14:06:30,497 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1721748.0, ans=0.125 2023-10-14 14:06:38,202 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1721794.6666666667, ans=0.125 2023-10-14 14:06:40,373 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1721794.6666666667, ans=0.0 2023-10-14 14:06:47,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1721841.3333333333, ans=0.2 2023-10-14 14:06:49,793 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1721841.3333333333, ans=0.125 2023-10-14 14:07:02,645 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1721888.0, ans=0.0 2023-10-14 14:07:12,714 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1721934.6666666667, ans=0.125 2023-10-14 14:07:15,748 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1721934.6666666667, ans=0.125 2023-10-14 14:07:26,576 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1721981.3333333333, ans=0.0 2023-10-14 14:07:32,985 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1722028.0, ans=0.125 2023-10-14 14:07:35,260 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.44 vs. 
limit=22.5 2023-10-14 14:07:44,443 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1722074.6666666667, ans=0.125 2023-10-14 14:08:03,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1722121.3333333333, ans=0.1 2023-10-14 14:08:15,704 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.31 vs. limit=15.0 2023-10-14 14:08:16,056 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.483e+02 1.796e+02 2.003e+02 2.190e+02 2.865e+02, threshold=4.006e+02, percent-clipped=0.0 2023-10-14 14:08:30,744 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1722214.6666666667, ans=0.125 2023-10-14 14:09:16,145 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.59 vs. limit=15.0 2023-10-14 14:09:26,001 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1722448.0, ans=0.125 2023-10-14 14:09:26,798 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1722448.0, ans=0.1 2023-10-14 14:09:32,834 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1722448.0, ans=0.125 2023-10-14 14:09:43,836 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1722494.6666666667, ans=0.0 2023-10-14 14:09:47,973 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1722541.3333333333, ans=0.0 2023-10-14 14:09:58,866 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1722588.0, ans=0.1 2023-10-14 14:10:19,899 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.546e+02 1.830e+02 2.016e+02 2.217e+02 2.906e+02, threshold=4.033e+02, percent-clipped=0.0 2023-10-14 14:10:26,332 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1722681.3333333333, ans=6.0 2023-10-14 14:10:28,951 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1722681.3333333333, ans=10.0 2023-10-14 14:10:44,957 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1722728.0, ans=0.125 2023-10-14 14:10:47,561 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1722774.6666666667, ans=0.125 2023-10-14 14:10:49,118 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.47 vs. 
limit=15.0 2023-10-14 14:10:58,251 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1722774.6666666667, ans=0.1 2023-10-14 14:11:03,899 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1722821.3333333333, ans=0.125 2023-10-14 14:11:10,382 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1722821.3333333333, ans=0.2 2023-10-14 14:11:13,047 INFO [train.py:1031] (0/4) Epoch 28, batch 500, loss[loss=0.1901, simple_loss=0.2766, pruned_loss=0.05179, over 15219.00 frames. ], tot_loss[loss=0.1849, simple_loss=0.2776, pruned_loss=0.04606, over 7293249.30 frames. ], batch size: 35, lr: 1.23e-03, grad_scale: 32.0 2023-10-14 14:11:19,471 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.36 vs. limit=12.0 2023-10-14 14:11:23,156 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1722914.6666666667, ans=0.0 2023-10-14 14:11:27,020 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1722914.6666666667, ans=0.0 2023-10-14 14:11:36,219 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1722961.3333333333, ans=0.95 2023-10-14 14:11:43,134 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.86 vs. limit=22.5 2023-10-14 14:11:44,682 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1722961.3333333333, ans=0.0 2023-10-14 14:12:08,149 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.14 vs. limit=15.0 2023-10-14 14:12:17,852 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.526e+02 1.930e+02 2.120e+02 2.333e+02 3.235e+02, threshold=4.239e+02, percent-clipped=0.0 2023-10-14 14:12:55,116 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1723241.3333333333, ans=0.2 2023-10-14 14:13:09,392 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.72 vs. 
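Each Whitening line compares a per-module statistic against a (possibly scheduled) whitening_limit, e.g. metric=3.36 vs. limit=12.0 above; the metric is 1.0 when the feature covariance within each channel group is a multiple of the identity and grows as the covariance departs from white, and a corrective gradient is applied only when the limit is exceeded. A sketch of one such measure, E[lambda^2]/E[lambda]^2 over the covariance eigenvalues, which matches that behaviour but is not necessarily icefall's exact formula:

```python
import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    """Whiteness statistic for activations x of shape (frames, channels).

    Returns E[lambda^2] / E[lambda]^2 over the eigenvalues of each channel
    group's covariance: 1.0 iff the covariance is a multiple of the
    identity, larger otherwise. Sketch only; the real module also applies
    a gradient penalty when the metric exceeds the scheduled limit.
    """
    num_frames, num_channels = x.shape
    assert num_channels % num_groups == 0
    x = x.reshape(num_frames, num_groups, num_channels // num_groups)
    x = x.transpose(0, 1)                        # (groups, frames, chans)
    covar = x.transpose(1, 2) @ x / num_frames   # per-group covariance
    mean_eig = covar.diagonal(dim1=1, dim2=2).mean(dim=1)         # E[lambda]
    mean_eig_sq = (covar ** 2).sum(dim=(1, 2)) / covar.shape[-1]  # E[lambda^2]
    return (mean_eig_sq / mean_eig ** 2).mean().item()
```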
limit=10.0 2023-10-14 14:13:14,137 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1723334.6666666667, ans=0.0 2023-10-14 14:13:22,236 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1723381.3333333333, ans=0.125 2023-10-14 14:13:31,245 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1723381.3333333333, ans=0.0 2023-10-14 14:13:41,330 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1723428.0, ans=0.125 2023-10-14 14:13:52,860 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 14:14:04,901 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.35 vs. limit=15.0 2023-10-14 14:14:11,759 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.99 vs. limit=6.0 2023-10-14 14:14:11,790 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.39 vs. limit=22.5 2023-10-14 14:14:13,385 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1723568.0, ans=0.0 2023-10-14 14:14:19,739 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1723568.0, ans=0.125 2023-10-14 14:14:19,876 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1723568.0, ans=0.1 2023-10-14 14:14:21,204 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.637e+02 1.926e+02 2.134e+02 2.289e+02 3.205e+02, threshold=4.268e+02, percent-clipped=0.0 2023-10-14 14:14:30,735 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1723614.6666666667, ans=0.1 2023-10-14 14:14:35,397 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1723614.6666666667, ans=0.0 2023-10-14 14:14:42,203 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.00 vs. limit=15.0 2023-10-14 14:15:04,479 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1723754.6666666667, ans=0.125 2023-10-14 14:15:36,244 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1723848.0, ans=0.0 2023-10-14 14:15:46,784 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.52 vs. 
limit=15.0 2023-10-14 14:15:56,624 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1723941.3333333333, ans=0.125 2023-10-14 14:16:08,324 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1723988.0, ans=0.1 2023-10-14 14:16:09,787 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.03 vs. limit=15.0 2023-10-14 14:16:18,099 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1724034.6666666667, ans=0.0 2023-10-14 14:16:25,980 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.594e+02 1.883e+02 2.074e+02 2.300e+02 3.072e+02, threshold=4.149e+02, percent-clipped=0.0 2023-10-14 14:16:54,298 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.35 vs. limit=15.0 2023-10-14 14:16:56,106 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1724128.0, ans=0.2 2023-10-14 14:17:00,632 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.13 vs. limit=22.5 2023-10-14 14:17:04,495 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.88 vs. limit=12.0 2023-10-14 14:17:06,232 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1724174.6666666667, ans=0.125 2023-10-14 14:17:06,432 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1724174.6666666667, ans=0.125 2023-10-14 14:17:15,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1724221.3333333333, ans=0.0 2023-10-14 14:17:32,416 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.97 vs. limit=15.0 2023-10-14 14:17:33,038 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1724314.6666666667, ans=0.125 2023-10-14 14:18:14,870 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1724408.0, ans=0.0 2023-10-14 14:18:14,974 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1724408.0, ans=0.0 2023-10-14 14:18:15,863 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1724454.6666666667, ans=0.2 2023-10-14 14:18:35,532 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.24 vs. 
limit=12.0 2023-10-14 14:18:36,031 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.569e+02 1.865e+02 1.999e+02 2.217e+02 3.500e+02, threshold=3.999e+02, percent-clipped=0.0 2023-10-14 14:18:42,771 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1724548.0, ans=0.95 2023-10-14 14:18:58,544 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 14:19:31,705 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=4.60 vs. limit=15.0 2023-10-14 14:19:46,159 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1724781.3333333333, ans=0.125 2023-10-14 14:19:54,841 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.97 vs. limit=15.0 2023-10-14 14:20:04,066 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.09 vs. limit=12.0 2023-10-14 14:20:07,908 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 14:20:11,386 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.62 vs. limit=15.0 2023-10-14 14:20:38,384 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1724968.0, ans=0.2 2023-10-14 14:20:42,133 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.464e+02 1.761e+02 1.908e+02 2.203e+02 3.021e+02, threshold=3.816e+02, percent-clipped=0.0 2023-10-14 14:21:02,732 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1725061.3333333333, ans=0.125 2023-10-14 14:21:06,571 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1725061.3333333333, ans=0.125 2023-10-14 14:21:16,802 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1725108.0, ans=0.07 2023-10-14 14:21:27,077 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1725154.6666666667, ans=0.0 2023-10-14 14:21:36,565 INFO [train.py:1031] (0/4) Epoch 28, batch 1000, loss[loss=0.2022, simple_loss=0.2965, pruned_loss=0.05394, over 16611.00 frames. ], tot_loss[loss=0.1855, simple_loss=0.2786, pruned_loss=0.04624, over 12951998.90 frames. 
], batch size: 241, lr: 1.23e-03, grad_scale: 32.0 2023-10-14 14:21:47,506 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1725248.0, ans=0.1 2023-10-14 14:21:52,765 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1725248.0, ans=0.025 2023-10-14 14:21:55,741 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1725248.0, ans=0.125 2023-10-14 14:21:55,748 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1725248.0, ans=0.125 2023-10-14 14:22:11,252 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1725341.3333333333, ans=0.1 2023-10-14 14:22:19,617 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1725341.3333333333, ans=0.125 2023-10-14 14:22:20,729 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1725341.3333333333, ans=0.125 2023-10-14 14:22:28,945 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1725388.0, ans=0.125 2023-10-14 14:22:41,121 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.460e+02 1.826e+02 2.000e+02 2.197e+02 3.254e+02, threshold=4.000e+02, percent-clipped=0.0 2023-10-14 14:22:51,554 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1725481.3333333333, ans=0.125 2023-10-14 14:23:06,006 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1725528.0, ans=0.125 2023-10-14 14:23:09,219 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1725574.6666666667, ans=0.125 2023-10-14 14:23:36,116 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.75 vs. limit=22.5 2023-10-14 14:23:38,318 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.37 vs. 
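Each train.py progress line pairs the current batch's losses (weighted by its frame count) with tot_loss, a running frame-weighted average; the fractional cumulative frame counts (e.g. 12951998.90) suggest a decayed accumulator rather than a plain sum. A small sketch of such bookkeeping, with the decay constant an assumption:

```python
class LossTracker:
    """Frame-weighted running average of losses with exponential decay.

    Sketch of the `tot_loss[... over N frames]` bookkeeping in the log;
    the decay factor is an assumption, not icefall's exact constant.
    """

    def __init__(self, decay: float = 0.999):
        self.decay = decay
        self.sums = {"loss": 0.0, "simple_loss": 0.0, "pruned_loss": 0.0}
        self.frames = 0.0

    def update(self, batch_losses: dict, num_frames: float) -> None:
        # Decay the history, then add this batch's frame-weighted sums.
        self.frames = self.frames * self.decay + num_frames
        for k in self.sums:
            self.sums[k] = self.sums[k] * self.decay + batch_losses[k] * num_frames

    def averages(self) -> dict:
        # The loss values printed inside tot_loss[...] in each progress line.
        return {k: s / self.frames for k, s in self.sums.items()}
```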
limit=10.0 2023-10-14 14:23:39,522 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1725668.0, ans=0.125 2023-10-14 14:23:53,888 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1725714.6666666667, ans=0.2 2023-10-14 14:24:10,214 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1725761.3333333333, ans=0.1 2023-10-14 14:24:11,416 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1725761.3333333333, ans=0.1 2023-10-14 14:24:20,617 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1725808.0, ans=0.2 2023-10-14 14:24:32,884 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1725854.6666666667, ans=0.125 2023-10-14 14:24:37,845 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1725901.3333333333, ans=0.0 2023-10-14 14:24:38,220 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.88 vs. limit=22.5 2023-10-14 14:24:45,590 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.553e+02 1.824e+02 2.004e+02 2.248e+02 2.999e+02, threshold=4.008e+02, percent-clipped=0.0 2023-10-14 14:25:07,762 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1725994.6666666667, ans=0.0 2023-10-14 14:25:13,282 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1725994.6666666667, ans=0.1 2023-10-14 14:25:33,118 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1726088.0, ans=0.125 2023-10-14 14:25:45,764 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.38 vs. limit=15.0 2023-10-14 14:25:53,941 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1726134.6666666667, ans=0.0 2023-10-14 14:26:13,433 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1726228.0, ans=0.2 2023-10-14 14:26:37,771 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.80 vs. 
limit=15.0 2023-10-14 14:26:48,080 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.728e+02 1.919e+02 2.127e+02 3.307e+02, threshold=3.838e+02, percent-clipped=0.0 2023-10-14 14:26:55,101 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1726414.6666666667, ans=0.0 2023-10-14 14:27:18,736 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1726508.0, ans=0.125 2023-10-14 14:27:40,482 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1726601.3333333333, ans=0.0 2023-10-14 14:27:48,364 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1726601.3333333333, ans=0.125 2023-10-14 14:27:57,301 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.85 vs. limit=15.0 2023-10-14 14:28:21,175 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1726741.3333333333, ans=0.1 2023-10-14 14:28:35,320 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1726788.0, ans=0.0 2023-10-14 14:28:52,045 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.548e+02 1.770e+02 1.913e+02 2.146e+02 3.065e+02, threshold=3.826e+02, percent-clipped=0.0 2023-10-14 14:28:53,316 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1726834.6666666667, ans=0.2 2023-10-14 14:28:59,530 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1726881.3333333333, ans=0.1 2023-10-14 14:28:59,868 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.10 vs. 
limit=22.5 2023-10-14 14:29:01,783 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1726881.3333333333, ans=0.125 2023-10-14 14:29:17,490 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1726928.0, ans=0.1 2023-10-14 14:29:48,089 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1727068.0, ans=0.04949747468305833 2023-10-14 14:29:49,517 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1727068.0, ans=0.0 2023-10-14 14:29:53,994 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.min_positive, batch_count=1727068.0, ans=0.025 2023-10-14 14:30:21,725 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1727208.0, ans=0.0 2023-10-14 14:30:21,729 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 14:30:26,672 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1727208.0, ans=0.125 2023-10-14 14:30:41,916 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1727254.6666666667, ans=0.1 2023-10-14 14:30:50,902 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.25 vs. limit=22.5 2023-10-14 14:30:58,801 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.537e+02 1.821e+02 1.963e+02 2.160e+02 2.997e+02, threshold=3.925e+02, percent-clipped=0.0 2023-10-14 14:31:12,582 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 14:31:17,942 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1727394.6666666667, ans=15.0 2023-10-14 14:31:27,120 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.95 vs. limit=15.0 2023-10-14 14:31:34,657 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1727488.0, ans=0.125 2023-10-14 14:31:36,297 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1727488.0, ans=0.1 2023-10-14 14:31:49,253 INFO [train.py:1031] (0/4) Epoch 28, batch 1500, loss[loss=0.1844, simple_loss=0.2492, pruned_loss=0.05984, over 12530.00 frames. ], tot_loss[loss=0.1845, simple_loss=0.2773, pruned_loss=0.04588, over 17327704.63 frames. ], batch size: 440, lr: 1.23e-03, grad_scale: 16.0 2023-10-14 14:32:01,542 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.96 vs. 
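The grad_scale field in the progress lines (32.0 at batches 500 and 1000, dipping to 16.0 by batch 1500) is the loss-scaling factor of mixed-precision training: it is halved whenever scaled gradients overflow and grows back after a run of clean steps. A generic torch.cuda.amp sketch of that loop; model, optimizer and compute_loss are placeholders, and the initial scale is an assumption:

```python
import torch

scaler = torch.cuda.amp.GradScaler(init_scale=1.0)  # assumed initial scale

def fp16_step(model, optimizer, batch, compute_loss):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():          # forward in fp16 where safe
        loss = compute_loss(model, batch)
    scaler.scale(loss).backward()            # backward on the scaled loss
    scaler.step(optimizer)   # skipped (and scale halved) on inf/nan grads
    scaler.update()          # scale grows again after enough clean steps
    return loss.detach(), scaler.get_scale()  # get_scale(): the logged grad_scale
```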
limit=10.0 2023-10-14 14:32:02,081 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1727581.3333333333, ans=0.125 2023-10-14 14:33:00,579 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.533e+02 1.791e+02 1.976e+02 2.218e+02 3.078e+02, threshold=3.953e+02, percent-clipped=0.0 2023-10-14 14:33:22,990 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1727861.3333333333, ans=0.125 2023-10-14 14:33:25,147 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 14:33:25,353 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.51 vs. limit=15.0 2023-10-14 14:33:26,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1727861.3333333333, ans=0.125 2023-10-14 14:33:31,022 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.89 vs. limit=6.0 2023-10-14 14:33:53,541 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1728001.3333333333, ans=0.125 2023-10-14 14:33:58,490 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=3.68 vs. limit=15.0 2023-10-14 14:34:31,696 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1728141.3333333333, ans=0.0 2023-10-14 14:34:33,754 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.81 vs. 
limit=6.0 2023-10-14 14:35:01,792 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1728234.6666666667, ans=0.125 2023-10-14 14:35:05,932 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1728234.6666666667, ans=0.2 2023-10-14 14:35:08,703 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1728234.6666666667, ans=0.0 2023-10-14 14:35:09,901 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.479e+02 1.790e+02 1.899e+02 2.071e+02 3.153e+02, threshold=3.798e+02, percent-clipped=0.0 2023-10-14 14:35:31,986 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 14:35:42,589 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1728374.6666666667, ans=0.2 2023-10-14 14:36:23,343 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 14:36:37,886 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1728561.3333333333, ans=0.0 2023-10-14 14:36:43,357 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1728608.0, ans=0.125 2023-10-14 14:36:51,065 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.32 vs. limit=15.0 2023-10-14 14:36:56,184 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1728654.6666666667, ans=0.0 2023-10-14 14:37:04,206 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.45 vs. limit=15.0 2023-10-14 14:37:10,240 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.564e+02 1.890e+02 2.067e+02 2.292e+02 3.487e+02, threshold=4.133e+02, percent-clipped=0.0 2023-10-14 14:37:14,611 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1728748.0, ans=0.125 2023-10-14 14:37:19,640 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.83 vs. 
limit=15.0 2023-10-14 14:37:47,763 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1728841.3333333333, ans=0.0 2023-10-14 14:37:54,809 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1728888.0, ans=0.125 2023-10-14 14:38:10,154 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1728934.6666666667, ans=0.1 2023-10-14 14:38:15,704 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1728934.6666666667, ans=0.0 2023-10-14 14:38:21,156 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1728981.3333333333, ans=0.125 2023-10-14 14:38:27,536 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1728981.3333333333, ans=0.125 2023-10-14 14:38:37,335 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1729028.0, ans=0.1 2023-10-14 14:38:51,731 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.78 vs. limit=10.0 2023-10-14 14:39:12,208 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=1729121.3333333333, ans=0.95 2023-10-14 14:39:28,867 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.587e+02 1.806e+02 2.004e+02 2.184e+02 2.863e+02, threshold=4.009e+02, percent-clipped=0.0 2023-10-14 14:39:43,120 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.18 vs. limit=12.0 2023-10-14 14:39:53,732 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.79 vs. limit=15.0 2023-10-14 14:40:00,673 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1729308.0, ans=0.125 2023-10-14 14:40:07,593 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.11 vs. limit=15.0 2023-10-14 14:40:16,334 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1729354.6666666667, ans=0.0 2023-10-14 14:40:20,044 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.88 vs. 
limit=15.0 2023-10-14 14:40:22,493 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1729354.6666666667, ans=0.125 2023-10-14 14:40:41,344 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1729448.0, ans=0.1 2023-10-14 14:40:59,833 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 14:41:04,193 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1729494.6666666667, ans=0.125 2023-10-14 14:41:11,764 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.90 vs. limit=10.0 2023-10-14 14:41:22,160 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1729588.0, ans=0.1 2023-10-14 14:41:50,879 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.488e+02 1.804e+02 1.932e+02 2.173e+02 3.347e+02, threshold=3.863e+02, percent-clipped=0.0 2023-10-14 14:42:12,940 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.51 vs. limit=15.0 2023-10-14 14:42:15,659 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 14:42:17,516 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.84 vs. limit=15.0 2023-10-14 14:42:18,334 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1729728.0, ans=0.0 2023-10-14 14:42:31,241 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.78 vs. limit=6.0 2023-10-14 14:42:31,326 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=15.33 vs. limit=22.5 2023-10-14 14:42:51,513 INFO [train.py:1031] (0/4) Epoch 28, batch 2000, loss[loss=0.193, simple_loss=0.2904, pruned_loss=0.04773, over 16881.00 frames. ], tot_loss[loss=0.1848, simple_loss=0.2778, pruned_loss=0.04593, over 20756091.10 frames. ], batch size: 130, lr: 1.23e-03, grad_scale: 32.0 2023-10-14 14:43:04,090 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1729868.0, ans=0.95 2023-10-14 14:43:38,402 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.85 vs. 
limit=6.0 2023-10-14 14:43:42,915 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1729961.3333333333, ans=0.2 2023-10-14 14:44:13,485 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1730054.6666666667, ans=0.125 2023-10-14 14:44:19,340 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1730054.6666666667, ans=0.1 2023-10-14 14:44:30,628 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.871e+02 2.008e+02 2.185e+02 3.185e+02, threshold=4.015e+02, percent-clipped=0.0 2023-10-14 14:44:42,511 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1730148.0, ans=0.1 2023-10-14 14:44:47,762 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1730148.0, ans=0.0 2023-10-14 14:44:52,156 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1730194.6666666667, ans=0.125 2023-10-14 14:45:00,555 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1730194.6666666667, ans=0.0 2023-10-14 14:45:02,941 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=4.55 vs. limit=15.0 2023-10-14 14:45:09,754 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.76 vs. limit=22.5 2023-10-14 14:45:24,514 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 14:45:25,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1730288.0, ans=0.0 2023-10-14 14:45:44,487 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1730334.6666666667, ans=0.125 2023-10-14 14:46:22,217 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1730381.3333333333, ans=0.125 2023-10-14 14:46:45,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1730474.6666666667, ans=0.2 2023-10-14 14:47:04,490 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1730521.3333333333, ans=0.125 2023-10-14 14:47:32,810 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.472e+02 1.818e+02 2.055e+02 2.294e+02 2.929e+02, threshold=4.110e+02, percent-clipped=0.0 2023-10-14 14:48:40,532 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1730801.3333333333, ans=0.125 2023-10-14 14:48:48,489 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1730801.3333333333, ans=0.1 2023-10-14 14:48:51,234 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1730801.3333333333, ans=0.0 2023-10-14 14:48:56,530 INFO 
[scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1730848.0, ans=0.125 2023-10-14 14:49:00,119 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.62 vs. limit=22.5 2023-10-14 14:49:00,256 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.20 vs. limit=15.0 2023-10-14 14:49:11,368 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1730894.6666666667, ans=0.125 2023-10-14 14:49:38,344 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 14:49:51,560 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 14:50:07,009 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.609e+02 1.954e+02 2.147e+02 2.375e+02 3.281e+02, threshold=4.294e+02, percent-clipped=0.0 2023-10-14 14:50:09,017 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.11 vs. limit=15.0 2023-10-14 14:50:16,902 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.71 vs. limit=12.0 2023-10-14 14:50:22,092 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1731128.0, ans=0.0 2023-10-14 14:50:29,405 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.40 vs. limit=15.0 2023-10-14 14:50:30,709 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1731128.0, ans=0.125 2023-10-14 14:50:58,540 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1731221.3333333333, ans=0.0 2023-10-14 14:51:04,938 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1731268.0, ans=0.125 2023-10-14 14:51:14,418 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1731314.6666666667, ans=0.125 2023-10-14 14:51:15,562 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1731314.6666666667, ans=0.2 2023-10-14 14:51:19,684 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=1731314.6666666667, ans=0.05 2023-10-14 14:51:48,516 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 14:52:18,159 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1731501.3333333333, ans=0.0 2023-10-14 14:52:18,907 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.519e+02 1.956e+02 2.152e+02 2.408e+02 3.451e+02, threshold=4.305e+02, percent-clipped=0.0 2023-10-14 14:53:05,101 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.84 vs. 
limit=15.0 2023-10-14 14:53:09,664 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1731688.0, ans=0.1 2023-10-14 14:53:40,102 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1731734.6666666667, ans=0.1 2023-10-14 14:53:58,941 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1731828.0, ans=0.1 2023-10-14 14:54:37,255 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.91 vs. limit=15.0 2023-10-14 14:54:47,210 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.622e+02 1.829e+02 1.961e+02 2.138e+02 2.613e+02, threshold=3.922e+02, percent-clipped=0.0 2023-10-14 14:54:50,807 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1732014.6666666667, ans=0.125 2023-10-14 14:55:13,517 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1732108.0, ans=0.125 2023-10-14 14:55:15,112 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1732108.0, ans=0.5 2023-10-14 14:55:44,129 INFO [train.py:1031] (0/4) Epoch 28, batch 2500, loss[loss=0.199, simple_loss=0.2864, pruned_loss=0.05583, over 16587.00 frames. ], tot_loss[loss=0.1852, simple_loss=0.278, pruned_loss=0.0462, over 23407781.13 frames. ], batch size: 241, lr: 1.23e-03, grad_scale: 32.0 2023-10-14 14:56:00,186 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1732248.0, ans=0.125 2023-10-14 14:56:30,369 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1732341.3333333333, ans=0.125 2023-10-14 14:56:30,641 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.39 vs. 
limit=6.0 2023-10-14 14:56:49,651 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=1732388.0, ans=0.1 2023-10-14 14:56:55,407 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1732388.0, ans=0.0 2023-10-14 14:57:03,826 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1732434.6666666667, ans=0.125 2023-10-14 14:57:11,356 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.538e+02 1.899e+02 2.042e+02 2.204e+02 2.818e+02, threshold=4.084e+02, percent-clipped=0.0 2023-10-14 14:57:11,634 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1732434.6666666667, ans=0.125 2023-10-14 14:57:28,007 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1732481.3333333333, ans=0.0 2023-10-14 14:57:31,098 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1732528.0, ans=0.0 2023-10-14 14:57:50,781 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1732574.6666666667, ans=0.1 2023-10-14 14:58:10,908 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1732621.3333333333, ans=0.1 2023-10-14 14:58:20,087 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1732668.0, ans=0.0 2023-10-14 14:58:47,521 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1732761.3333333333, ans=0.125 2023-10-14 14:58:57,606 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.16 vs. limit=15.0 2023-10-14 14:59:01,837 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1732808.0, ans=0.0 2023-10-14 14:59:20,420 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1732854.6666666667, ans=0.125 2023-10-14 14:59:26,504 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=9.60 vs. 
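The balancer entries (min_positive, max_positive, min_abs, max_abs, and a prob such as ans=0.125) belong to icefall's activation balancers, which with probability prob add small gradient terms pushing each channel's fraction of positive activations and its mean absolute value back inside the configured bounds. A sketch of the per-channel statistics being policed; the gradient-correction step itself is omitted, and the default bounds here are illustrative:

```python
import torch

def balancer_violations(x: torch.Tensor,
                        min_positive: float = 0.05,
                        max_positive: float = 0.95,
                        min_abs: float = 0.1,
                        max_abs: float = 10.0,
                        channel_dim: int = -1):
    """Per-channel checks behind the balancer log entries (sketch only).

    Returns boolean masks over channels whose fraction of positive values
    or mean |x| lies outside the configured bounds; the real module fixes
    these via gradient nudges applied with probability `prob`.
    """
    channel_dim = channel_dim % x.dim()
    dims = [d for d in range(x.dim()) if d != channel_dim]
    frac_positive = (x > 0).float().mean(dim=dims)
    mean_abs = x.abs().mean(dim=dims)
    sign_bad = (frac_positive < min_positive) | (frac_positive > max_positive)
    magnitude_bad = (mean_abs < min_abs) | (mean_abs > max_abs)
    return sign_bad, magnitude_bad
```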
limit=22.5 2023-10-14 14:59:38,137 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1732901.3333333333, ans=0.2 2023-10-14 14:59:38,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1732901.3333333333, ans=0.125 2023-10-14 14:59:43,670 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.540e+02 1.849e+02 1.974e+02 2.089e+02 2.843e+02, threshold=3.949e+02, percent-clipped=0.0 2023-10-14 14:59:43,917 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1732901.3333333333, ans=0.0 2023-10-14 14:59:44,890 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1732948.0, ans=0.05 2023-10-14 14:59:50,505 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1732948.0, ans=0.0 2023-10-14 15:00:09,264 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1732994.6666666667, ans=0.2 2023-10-14 15:00:42,727 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 15:00:51,484 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1733134.6666666667, ans=0.0 2023-10-14 15:01:40,758 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.99 vs. limit=15.0 2023-10-14 15:02:16,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1733368.0, ans=0.0 2023-10-14 15:02:21,673 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.01 vs. limit=15.0 2023-10-14 15:02:25,226 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.596e+02 1.833e+02 2.051e+02 2.238e+02 2.778e+02, threshold=4.101e+02, percent-clipped=0.0 2023-10-14 15:02:32,385 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.07 vs. limit=22.5 2023-10-14 15:02:45,390 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1733461.3333333333, ans=0.2 2023-10-14 15:03:32,479 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1733554.6666666667, ans=0.1 2023-10-14 15:04:09,874 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1733694.6666666667, ans=0.125 2023-10-14 15:04:18,607 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1733741.3333333333, ans=0.5 2023-10-14 15:04:34,028 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.16 vs. 
limit=8.0 2023-10-14 15:04:44,773 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1733788.0, ans=0.0 2023-10-14 15:05:01,285 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1733834.6666666667, ans=0.1 2023-10-14 15:05:02,293 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1733834.6666666667, ans=0.05 2023-10-14 15:05:07,684 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.542e+02 1.829e+02 2.030e+02 2.198e+02 3.122e+02, threshold=4.059e+02, percent-clipped=0.0 2023-10-14 15:05:19,036 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1733881.3333333333, ans=0.0 2023-10-14 15:05:24,043 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1733881.3333333333, ans=0.1 2023-10-14 15:05:27,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1733928.0, ans=0.1 2023-10-14 15:05:29,135 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1733928.0, ans=0.125 2023-10-14 15:05:57,626 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1733974.6666666667, ans=0.0 2023-10-14 15:06:05,888 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1734021.3333333333, ans=0.0 2023-10-14 15:06:34,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1734114.6666666667, ans=0.2 2023-10-14 15:07:03,084 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1734208.0, ans=0.2 2023-10-14 15:07:18,828 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1734254.6666666667, ans=0.125 2023-10-14 15:07:30,020 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1734301.3333333333, ans=0.125 2023-10-14 15:07:43,787 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.806e+02 2.035e+02 2.237e+02 2.651e+02, threshold=4.070e+02, percent-clipped=0.0 2023-10-14 15:08:44,087 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1734534.6666666667, ans=0.1 2023-10-14 15:08:44,838 INFO [train.py:1031] (0/4) Epoch 28, batch 3000, loss[loss=0.1833, simple_loss=0.2788, pruned_loss=0.04387, over 16961.00 frames. ], tot_loss[loss=0.185, simple_loss=0.2774, pruned_loss=0.04632, over 25471328.88 frames. ], batch size: 138, lr: 1.23e-03, grad_scale: 16.0 2023-10-14 15:08:46,517 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.19 vs. 
limit=15.0 2023-10-14 15:08:54,208 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 15:09:26,120 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1734674.6666666667, ans=0.125 2023-10-14 15:09:38,177 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1734721.3333333333, ans=0.09899494936611666 2023-10-14 15:10:00,302 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 15:10:04,912 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1734768.0, ans=0.125 2023-10-14 15:10:07,028 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.532e+02 1.771e+02 1.901e+02 2.082e+02 2.891e+02, threshold=3.802e+02, percent-clipped=0.0 2023-10-14 15:10:09,952 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1734814.6666666667, ans=0.0 2023-10-14 15:11:41,532 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1735048.0, ans=0.2 2023-10-14 15:11:46,857 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1735094.6666666667, ans=0.125 2023-10-14 15:11:47,098 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.42 vs. limit=12.0 2023-10-14 15:11:56,178 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1735094.6666666667, ans=0.2 2023-10-14 15:12:07,614 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 15:12:12,217 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1735141.3333333333, ans=0.0 2023-10-14 15:12:24,241 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1735188.0, ans=0.125 2023-10-14 15:12:27,264 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.48 vs. limit=12.0 2023-10-14 15:12:38,985 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1735234.6666666667, ans=0.1 2023-10-14 15:12:45,704 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.477e+02 1.860e+02 1.988e+02 2.175e+02 3.016e+02, threshold=3.977e+02, percent-clipped=0.0 2023-10-14 15:13:14,902 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.81 vs. 
limit=6.0 2023-10-14 15:13:32,032 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1735374.6666666667, ans=0.0 2023-10-14 15:13:49,454 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1735421.3333333333, ans=0.125 2023-10-14 15:14:11,470 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1735514.6666666667, ans=0.125 2023-10-14 15:14:25,315 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1735561.3333333333, ans=0.0 2023-10-14 15:14:32,137 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1735561.3333333333, ans=0.125 2023-10-14 15:14:41,211 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1735561.3333333333, ans=0.125 2023-10-14 15:14:54,774 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.07 vs. limit=15.0 2023-10-14 15:15:32,810 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1735701.3333333333, ans=0.125 2023-10-14 15:15:47,510 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.882e+02 2.003e+02 2.133e+02 3.144e+02, threshold=4.005e+02, percent-clipped=0.0 2023-10-14 15:15:48,639 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.24 vs. limit=15.0 2023-10-14 15:16:22,190 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1735794.6666666667, ans=0.0 2023-10-14 15:16:23,376 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1735794.6666666667, ans=0.1 2023-10-14 15:16:26,104 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=5.95 vs. limit=15.0 2023-10-14 15:16:41,994 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.15 vs. limit=15.0 2023-10-14 15:16:46,183 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1735888.0, ans=0.125 2023-10-14 15:16:55,248 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1735888.0, ans=0.125 2023-10-14 15:16:58,976 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1735888.0, ans=0.125 2023-10-14 15:17:00,095 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1735934.6666666667, ans=0.125 2023-10-14 15:18:02,547 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.28 vs. 
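In every [optim.py:471] entry the five grad-norm numbers read naturally as (min, 25%, median, 75%, max) of recently observed gradient norms, and the reported threshold equals Clipping_scale times the median (e.g. 2.0 * 1.901e+02 = 3.802e+02 in the entry above); percent-clipped is the fraction of steps whose norm exceeded the threshold. A sketch of that bookkeeping, assuming a fixed-size history window; this is an illustrative reconstruction, not the optimizer in optim.py:

    import collections
    import statistics

    class GradNormReporter:
        """Quartile bookkeeping behind the [optim.py:471] lines (illustrative)."""

        def __init__(self, clipping_scale=2.0, history=128):
            self.clipping_scale = clipping_scale
            self.norms = collections.deque(maxlen=history)
            self.clipped = 0
            self.total = 0

        def observe(self, grad_norm: float):
            self.norms.append(grad_norm)
            if len(self.norms) >= 4:
                q1, median, q3 = statistics.quantiles(self.norms, n=4)
            else:
                q1 = median = q3 = grad_norm
            threshold = self.clipping_scale * median
            self.total += 1
            if grad_norm > threshold:
                self.clipped += 1
            return (min(self.norms), q1, median, q3, max(self.norms),
                    threshold, 100.0 * self.clipped / self.total)

    reporter = GradNormReporter()
    for norm in (152.0, 180.0, 190.0, 210.0, 290.0):
        stats = reporter.observe(norm)
    # Norms stay under 2x the median, so percent-clipped is 0.0, as in
    # almost every entry of this stretch of the log.
    print(stats)
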
limit=22.5 2023-10-14 15:18:43,238 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 15:18:57,377 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1736168.0, ans=0.125 2023-10-14 15:19:04,726 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1736168.0, ans=0.0 2023-10-14 15:19:20,456 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.606e+02 1.896e+02 2.034e+02 2.281e+02 3.076e+02, threshold=4.069e+02, percent-clipped=0.0 2023-10-14 15:19:20,703 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1736214.6666666667, ans=0.95 2023-10-14 15:19:25,345 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1736214.6666666667, ans=0.2 2023-10-14 15:19:46,001 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1736261.3333333333, ans=0.125 2023-10-14 15:20:30,461 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1736308.0, ans=0.1 2023-10-14 15:21:30,190 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.27 vs. limit=10.0 2023-10-14 15:21:57,649 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1736448.0, ans=0.125 2023-10-14 15:22:26,415 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1736541.3333333333, ans=0.1 2023-10-14 15:22:26,432 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1736541.3333333333, ans=0.125 2023-10-14 15:22:39,422 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.64 vs. 
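The [scaling.py:979] Whitening lines track how far a feature covariance has drifted from "white" (a multiple of the identity): metric measures the anisotropy of the covariance within each channel group and is compared against a scheduled limit (the limits of 6.0, 15.0 and 22.5 above belong to different module types). A sketch of one plausible such metric, which equals 1.0 for perfectly white features and grows with anisotropy; the exact formula in scaling.py may differ:

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int) -> float:
        # x: (num_frames, num_channels), channels split into num_groups groups
        num_frames, num_channels = x.shape
        x = x - x.mean(dim=0, keepdim=True)
        x = x.reshape(num_frames, num_groups, num_channels // num_groups)
        worst = 0.0
        for g in range(num_groups):
            xg = x[:, g, :]
            cov = xg.t() @ xg / num_frames
            eigs = torch.linalg.eigvalsh(cov)
            dim = eigs.numel()
            # 1.0 when all eigenvalues are equal (white), larger otherwise
            worst = max(worst, (dim * (eigs ** 2).sum() / eigs.sum() ** 2).item())
        return worst

    x = torch.randn(2000, 384)                 # near-white input
    print(whitening_metric(x, num_groups=1))   # close to 1.0, far below a limit of 22.5
    x[:, 0] *= 30.0                            # one dominant direction
    print(whitening_metric(x, num_groups=1))   # metric jumps well above the limit
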
limit=6.0 2023-10-14 15:23:39,152 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.651e+02 1.914e+02 2.061e+02 2.289e+02 2.978e+02, threshold=4.121e+02, percent-clipped=0.0 2023-10-14 15:23:49,347 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1736681.3333333333, ans=0.0 2023-10-14 15:23:53,903 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1736681.3333333333, ans=0.0 2023-10-14 15:23:55,875 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1736681.3333333333, ans=0.1 2023-10-14 15:24:23,026 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1736728.0, ans=0.1 2023-10-14 15:24:24,692 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 15:24:49,579 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1736774.6666666667, ans=0.125 2023-10-14 15:25:16,230 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1736821.3333333333, ans=0.0 2023-10-14 15:25:21,373 INFO [train.py:1031] (0/4) Epoch 28, batch 3500, loss[loss=0.1989, simple_loss=0.2899, pruned_loss=0.05399, over 16873.00 frames. ], tot_loss[loss=0.1854, simple_loss=0.2775, pruned_loss=0.04663, over 27068421.20 frames. ], batch size: 188, lr: 1.23e-03, grad_scale: 16.0 2023-10-14 15:25:38,547 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1736914.6666666667, ans=0.125 2023-10-14 15:26:23,904 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1736961.3333333333, ans=0.07 2023-10-14 15:26:41,930 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1737008.0, ans=0.0 2023-10-14 15:26:47,372 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=1737008.0, ans=0.1 2023-10-14 15:26:59,923 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.39 vs. 
limit=15.0 2023-10-14 15:27:37,209 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1737101.3333333333, ans=0.125 2023-10-14 15:27:56,395 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1737101.3333333333, ans=0.0 2023-10-14 15:28:00,929 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.532e+02 1.860e+02 2.008e+02 2.208e+02 4.351e+02, threshold=4.017e+02, percent-clipped=1.0 2023-10-14 15:28:21,598 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1737148.0, ans=0.125 2023-10-14 15:28:41,168 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 15:28:41,236 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1737194.6666666667, ans=0.09899494936611666 2023-10-14 15:28:51,019 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1737241.3333333333, ans=0.0 2023-10-14 15:28:53,091 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1737241.3333333333, ans=0.1 2023-10-14 15:29:44,099 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1737288.0, ans=0.1 2023-10-14 15:30:07,081 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1737334.6666666667, ans=0.125 2023-10-14 15:31:13,681 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 15:31:15,344 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1737428.0, ans=0.125 2023-10-14 15:33:01,795 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.535e+02 1.806e+02 1.972e+02 2.156e+02 3.674e+02, threshold=3.943e+02, percent-clipped=0.0 2023-10-14 15:33:45,535 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1737661.3333333333, ans=0.2 2023-10-14 15:33:49,348 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1737661.3333333333, ans=0.1 2023-10-14 15:33:59,282 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.79 vs. 
limit=15.0 2023-10-14 15:34:20,080 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1737708.0, ans=0.0 2023-10-14 15:34:30,028 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1737754.6666666667, ans=0.125 2023-10-14 15:34:38,029 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_ff3.min_abs, batch_count=1737754.6666666667, ans=0.2 2023-10-14 15:34:52,981 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1737801.3333333333, ans=0.125 2023-10-14 15:35:07,296 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1737848.0, ans=0.2 2023-10-14 15:35:09,392 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1737848.0, ans=10.0 2023-10-14 15:35:09,660 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.80 vs. limit=15.0 2023-10-14 15:35:44,532 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1737941.3333333333, ans=0.0 2023-10-14 15:35:44,538 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1737941.3333333333, ans=0.035 2023-10-14 15:35:51,123 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1737941.3333333333, ans=0.125 2023-10-14 15:36:20,093 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1738034.6666666667, ans=0.125 2023-10-14 15:36:26,269 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1738081.3333333333, ans=0.125 2023-10-14 15:36:26,949 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.513e+02 1.786e+02 2.013e+02 2.207e+02 2.682e+02, threshold=4.027e+02, percent-clipped=0.0 2023-10-14 15:36:33,537 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1738081.3333333333, ans=0.09899494936611666 2023-10-14 15:36:49,138 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1738128.0, ans=0.125 2023-10-14 15:36:55,048 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1738174.6666666667, ans=0.125 2023-10-14 15:37:06,933 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1738221.3333333333, ans=0.0 2023-10-14 15:37:23,746 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1738268.0, ans=0.07 2023-10-14 15:37:30,841 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1738314.6666666667, ans=0.0 2023-10-14 15:37:48,806 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1738361.3333333333, ans=0.125 2023-10-14 15:37:52,973 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1738361.3333333333, ans=0.5 2023-10-14 15:38:07,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1738454.6666666667, ans=0.2 2023-10-14 15:38:29,556 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.513e+02 1.788e+02 1.940e+02 2.196e+02 3.463e+02, threshold=3.880e+02, percent-clipped=0.0 2023-10-14 15:38:37,086 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.88 vs. limit=22.5 2023-10-14 15:38:54,263 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1738641.3333333333, ans=0.0 2023-10-14 15:39:08,322 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1738688.0, ans=0.125 2023-10-14 15:39:32,870 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1738781.3333333333, ans=0.1 2023-10-14 15:39:48,838 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1738874.6666666667, ans=0.1 2023-10-14 15:39:48,990 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.28 vs. limit=10.0 2023-10-14 15:39:50,764 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1738874.6666666667, ans=0.07 2023-10-14 15:39:57,832 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1738921.3333333333, ans=0.125 2023-10-14 15:40:18,497 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1738968.0, ans=0.125 2023-10-14 15:40:21,992 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.477e+02 1.788e+02 1.984e+02 2.218e+02 3.006e+02, threshold=3.969e+02, percent-clipped=0.0 2023-10-14 15:40:56,383 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1739154.6666666667, ans=10.0 2023-10-14 15:41:01,756 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=1739154.6666666667, ans=0.025 2023-10-14 15:41:04,895 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1739154.6666666667, ans=0.09899494936611666 2023-10-14 15:41:09,641 INFO [train.py:1031] (0/4) Epoch 28, batch 4000, loss[loss=0.1982, simple_loss=0.2879, pruned_loss=0.05426, over 15569.00 frames. ], tot_loss[loss=0.1852, simple_loss=0.277, pruned_loss=0.0467, over 28327340.08 frames. 
], batch size: 35, lr: 1.23e-03, grad_scale: 32.0 2023-10-14 15:41:10,096 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1739201.3333333333, ans=0.0 2023-10-14 15:41:22,634 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 15:41:56,363 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1739341.3333333333, ans=0.1 2023-10-14 15:42:02,372 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.03 vs. limit=15.0 2023-10-14 15:42:22,283 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.609e+02 1.840e+02 2.041e+02 2.289e+02 3.355e+02, threshold=4.081e+02, percent-clipped=0.0 2023-10-14 15:42:30,225 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1739481.3333333333, ans=0.125 2023-10-14 15:42:33,696 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1739528.0, ans=0.0 2023-10-14 15:42:36,891 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1739528.0, ans=0.0 2023-10-14 15:42:42,659 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.11 vs. limit=10.0 2023-10-14 15:42:43,290 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1739574.6666666667, ans=0.0 2023-10-14 15:42:43,316 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1739574.6666666667, ans=0.125 2023-10-14 15:42:44,385 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=8.48 vs. limit=12.0 2023-10-14 15:42:47,738 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1739574.6666666667, ans=0.0 2023-10-14 15:42:50,771 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1739574.6666666667, ans=0.2 2023-10-14 15:42:51,172 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=13.78 vs. limit=15.0 2023-10-14 15:43:07,884 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=12.19 vs. limit=15.0 2023-10-14 15:43:09,161 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1739668.0, ans=0.125 2023-10-14 15:43:11,394 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1739668.0, ans=0.2 2023-10-14 15:43:12,465 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1739668.0, ans=0.125 2023-10-14 15:43:15,815 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.34 vs. 
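The [train.py:1031] summaries carry two consistent patterns. First, in every entry loss = 0.5 * simple_loss + pruned_loss (e.g. 0.5 * 0.2879 + 0.05426 = 0.1982 for the batch-4000 line above), i.e. the logged loss is the pruned-transducer objective with the simple (non-pruned) term down-weighted by half. Second, tot_loss behaves like a frame-weighted average over an exponentially decayed window: its frame count climbs from ~25.5M at batch 3000 toward a plateau rather than growing linearly, consistent with a per-batch decay of roughly 1 - 1/2000. A sketch of such a tracker, with the decay constant assumed:

    class RunningLossTracker:
        """Decayed frame-weighted average behind tot_loss (illustrative)."""

        def __init__(self, decay=1.0 - 1.0 / 2000):  # decay constant assumed
            self.decay = decay
            self.loss_sum = 0.0
            self.frames = 0.0

        def update(self, loss: float, num_frames: float) -> float:
            self.loss_sum = self.loss_sum * self.decay + loss * num_frames
            self.frames = self.frames * self.decay + num_frames
            return self.loss_sum / self.frames  # the tot_loss that is logged

    # The loss composition observed in every summary line of this log:
    assert abs(0.5 * 0.2879 + 0.05426 - 0.1982) < 5e-4

    tracker = RunningLossTracker()
    for _ in range(10000):
        tracker.update(loss=0.185, num_frames=16000.0)
    print(tracker.frames)  # plateaus near 16000 * 2000 = 3.2e7, like the log
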
limit=15.0 2023-10-14 15:43:53,368 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1739854.6666666667, ans=0.1 2023-10-14 15:43:54,359 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1739854.6666666667, ans=0.125 2023-10-14 15:44:02,799 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1739854.6666666667, ans=0.1 2023-10-14 15:44:17,818 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.561e+02 1.841e+02 1.969e+02 2.167e+02 3.462e+02, threshold=3.939e+02, percent-clipped=0.0 2023-10-14 15:44:18,839 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1739948.0, ans=0.0 2023-10-14 15:44:27,720 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1739948.0, ans=0.125 2023-10-14 15:45:38,933 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.87 vs. limit=15.0 2023-10-14 15:45:44,962 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1740228.0, ans=0.125 2023-10-14 15:45:47,380 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=15.98 vs. limit=22.5 2023-10-14 15:45:48,347 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1740228.0, ans=0.1 2023-10-14 15:46:09,788 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1740321.3333333333, ans=0.0 2023-10-14 15:46:12,257 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1740321.3333333333, ans=0.1 2023-10-14 15:46:24,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1740368.0, ans=0.125 2023-10-14 15:46:31,011 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.559e+02 1.884e+02 2.041e+02 2.273e+02 3.474e+02, threshold=4.082e+02, percent-clipped=0.0 2023-10-14 15:46:42,598 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1740461.3333333333, ans=0.125 2023-10-14 15:46:43,521 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1740461.3333333333, ans=0.125 2023-10-14 15:46:45,886 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1740461.3333333333, ans=15.0 2023-10-14 15:46:46,861 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.77 vs. 
limit=15.0 2023-10-14 15:46:46,909 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1740461.3333333333, ans=15.0 2023-10-14 15:46:56,040 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1740508.0, ans=0.0 2023-10-14 15:46:58,327 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1740508.0, ans=0.125 2023-10-14 15:47:13,042 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1740601.3333333333, ans=0.0 2023-10-14 15:47:22,166 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.76 vs. limit=15.0 2023-10-14 15:47:22,340 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1740601.3333333333, ans=15.0 2023-10-14 15:47:35,360 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1740694.6666666667, ans=0.1 2023-10-14 15:47:35,748 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.93 vs. limit=6.0 2023-10-14 15:47:44,960 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.09 vs. limit=22.5 2023-10-14 15:47:46,138 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=1740741.3333333333, ans=0.05 2023-10-14 15:47:52,597 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.75 vs. limit=15.0 2023-10-14 15:48:00,533 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1740788.0, ans=0.125 2023-10-14 15:48:21,934 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.557e+02 1.896e+02 2.054e+02 2.245e+02 3.668e+02, threshold=4.108e+02, percent-clipped=0.0 2023-10-14 15:48:51,447 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1740974.6666666667, ans=0.0 2023-10-14 15:49:01,628 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1741021.3333333333, ans=0.0 2023-10-14 15:49:06,459 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.50 vs. limit=15.0 2023-10-14 15:49:13,534 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1741068.0, ans=0.125 2023-10-14 15:49:23,376 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1741114.6666666667, ans=0.07 2023-10-14 15:49:23,543 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.58 vs. 
limit=22.5 2023-10-14 15:49:26,643 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1741114.6666666667, ans=0.2 2023-10-14 15:49:35,107 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1741161.3333333333, ans=0.09899494936611666 2023-10-14 15:49:55,096 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1741208.0, ans=15.0 2023-10-14 15:50:16,799 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1741301.3333333333, ans=0.125 2023-10-14 15:50:29,338 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.590e+02 1.917e+02 2.089e+02 2.285e+02 3.043e+02, threshold=4.177e+02, percent-clipped=0.0 2023-10-14 15:51:11,894 INFO [train.py:1031] (0/4) Epoch 28, batch 4500, loss[loss=0.1957, simple_loss=0.2923, pruned_loss=0.04953, over 16866.00 frames. ], tot_loss[loss=0.1853, simple_loss=0.2774, pruned_loss=0.04656, over 29305151.13 frames. ], batch size: 188, lr: 1.23e-03, grad_scale: 16.0 2023-10-14 15:51:17,681 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1741534.6666666667, ans=0.125 2023-10-14 15:51:24,864 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1741581.3333333333, ans=0.125 2023-10-14 15:51:32,869 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1741628.0, ans=0.2 2023-10-14 15:51:36,564 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1741628.0, ans=0.0 2023-10-14 15:52:23,203 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.501e+02 1.849e+02 2.062e+02 2.330e+02 3.275e+02, threshold=4.124e+02, percent-clipped=0.0 2023-10-14 15:52:40,403 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1741861.3333333333, ans=0.2 2023-10-14 15:53:00,398 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1741954.6666666667, ans=0.0 2023-10-14 15:53:09,281 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1742001.3333333333, ans=0.125 2023-10-14 15:53:14,640 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1742048.0, ans=0.125 2023-10-14 15:53:20,836 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1742048.0, ans=0.09899494936611666 2023-10-14 15:53:36,376 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1742141.3333333333, ans=0.125 2023-10-14 15:53:38,284 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1742141.3333333333, ans=0.0 2023-10-14 15:53:53,415 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1742188.0, ans=0.125 2023-10-14 15:53:58,916 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1742234.6666666667, ans=0.0 2023-10-14 15:53:58,951 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1742234.6666666667, ans=0.125 2023-10-14 15:54:07,758 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.79 vs. limit=15.0 2023-10-14 15:54:12,635 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.544e+02 1.804e+02 2.036e+02 2.187e+02 2.860e+02, threshold=4.073e+02, percent-clipped=0.0 2023-10-14 15:54:48,745 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.65 vs. limit=15.0 2023-10-14 15:54:49,549 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1742421.3333333333, ans=15.0 2023-10-14 15:54:56,491 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1742468.0, ans=0.125 2023-10-14 15:55:26,024 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1742561.3333333333, ans=0.125 2023-10-14 15:55:26,528 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.12 vs. limit=22.5 2023-10-14 15:55:33,409 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1742608.0, ans=0.0 2023-10-14 15:55:33,473 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.48 vs. limit=12.0 2023-10-14 15:55:39,567 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1742654.6666666667, ans=0.125 2023-10-14 15:55:46,534 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.58 vs. limit=15.0 2023-10-14 15:56:04,228 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.590e+02 1.855e+02 2.027e+02 2.262e+02 3.196e+02, threshold=4.053e+02, percent-clipped=0.0 2023-10-14 15:56:10,707 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1742794.6666666667, ans=0.0 2023-10-14 15:56:11,963 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.79 vs. limit=10.0 2023-10-14 15:56:34,331 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1742888.0, ans=0.125 2023-10-14 15:56:50,779 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.63 vs. 
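Many of the ScheduledFloat names above end in balancer parameters: min_positive and max_positive bound the per-channel fraction of positive activations, min_abs and max_abs bound the per-channel mean absolute value, and prob (typically ans=0.125 in this stretch) is the probability that the constraint is enforced on a given step. The statistics being constrained, as a sketch; the corrective-gradient machinery of scaling.py's balancers is omitted:

    import torch

    def balancer_stats(x: torch.Tensor):
        # x: (num_frames, num_channels)
        frac_positive = (x > 0).float().mean(dim=0)  # kept in [min_positive, max_positive]
        mean_abs = x.abs().mean(dim=0)               # kept in [min_abs, max_abs]
        return frac_positive, mean_abs

    x = torch.randn(400, 512)
    frac_positive, mean_abs = balancer_stats(x)
    # Unit-Gaussian activations sit near 0.5 and 0.8 respectively; a balancer
    # with min_positive=0.05, max_positive=0.95, max_abs=10.0 (values logged
    # above) would leave such channels untouched.
    print(frac_positive.mean().item(), mean_abs.mean().item())
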
limit=6.0 2023-10-14 15:57:16,743 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1743028.0, ans=0.0 2023-10-14 15:57:18,029 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1743028.0, ans=0.125 2023-10-14 15:57:55,609 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1743168.0, ans=0.1 2023-10-14 15:58:00,147 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.507e+02 1.830e+02 1.950e+02 2.107e+02 3.143e+02, threshold=3.900e+02, percent-clipped=0.0 2023-10-14 15:58:17,035 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1743261.3333333333, ans=0.0 2023-10-14 15:58:20,507 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.07 vs. limit=15.0 2023-10-14 15:58:22,958 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1743308.0, ans=0.0 2023-10-14 15:58:24,101 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1743308.0, ans=0.1 2023-10-14 15:58:39,368 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1743354.6666666667, ans=0.125 2023-10-14 15:58:43,173 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.35 vs. limit=15.0 2023-10-14 15:58:50,369 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1743401.3333333333, ans=0.0 2023-10-14 15:59:02,320 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.18 vs. 
limit=8.0 2023-10-14 15:59:04,780 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1743448.0, ans=0.0 2023-10-14 15:59:16,889 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1743494.6666666667, ans=0.2 2023-10-14 15:59:28,776 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1743541.3333333333, ans=0.04949747468305833 2023-10-14 15:59:43,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1743634.6666666667, ans=0.1 2023-10-14 15:59:45,917 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 15:59:48,776 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1743634.6666666667, ans=0.125 2023-10-14 15:59:50,602 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1743634.6666666667, ans=0.0 2023-10-14 15:59:58,536 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.534e+02 1.794e+02 1.979e+02 2.137e+02 2.991e+02, threshold=3.959e+02, percent-clipped=0.0 2023-10-14 15:59:58,854 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1743681.3333333333, ans=0.125 2023-10-14 16:00:05,782 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1743728.0, ans=0.1 2023-10-14 16:00:16,587 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1743774.6666666667, ans=0.125 2023-10-14 16:00:20,285 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.72 vs. limit=22.5 2023-10-14 16:00:20,400 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.74 vs. limit=6.0 2023-10-14 16:00:25,186 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1743774.6666666667, ans=0.125 2023-10-14 16:00:27,107 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1743821.3333333333, ans=0.125 2023-10-14 16:00:35,492 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1743821.3333333333, ans=0.0 2023-10-14 16:00:37,929 INFO [train.py:1031] (0/4) Epoch 28, batch 5000, loss[loss=0.1793, simple_loss=0.2751, pruned_loss=0.0417, over 16908.00 frames. ], tot_loss[loss=0.1851, simple_loss=0.2772, pruned_loss=0.04656, over 30079656.04 frames. ], batch size: 165, lr: 1.23e-03, grad_scale: 32.0 2023-10-14 16:00:52,774 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1743914.6666666667, ans=0.0 2023-10-14 16:00:56,779 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.12 vs. 
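The grad_scale field in the batch summaries (16.0 at batches 3000, 3500 and 4500; 32.0 at batches 4000 and 5000) moves in powers of two, the signature of dynamic loss scaling for fp16 training. A minimal sketch of the mechanism using PyTorch's stock GradScaler; the training script may wrap its own variant, and this snippet assumes a CUDA device:

    import torch

    model = torch.nn.Linear(80, 500).cuda()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1.23e-03)
    scaler = torch.cuda.amp.GradScaler()

    features = torch.randn(8, 80, device="cuda")
    with torch.cuda.amp.autocast():
        loss = model(features).pow(2).mean()

    scaler.scale(loss).backward()  # backward runs at the current loss scale
    scaler.step(optimizer)         # unscales grads; skips the step on inf/nan
    scaler.update()                # halves the scale after an overflow and
                                   # doubles it after a run of clean steps,
                                   # hence the 16.0 <-> 32.0 seen in the log
    print(scaler.get_scale())
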
limit=22.5 2023-10-14 16:01:18,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1744008.0, ans=0.0 2023-10-14 16:01:46,157 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1744101.3333333333, ans=0.125 2023-10-14 16:01:50,889 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.611e+02 1.864e+02 2.085e+02 2.384e+02 3.644e+02, threshold=4.169e+02, percent-clipped=0.0 2023-10-14 16:01:57,556 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1744194.6666666667, ans=0.0 2023-10-14 16:02:07,014 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1744194.6666666667, ans=0.0 2023-10-14 16:02:10,953 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=1744241.3333333333, ans=0.05 2023-10-14 16:02:27,489 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1744288.0, ans=0.0 2023-10-14 16:03:40,057 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1744568.0, ans=0.125 2023-10-14 16:03:44,899 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1744568.0, ans=0.125 2023-10-14 16:03:48,209 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.94 vs. limit=15.0 2023-10-14 16:03:52,463 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.518e+02 1.902e+02 2.070e+02 2.236e+02 3.472e+02, threshold=4.140e+02, percent-clipped=0.0 2023-10-14 16:03:59,109 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.65 vs. limit=15.0 2023-10-14 16:04:13,163 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1744708.0, ans=0.0 2023-10-14 16:04:14,071 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1744708.0, ans=0.95 2023-10-14 16:04:23,176 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.06 vs. 
limit=15.0 2023-10-14 16:04:34,181 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1744801.3333333333, ans=0.0 2023-10-14 16:05:09,519 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1744941.3333333333, ans=0.125 2023-10-14 16:05:11,259 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1744941.3333333333, ans=0.0 2023-10-14 16:05:14,308 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1744988.0, ans=0.125 2023-10-14 16:05:46,759 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.464e+02 1.816e+02 1.943e+02 2.169e+02 3.281e+02, threshold=3.886e+02, percent-clipped=0.0 2023-10-14 16:05:49,269 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1745081.3333333333, ans=0.0 2023-10-14 16:06:11,152 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.92 vs. limit=15.0 2023-10-14 16:06:38,360 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1745268.0, ans=0.125 2023-10-14 16:06:41,790 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1745268.0, ans=0.125 2023-10-14 16:06:41,841 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1745268.0, ans=0.07 2023-10-14 16:06:41,852 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1745268.0, ans=0.07 2023-10-14 16:06:52,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1745314.6666666667, ans=0.1 2023-10-14 16:07:24,113 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.61 vs. 
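The bypass.scale_min, bypass.skip_rate and bypass_mid.scale_min values above belong to the residual bypass connections around each Zipformer layer: the layer output is blended with its input through a learned per-channel scale whose floor (scale_min, typically ans=0.2 above) is itself scheduled, and skip_rate reads naturally as the probability of bypassing the layer outright during training. A sketch of that blend, assuming the form y = x + scale * (f(x) - x); shapes, clamping, and the skip behaviour are illustrative assumptions:

    import torch

    class BypassSketch(torch.nn.Module):
        """Learned residual blend with a scheduled floor (illustrative)."""

        def __init__(self, num_channels, scale_min=0.2, skip_rate=0.07):
            super().__init__()
            self.scale = torch.nn.Parameter(torch.full((num_channels,), 0.5))
            self.scale_min = scale_min  # floor; ans=0.2 in the entries above
            self.skip_rate = skip_rate  # ans=0.07 in the entries above

        def forward(self, x, layer_out):
            if self.training and torch.rand(()) < self.skip_rate:
                return x                           # bypass the layer entirely
            scale = self.scale.clamp(min=self.scale_min, max=1.0)
            return x + scale * (layer_out - x)     # blend toward the layer output

    bypass = BypassSketch(num_channels=384)
    x = torch.randn(10, 384)
    print(bypass(x, layer_out=torch.tanh(x)).shape)  # torch.Size([10, 384])
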
limit=15.0 2023-10-14 16:07:52,179 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 1.708e+02 1.854e+02 2.089e+02 2.784e+02, threshold=3.707e+02, percent-clipped=0.0 2023-10-14 16:08:26,697 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1745688.0, ans=0.0 2023-10-14 16:08:48,848 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1745781.3333333333, ans=0.09899494936611666 2023-10-14 16:08:54,242 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1745828.0, ans=0.1 2023-10-14 16:09:34,822 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1745968.0, ans=0.125 2023-10-14 16:09:42,768 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.510e+02 1.876e+02 2.053e+02 2.249e+02 3.068e+02, threshold=4.106e+02, percent-clipped=0.0 2023-10-14 16:09:44,160 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1746014.6666666667, ans=0.0 2023-10-14 16:09:52,769 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1746061.3333333333, ans=0.025 2023-10-14 16:09:53,833 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1746061.3333333333, ans=0.0 2023-10-14 16:09:56,652 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1746061.3333333333, ans=0.0 2023-10-14 16:10:07,958 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1746108.0, ans=0.125 2023-10-14 16:10:15,431 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-14 16:10:20,033 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1746201.3333333333, ans=0.125 2023-10-14 16:10:20,619 INFO [train.py:1031] (0/4) Epoch 28, batch 5500, loss[loss=0.1874, simple_loss=0.2823, pruned_loss=0.04628, over 16481.00 frames. ], tot_loss[loss=0.1849, simple_loss=0.2769, pruned_loss=0.04641, over 30670044.09 frames. ], batch size: 266, lr: 1.23e-03, grad_scale: 16.0 2023-10-14 16:10:20,917 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1746201.3333333333, ans=0.125 2023-10-14 16:10:48,977 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1746294.6666666667, ans=0.2 2023-10-14 16:10:55,688 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1746341.3333333333, ans=0.0 2023-10-14 16:11:01,216 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.70 vs. 
limit=15.0 2023-10-14 16:11:17,468 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1746434.6666666667, ans=0.125 2023-10-14 16:11:23,143 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1746434.6666666667, ans=0.1 2023-10-14 16:11:29,312 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1746481.3333333333, ans=0.0 2023-10-14 16:11:32,629 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.459e+02 1.777e+02 1.973e+02 2.149e+02 2.980e+02, threshold=3.945e+02, percent-clipped=0.0 2023-10-14 16:11:38,403 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1746528.0, ans=0.125 2023-10-14 16:11:45,099 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.26 vs. limit=6.0 2023-10-14 16:11:52,770 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1746574.6666666667, ans=0.125 2023-10-14 16:12:22,709 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.09 vs. limit=10.0 2023-10-14 16:12:26,478 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1746714.6666666667, ans=0.125 2023-10-14 16:12:32,658 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1746761.3333333333, ans=0.125 2023-10-14 16:12:34,820 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1746761.3333333333, ans=0.125 2023-10-14 16:12:43,455 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=3.41 vs. limit=15.0 2023-10-14 16:12:48,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1746808.0, ans=0.2 2023-10-14 16:12:51,553 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.74 vs. limit=22.5 2023-10-14 16:13:01,729 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.78 vs. 
limit=22.5 2023-10-14 16:13:16,375 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1746948.0, ans=0.09899494936611666 2023-10-14 16:13:22,599 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1746948.0, ans=0.1 2023-10-14 16:13:23,196 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.843e+02 1.960e+02 2.255e+02 3.918e+02, threshold=3.920e+02, percent-clipped=0.0 2023-10-14 16:13:23,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1746948.0, ans=0.125 2023-10-14 16:13:28,455 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=1.301e-01 2023-10-14 16:13:50,030 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1747041.3333333333, ans=0.125 2023-10-14 16:13:55,649 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1747088.0, ans=0.0 2023-10-14 16:14:07,375 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.53 vs. limit=15.0 2023-10-14 16:14:09,643 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1747134.6666666667, ans=0.0 2023-10-14 16:14:16,519 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1747134.6666666667, ans=0.1 2023-10-14 16:14:16,594 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.94 vs. limit=15.0 2023-10-14 16:14:26,841 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1747181.3333333333, ans=0.0 2023-10-14 16:14:31,011 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1747228.0, ans=0.125 2023-10-14 16:14:41,903 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1747274.6666666667, ans=0.04949747468305833 2023-10-14 16:14:42,053 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1747274.6666666667, ans=0.1 2023-10-14 16:14:49,258 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1747274.6666666667, ans=0.125 2023-10-14 16:14:51,664 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1747274.6666666667, ans=0.125 2023-10-14 16:15:06,614 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=5.25 vs. 
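Almost every [scaling.py:1069] WithLoss entry reports loss-sum=0.000e+00, with occasional small nonzero values such as the 1.301e-01 on self_attn_weights above: these are auxiliary penalties attached to the attention weights that only accrue when a constraint is violated. The mechanism is an identity in the forward pass that splices an extra loss into the backward pass; a sketch of that autograd pattern follows, where the specific penalty and the periodic loss-sum reporting are assumptions:

    import torch

    class AttachLoss(torch.autograd.Function):
        """Forward is the identity on x; aux_loss receives gradient 1.0 in
        backward, i.e. it behaves as if added to the training objective."""

        @staticmethod
        def forward(ctx, x, aux_loss):
            ctx.aux_shape = aux_loss.shape
            return x

        @staticmethod
        def backward(ctx, grad_out):
            return grad_out, torch.ones(ctx.aux_shape,
                                        device=grad_out.device,
                                        dtype=grad_out.dtype)

    attn_weights = torch.rand(4, 16, 16, requires_grad=True)
    # Hypothetical penalty: zero while the weights stay in range, which is
    # why most modules above report loss-sum=0.000e+00.
    penalty = (attn_weights - 1.0).clamp(min=0.0).sum()
    attn_weights_out = AttachLoss.apply(attn_weights, penalty)
    print(penalty.item())  # what a loss-sum report would accumulate
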
limit=15.0 2023-10-14 16:15:19,002 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1747414.6666666667, ans=0.125 2023-10-14 16:15:21,875 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1747414.6666666667, ans=0.125 2023-10-14 16:15:23,551 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.852e+02 2.049e+02 2.367e+02 3.817e+02, threshold=4.098e+02, percent-clipped=0.0 2023-10-14 16:15:50,558 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1747508.0, ans=0.1 2023-10-14 16:16:03,431 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1747601.3333333333, ans=0.07 2023-10-14 16:16:04,779 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.83 vs. limit=12.0 2023-10-14 16:16:06,868 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1747601.3333333333, ans=0.125 2023-10-14 16:16:45,951 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.50 vs. limit=6.0 2023-10-14 16:17:05,181 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1747834.6666666667, ans=0.125 2023-10-14 16:17:18,811 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.566e+02 1.829e+02 1.973e+02 2.183e+02 3.097e+02, threshold=3.945e+02, percent-clipped=0.0 2023-10-14 16:17:24,702 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.41 vs. 
limit=15.0 2023-10-14 16:17:45,070 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1748021.3333333333, ans=0.1 2023-10-14 16:17:46,073 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1748021.3333333333, ans=0.0 2023-10-14 16:18:12,548 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1748114.6666666667, ans=0.125 2023-10-14 16:18:24,065 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 16:18:40,922 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1748208.0, ans=0.0 2023-10-14 16:19:16,306 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.836e+02 1.937e+02 2.254e+02 2.887e+02, threshold=3.875e+02, percent-clipped=0.0 2023-10-14 16:19:25,427 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1748394.6666666667, ans=0.0 2023-10-14 16:19:46,605 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1748488.0, ans=0.125 2023-10-14 16:19:54,084 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1748488.0, ans=0.0 2023-10-14 16:19:57,137 INFO [train.py:1031] (0/4) Epoch 28, batch 6000, loss[loss=0.1749, simple_loss=0.2696, pruned_loss=0.04008, over 16863.00 frames. ], tot_loss[loss=0.1854, simple_loss=0.2774, pruned_loss=0.04671, over 31134180.09 frames. ], batch size: 87, lr: 1.22e-03, grad_scale: 32.0 2023-10-14 16:20:21,763 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1748628.0, ans=0.2 2023-10-14 16:20:27,400 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1748628.0, ans=0.125 2023-10-14 16:21:16,311 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.500e+02 1.904e+02 2.046e+02 2.284e+02 3.220e+02, threshold=4.093e+02, percent-clipped=0.0 2023-10-14 16:21:29,441 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1748861.3333333333, ans=0.125 2023-10-14 16:21:32,117 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1748908.0, ans=0.125 2023-10-14 16:21:41,769 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1748908.0, ans=0.1 2023-10-14 16:21:47,171 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1748954.6666666667, ans=0.125 2023-10-14 16:21:52,582 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.09 vs. limit=15.0 2023-10-14 16:21:53,341 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=1.766e-02 2023-10-14 16:21:57,639 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.40 vs. 
limit=15.0 2023-10-14 16:21:57,713 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.98 vs. limit=22.5 2023-10-14 16:22:03,356 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1749001.3333333333, ans=0.2 2023-10-14 16:22:03,438 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1749001.3333333333, ans=0.0 2023-10-14 16:22:32,290 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=1749141.3333333333, ans=0.2 2023-10-14 16:22:45,654 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1749188.0, ans=0.07 2023-10-14 16:22:51,504 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1749188.0, ans=0.0 2023-10-14 16:22:59,023 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1749234.6666666667, ans=0.0 2023-10-14 16:23:08,116 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1749281.3333333333, ans=0.125 2023-10-14 16:23:10,828 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.546e+02 1.888e+02 2.018e+02 2.306e+02 5.116e+02, threshold=4.036e+02, percent-clipped=1.0 2023-10-14 16:23:34,629 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1749374.6666666667, ans=0.125 2023-10-14 16:23:45,637 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1749421.3333333333, ans=0.2 2023-10-14 16:23:47,857 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.80 vs. limit=15.0 2023-10-14 16:24:35,696 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1749608.0, ans=0.0 2023-10-14 16:24:36,809 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1749608.0, ans=0.07 2023-10-14 16:24:37,734 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1749608.0, ans=0.0 2023-10-14 16:24:53,175 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1749654.6666666667, ans=0.125 2023-10-14 16:25:00,247 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.13 vs. 
limit=22.5 2023-10-14 16:25:01,644 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1749701.3333333333, ans=0.1 2023-10-14 16:25:11,719 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1749748.0, ans=0.125 2023-10-14 16:25:15,774 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.613e+02 1.890e+02 2.049e+02 2.209e+02 3.016e+02, threshold=4.097e+02, percent-clipped=0.0 2023-10-14 16:25:29,028 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1749794.6666666667, ans=0.1 2023-10-14 16:25:31,413 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1749841.3333333333, ans=0.125 2023-10-14 16:25:36,365 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1749841.3333333333, ans=0.0 2023-10-14 16:25:39,841 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1749841.3333333333, ans=0.125 2023-10-14 16:25:51,415 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.39 vs. limit=15.0 2023-10-14 16:25:58,974 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.05 vs. limit=15.0 2023-10-14 16:25:59,636 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1749934.6666666667, ans=0.2 2023-10-14 16:26:04,215 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.31 vs. limit=10.0 2023-10-14 16:26:34,185 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1750074.6666666667, ans=0.2 2023-10-14 16:26:37,883 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1750074.6666666667, ans=0.2 2023-10-14 16:26:40,952 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1750074.6666666667, ans=0.125 2023-10-14 16:26:50,669 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1750121.3333333333, ans=0.0 2023-10-14 16:26:53,948 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.64 vs. limit=6.0 2023-10-14 16:26:54,730 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1750121.3333333333, ans=0.125 2023-10-14 16:27:18,892 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1750214.6666666667, ans=0.1 2023-10-14 16:27:21,310 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.32 vs. 
limit=15.0 2023-10-14 16:27:23,130 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.656e+02 1.862e+02 2.008e+02 2.228e+02 3.432e+02, threshold=4.017e+02, percent-clipped=0.0 2023-10-14 16:27:33,793 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.96 vs. limit=15.0 2023-10-14 16:27:35,195 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.92 vs. limit=15.0 2023-10-14 16:27:37,988 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.80 vs. limit=15.0 2023-10-14 16:27:39,767 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1750261.3333333333, ans=0.0 2023-10-14 16:27:52,272 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff3.min_abs, batch_count=1750308.0, ans=0.2 2023-10-14 16:27:59,053 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=5.02 vs. limit=12.0 2023-10-14 16:27:59,972 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1750354.6666666667, ans=0.125 2023-10-14 16:28:05,959 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1750401.3333333333, ans=0.0 2023-10-14 16:28:15,485 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1750401.3333333333, ans=0.125 2023-10-14 16:28:33,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1750494.6666666667, ans=0.1 2023-10-14 16:28:42,481 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1750541.3333333333, ans=0.0 2023-10-14 16:28:53,070 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1750541.3333333333, ans=0.0 2023-10-14 16:28:58,708 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1750588.0, ans=0.125 2023-10-14 16:29:17,299 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=1750634.6666666667, ans=0.5 2023-10-14 16:29:26,365 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.809e+02 2.005e+02 2.346e+02 2.936e+02, threshold=4.010e+02, percent-clipped=0.0 2023-10-14 16:29:52,285 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1750774.6666666667, ans=0.2 2023-10-14 16:30:05,680 INFO [train.py:1031] (0/4) Epoch 28, batch 6500, loss[loss=0.1686, simple_loss=0.2596, pruned_loss=0.0388, over 16034.00 frames. ], tot_loss[loss=0.1857, simple_loss=0.2778, pruned_loss=0.0468, over 31488575.87 frames. 
], batch size: 43, lr: 1.22e-03, grad_scale: 32.0 2023-10-14 16:30:16,713 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1750868.0, ans=0.2 2023-10-14 16:30:18,789 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.19 vs. limit=15.0 2023-10-14 16:30:19,992 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1750914.6666666667, ans=0.1 2023-10-14 16:30:25,663 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1750914.6666666667, ans=0.0 2023-10-14 16:30:48,934 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=1751008.0, ans=0.5 2023-10-14 16:30:50,478 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1751008.0, ans=0.09899494936611666 2023-10-14 16:30:55,483 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1751008.0, ans=0.125 2023-10-14 16:30:55,859 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.34 vs. limit=22.5 2023-10-14 16:31:01,111 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1751008.0, ans=0.0 2023-10-14 16:31:11,571 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=3.67 vs. limit=15.0 2023-10-14 16:31:16,608 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1751101.3333333333, ans=0.2 2023-10-14 16:31:19,512 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1751101.3333333333, ans=0.0 2023-10-14 16:31:34,383 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.545e+02 1.894e+02 2.087e+02 2.340e+02 3.129e+02, threshold=4.173e+02, percent-clipped=0.0 2023-10-14 16:31:35,198 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.39 vs. limit=15.0 2023-10-14 16:31:36,212 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.65 vs. limit=15.0 2023-10-14 16:31:46,034 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.43 vs. limit=15.0 2023-10-14 16:31:59,029 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1751241.3333333333, ans=0.125 2023-10-14 16:32:06,763 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.68 vs. 
limit=6.0 2023-10-14 16:32:16,291 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1751334.6666666667, ans=0.125 2023-10-14 16:32:16,307 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1751334.6666666667, ans=0.2 2023-10-14 16:32:30,438 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.56 vs. limit=15.0 2023-10-14 16:32:31,092 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.74 vs. limit=6.0 2023-10-14 16:32:51,757 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.07 vs. limit=15.0 2023-10-14 16:33:07,328 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1751521.3333333333, ans=0.125 2023-10-14 16:33:17,356 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1751568.0, ans=0.125 2023-10-14 16:33:27,787 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.542e+02 1.840e+02 1.993e+02 2.191e+02 3.127e+02, threshold=3.986e+02, percent-clipped=0.0 2023-10-14 16:33:29,926 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1751661.3333333333, ans=0.0 2023-10-14 16:33:40,154 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1751661.3333333333, ans=0.125 2023-10-14 16:33:41,274 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1751708.0, ans=0.125 2023-10-14 16:34:04,293 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 16:34:22,504 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.59 vs. limit=15.0 2023-10-14 16:34:31,393 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.65 vs. limit=15.0 2023-10-14 16:34:35,040 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1751894.6666666667, ans=0.0 2023-10-14 16:34:46,212 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1751941.3333333333, ans=0.0 2023-10-14 16:34:49,020 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1751941.3333333333, ans=0.125 2023-10-14 16:34:58,470 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.08 vs. 
limit=12.0 2023-10-14 16:34:59,458 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=1751988.0, ans=0.5 2023-10-14 16:35:04,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1752034.6666666667, ans=0.0 2023-10-14 16:35:10,875 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1752034.6666666667, ans=0.2 2023-10-14 16:35:13,459 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1752034.6666666667, ans=0.05 2023-10-14 16:35:21,996 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1752081.3333333333, ans=0.125 2023-10-14 16:35:26,737 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.854e+02 2.053e+02 2.343e+02 3.091e+02, threshold=4.107e+02, percent-clipped=0.0 2023-10-14 16:35:28,972 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1752128.0, ans=0.125 2023-10-14 16:35:29,955 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1752128.0, ans=0.1 2023-10-14 16:35:33,587 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1752128.0, ans=0.125 2023-10-14 16:36:27,758 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 16:36:36,525 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1752314.6666666667, ans=0.1 2023-10-14 16:37:13,045 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1752454.6666666667, ans=0.2 2023-10-14 16:37:29,330 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.43 vs. limit=6.0 2023-10-14 16:37:41,006 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.335e+02 1.844e+02 1.978e+02 2.217e+02 2.975e+02, threshold=3.956e+02, percent-clipped=0.0 2023-10-14 16:38:01,600 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.73 vs. limit=22.5 2023-10-14 16:38:03,117 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1752641.3333333333, ans=0.0 2023-10-14 16:38:10,201 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.02 vs. 
limit=10.0 2023-10-14 16:38:52,954 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1752874.6666666667, ans=0.1 2023-10-14 16:39:19,461 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1752968.0, ans=0.2 2023-10-14 16:39:24,470 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1752968.0, ans=0.1 2023-10-14 16:39:34,114 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.600e+02 1.863e+02 2.034e+02 2.261e+02 2.701e+02, threshold=4.068e+02, percent-clipped=0.0 2023-10-14 16:39:44,098 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=1753061.3333333333, ans=0.025 2023-10-14 16:39:51,251 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1753108.0, ans=0.025 2023-10-14 16:40:06,856 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1753154.6666666667, ans=0.1 2023-10-14 16:40:09,774 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.44 vs. limit=12.0 2023-10-14 16:40:12,165 INFO [train.py:1031] (0/4) Epoch 28, batch 7000, loss[loss=0.1961, simple_loss=0.2938, pruned_loss=0.04921, over 16785.00 frames. ], tot_loss[loss=0.186, simple_loss=0.2784, pruned_loss=0.0468, over 31797645.29 frames. ], batch size: 188, lr: 1.22e-03, grad_scale: 32.0 2023-10-14 16:40:41,515 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=11.85 vs. limit=22.5 2023-10-14 16:40:50,709 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1753341.3333333333, ans=0.2 2023-10-14 16:40:54,086 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=1753341.3333333333, ans=0.05 2023-10-14 16:40:55,283 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1753341.3333333333, ans=0.07 2023-10-14 16:40:55,656 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.35 vs. limit=15.0 2023-10-14 16:41:00,502 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=9.19 vs. limit=22.5 2023-10-14 16:41:08,592 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1753388.0, ans=0.125 2023-10-14 16:41:37,138 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.547e+02 1.900e+02 2.057e+02 2.255e+02 3.096e+02, threshold=4.115e+02, percent-clipped=0.0 2023-10-14 16:41:40,389 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.04 vs. 
limit=15.0 2023-10-14 16:42:10,682 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1753621.3333333333, ans=0.125 2023-10-14 16:42:19,591 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1753668.0, ans=0.0 2023-10-14 16:42:19,696 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1753668.0, ans=0.0 2023-10-14 16:42:35,842 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1753761.3333333333, ans=0.125 2023-10-14 16:42:48,282 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1753808.0, ans=0.125 2023-10-14 16:42:51,324 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1753808.0, ans=0.0 2023-10-14 16:43:02,215 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1753854.6666666667, ans=0.125 2023-10-14 16:43:04,677 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.01 vs. limit=10.0 2023-10-14 16:43:15,485 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1753901.3333333333, ans=0.125 2023-10-14 16:43:31,382 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.529e+02 1.884e+02 2.087e+02 2.579e+02 3.779e+02, threshold=4.173e+02, percent-clipped=0.0 2023-10-14 16:43:46,181 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 16:43:50,985 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.21 vs. limit=15.0 2023-10-14 16:43:56,653 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.29 vs. limit=15.0 2023-10-14 16:44:02,567 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1754088.0, ans=0.2 2023-10-14 16:44:07,726 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.32 vs. limit=22.5 2023-10-14 16:44:09,405 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1754134.6666666667, ans=0.2 2023-10-14 16:44:15,947 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.63 vs. 
limit=15.0 2023-10-14 16:44:35,492 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1754181.3333333333, ans=0.125 2023-10-14 16:44:35,506 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1754181.3333333333, ans=0.125 2023-10-14 16:45:32,211 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1754368.0, ans=0.125 2023-10-14 16:45:33,672 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.98 vs. limit=15.0 2023-10-14 16:45:39,374 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 16:45:42,100 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.494e+02 1.831e+02 2.094e+02 2.306e+02 3.767e+02, threshold=4.188e+02, percent-clipped=0.0 2023-10-14 16:45:50,415 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1754461.3333333333, ans=0.2 2023-10-14 16:45:53,551 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.88 vs. limit=15.0 2023-10-14 16:46:38,025 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-376000.pt 2023-10-14 16:47:02,519 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1754694.6666666667, ans=0.0 2023-10-14 16:47:08,162 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1754741.3333333333, ans=0.0 2023-10-14 16:47:39,758 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1754834.6666666667, ans=0.1 2023-10-14 16:47:43,104 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.06 vs. limit=15.0 2023-10-14 16:47:49,996 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.483e+02 1.820e+02 2.004e+02 2.132e+02 3.572e+02, threshold=4.008e+02, percent-clipped=0.0 2023-10-14 16:48:04,916 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1754974.6666666667, ans=0.2 2023-10-14 16:48:17,887 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.60 vs. limit=15.0 2023-10-14 16:48:32,069 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1755068.0, ans=0.0 2023-10-14 16:48:40,685 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1755114.6666666667, ans=0.1 2023-10-14 16:48:46,337 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.49 vs. 
limit=22.5 2023-10-14 16:49:26,428 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1755301.3333333333, ans=0.125 2023-10-14 16:49:38,707 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1755348.0, ans=0.125 2023-10-14 16:49:42,965 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.534e+02 1.874e+02 2.061e+02 2.316e+02 3.225e+02, threshold=4.121e+02, percent-clipped=0.0 2023-10-14 16:49:49,452 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1755394.6666666667, ans=0.2 2023-10-14 16:49:54,806 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1755394.6666666667, ans=0.1 2023-10-14 16:50:22,044 INFO [train.py:1031] (0/4) Epoch 28, batch 7500, loss[loss=0.1846, simple_loss=0.2803, pruned_loss=0.04446, over 16507.00 frames. ], tot_loss[loss=0.1859, simple_loss=0.2783, pruned_loss=0.04678, over 32014649.96 frames. ], batch size: 241, lr: 1.22e-03, grad_scale: 32.0 2023-10-14 16:50:32,518 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1755581.3333333333, ans=0.125 2023-10-14 16:50:46,519 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1755628.0, ans=0.0 2023-10-14 16:51:28,707 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1755768.0, ans=0.0 2023-10-14 16:51:35,256 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1755814.6666666667, ans=0.125 2023-10-14 16:51:37,481 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=5.00 vs. limit=15.0 2023-10-14 16:51:40,573 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.74 vs. limit=15.0 2023-10-14 16:51:40,964 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.665e+02 1.885e+02 2.047e+02 2.279e+02 3.374e+02, threshold=4.093e+02, percent-clipped=0.0 2023-10-14 16:51:45,301 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1755861.3333333333, ans=0.0 2023-10-14 16:51:47,329 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 16:51:49,615 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1755861.3333333333, ans=0.0 2023-10-14 16:51:53,238 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 16:51:54,298 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1755908.0, ans=0.1 2023-10-14 16:52:09,058 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1755954.6666666667, ans=0.05 2023-10-14 16:52:18,181 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.55 vs. 
limit=8.0 2023-10-14 16:52:25,199 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.45 vs. limit=22.5 2023-10-14 16:52:26,238 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.77 vs. limit=10.0 2023-10-14 16:52:35,971 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1756048.0, ans=0.1 2023-10-14 16:53:29,389 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1756234.6666666667, ans=0.125 2023-10-14 16:53:46,467 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.548e+02 1.792e+02 1.957e+02 2.147e+02 3.009e+02, threshold=3.913e+02, percent-clipped=0.0 2023-10-14 16:53:57,107 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1756328.0, ans=0.0 2023-10-14 16:54:09,376 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1756374.6666666667, ans=10.0 2023-10-14 16:54:15,087 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 16:54:29,019 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 16:54:30,130 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1756468.0, ans=0.1 2023-10-14 16:54:38,728 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1756514.6666666667, ans=0.125 2023-10-14 16:54:56,284 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1756561.3333333333, ans=0.0 2023-10-14 16:55:15,076 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.60 vs. limit=15.0 2023-10-14 16:55:30,946 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.86 vs. limit=15.0 2023-10-14 16:55:40,774 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1756748.0, ans=0.125 2023-10-14 16:55:40,863 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1756748.0, ans=0.125 2023-10-14 16:55:43,724 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.576e+02 1.793e+02 2.025e+02 2.489e+02 3.348e+02, threshold=4.050e+02, percent-clipped=0.0 2023-10-14 16:55:47,826 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1756794.6666666667, ans=0.1 2023-10-14 16:55:47,833 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1756794.6666666667, ans=0.0 2023-10-14 16:55:53,314 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.30 vs. 
limit=15.0 2023-10-14 16:55:57,027 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1756794.6666666667, ans=0.1 2023-10-14 16:56:06,895 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1756841.3333333333, ans=0.0 2023-10-14 16:56:08,012 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1756841.3333333333, ans=0.125 2023-10-14 16:56:09,140 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1756888.0, ans=0.0 2023-10-14 16:56:16,886 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1756888.0, ans=0.125 2023-10-14 16:56:32,827 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.34 vs. limit=12.0 2023-10-14 16:56:51,943 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1757028.0, ans=0.1 2023-10-14 16:57:17,544 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1757121.3333333333, ans=0.2 2023-10-14 16:57:23,759 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1757121.3333333333, ans=0.0 2023-10-14 16:57:33,806 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1757168.0, ans=0.04949747468305833 2023-10-14 16:57:45,616 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1757214.6666666667, ans=0.0 2023-10-14 16:57:47,277 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.635e+02 1.908e+02 2.113e+02 2.337e+02 3.604e+02, threshold=4.226e+02, percent-clipped=0.0 2023-10-14 16:57:47,605 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1757214.6666666667, ans=0.125 2023-10-14 16:57:51,596 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.44 vs. limit=15.0 2023-10-14 16:58:23,847 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1757401.3333333333, ans=0.125 2023-10-14 16:58:33,474 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.62 vs. 
limit=15.0 2023-10-14 16:58:37,214 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1757448.0, ans=0.125 2023-10-14 16:58:50,418 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1757494.6666666667, ans=0.04949747468305833 2023-10-14 16:58:59,609 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1757494.6666666667, ans=0.1 2023-10-14 16:59:31,231 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1757634.6666666667, ans=0.125 2023-10-14 16:59:52,695 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1757681.3333333333, ans=0.1 2023-10-14 16:59:54,651 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.541e+02 1.725e+02 1.890e+02 2.105e+02 2.843e+02, threshold=3.781e+02, percent-clipped=0.0 2023-10-14 17:00:15,650 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1757774.6666666667, ans=0.125 2023-10-14 17:00:30,489 INFO [train.py:1031] (0/4) Epoch 28, batch 8000, loss[loss=0.1766, simple_loss=0.2751, pruned_loss=0.0391, over 16075.00 frames. ], tot_loss[loss=0.1852, simple_loss=0.2778, pruned_loss=0.04632, over 32199329.86 frames. ], batch size: 43, lr: 1.22e-03, grad_scale: 32.0 2023-10-14 17:01:00,111 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.29 vs. limit=15.0 2023-10-14 17:01:04,057 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.33 vs. limit=15.0 2023-10-14 17:01:41,138 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1758148.0, ans=0.125 2023-10-14 17:01:44,636 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.471e+02 1.834e+02 2.033e+02 2.233e+02 2.884e+02, threshold=4.067e+02, percent-clipped=0.0 2023-10-14 17:01:59,730 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1758241.3333333333, ans=0.1 2023-10-14 17:02:32,053 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1758381.3333333333, ans=0.0 2023-10-14 17:02:44,159 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1758428.0, ans=0.125 2023-10-14 17:02:58,710 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1758474.6666666667, ans=0.125 2023-10-14 17:02:59,874 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=12.0 2023-10-14 17:03:25,216 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.12 vs. 
limit=15.0 2023-10-14 17:03:32,084 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1758568.0, ans=0.0 2023-10-14 17:03:33,680 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1758568.0, ans=0.0 2023-10-14 17:03:34,246 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.78 vs. limit=15.0 2023-10-14 17:03:44,136 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.06 vs. limit=15.0 2023-10-14 17:03:48,514 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.85 vs. limit=15.0 2023-10-14 17:03:49,815 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1758614.6666666667, ans=0.1 2023-10-14 17:03:49,990 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.21 vs. limit=22.5 2023-10-14 17:03:54,743 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.584e+02 1.837e+02 1.973e+02 2.212e+02 2.905e+02, threshold=3.945e+02, percent-clipped=0.0 2023-10-14 17:04:12,148 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.15 vs. limit=15.0 2023-10-14 17:04:16,446 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1758708.0, ans=0.125 2023-10-14 17:04:22,676 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1758754.6666666667, ans=0.0 2023-10-14 17:04:23,071 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.45 vs. limit=6.0 2023-10-14 17:04:24,902 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1758754.6666666667, ans=0.2 2023-10-14 17:04:30,722 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1758754.6666666667, ans=0.125 2023-10-14 17:04:50,464 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.04 vs. 
limit=15.0 2023-10-14 17:04:52,061 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1758848.0, ans=0.2 2023-10-14 17:05:07,316 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1758894.6666666667, ans=0.1 2023-10-14 17:05:20,812 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1758941.3333333333, ans=0.125 2023-10-14 17:05:55,180 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.533e+02 1.800e+02 1.977e+02 2.156e+02 2.711e+02, threshold=3.954e+02, percent-clipped=0.0 2023-10-14 17:06:16,167 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1759174.6666666667, ans=0.125 2023-10-14 17:06:40,135 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1759268.0, ans=0.125 2023-10-14 17:07:07,160 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1759361.3333333333, ans=0.125 2023-10-14 17:07:49,962 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.786e+02 1.948e+02 2.121e+02 3.618e+02, threshold=3.896e+02, percent-clipped=0.0 2023-10-14 17:07:50,249 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1759548.0, ans=0.0 2023-10-14 17:07:50,295 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1759548.0, ans=0.0 2023-10-14 17:08:09,483 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1759641.3333333333, ans=0.0 2023-10-14 17:08:20,546 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 17:08:27,459 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1759734.6666666667, ans=0.1 2023-10-14 17:08:31,707 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 17:08:40,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_ff3.min_abs, batch_count=1759781.3333333333, ans=0.2 2023-10-14 17:09:13,903 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1759874.6666666667, ans=0.1 2023-10-14 17:09:20,314 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.86 vs. 
limit=15.0 2023-10-14 17:09:46,510 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1760014.6666666667, ans=0.125 2023-10-14 17:09:48,217 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1760014.6666666667, ans=0.125 2023-10-14 17:09:52,021 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.867e+02 2.014e+02 2.166e+02 3.659e+02, threshold=4.029e+02, percent-clipped=0.0 2023-10-14 17:10:08,665 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1760108.0, ans=0.125 2023-10-14 17:10:10,520 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1760108.0, ans=0.035 2023-10-14 17:10:10,625 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1760108.0, ans=0.1 2023-10-14 17:10:14,331 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1760108.0, ans=0.1 2023-10-14 17:10:31,106 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1760154.6666666667, ans=0.0 2023-10-14 17:10:33,687 INFO [train.py:1031] (0/4) Epoch 28, batch 8500, loss[loss=0.1798, simple_loss=0.2764, pruned_loss=0.04155, over 16903.00 frames. ], tot_loss[loss=0.1853, simple_loss=0.2782, pruned_loss=0.04622, over 32378161.58 frames. ], batch size: 77, lr: 1.22e-03, grad_scale: 32.0 2023-10-14 17:10:37,681 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1760201.3333333333, ans=0.125 2023-10-14 17:10:38,554 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1760201.3333333333, ans=0.1 2023-10-14 17:10:42,442 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1760201.3333333333, ans=0.125 2023-10-14 17:11:01,387 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.06 vs. limit=15.0 2023-10-14 17:11:17,779 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1760341.3333333333, ans=0.1 2023-10-14 17:11:19,823 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1760388.0, ans=0.125 2023-10-14 17:11:25,780 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.25 vs. 
limit=8.0 2023-10-14 17:11:31,060 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1760434.6666666667, ans=15.0 2023-10-14 17:11:33,421 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1760434.6666666667, ans=0.125 2023-10-14 17:11:53,228 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.592e+02 1.920e+02 2.089e+02 2.261e+02 3.381e+02, threshold=4.178e+02, percent-clipped=0.0 2023-10-14 17:12:03,489 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1760528.0, ans=0.1 2023-10-14 17:12:06,364 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.25 vs. limit=15.0 2023-10-14 17:12:07,208 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 17:12:08,988 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 17:12:26,963 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1760621.3333333333, ans=0.125 2023-10-14 17:12:29,935 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1760621.3333333333, ans=0.0 2023-10-14 17:12:48,821 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1760668.0, ans=0.05 2023-10-14 17:13:29,509 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1760854.6666666667, ans=0.1 2023-10-14 17:13:38,289 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1760901.3333333333, ans=0.125 2023-10-14 17:13:59,400 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.440e+02 1.785e+02 1.907e+02 2.084e+02 3.010e+02, threshold=3.813e+02, percent-clipped=0.0 2023-10-14 17:14:05,122 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1760994.6666666667, ans=0.125 2023-10-14 17:14:12,672 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1760994.6666666667, ans=0.125 2023-10-14 17:14:48,175 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.43 vs. limit=6.0 2023-10-14 17:15:01,309 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.00 vs. 
limit=15.0 2023-10-14 17:15:02,402 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1761181.3333333333, ans=0.125 2023-10-14 17:15:07,400 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1761228.0, ans=0.0 2023-10-14 17:15:12,873 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1761228.0, ans=0.125 2023-10-14 17:15:23,305 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=7.46 vs. limit=22.5 2023-10-14 17:15:45,442 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.07 vs. limit=12.0 2023-10-14 17:16:06,873 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1761414.6666666667, ans=0.2 2023-10-14 17:16:08,396 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1761414.6666666667, ans=0.125 2023-10-14 17:16:08,459 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1761414.6666666667, ans=0.2 2023-10-14 17:16:10,976 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1761414.6666666667, ans=0.1 2023-10-14 17:16:14,142 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.775e+02 1.987e+02 2.278e+02 3.138e+02, threshold=3.975e+02, percent-clipped=0.0 2023-10-14 17:16:26,273 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1761508.0, ans=0.1 2023-10-14 17:16:26,297 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1761508.0, ans=0.125 2023-10-14 17:16:44,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1761554.6666666667, ans=0.2 2023-10-14 17:16:55,406 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1761601.3333333333, ans=0.0 2023-10-14 17:16:58,790 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1761601.3333333333, ans=0.125 2023-10-14 17:17:01,720 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. limit=6.0 2023-10-14 17:17:10,413 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1761648.0, ans=0.125 2023-10-14 17:17:23,056 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1761694.6666666667, ans=0.125 2023-10-14 17:17:35,753 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.68 vs. 
limit=22.5 2023-10-14 17:17:47,947 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1761834.6666666667, ans=0.1 2023-10-14 17:18:01,785 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.73 vs. limit=15.0 2023-10-14 17:18:05,475 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1761881.3333333333, ans=0.1 2023-10-14 17:18:06,554 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=1761881.3333333333, ans=0.02 2023-10-14 17:18:08,926 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.545e+02 1.787e+02 1.927e+02 2.145e+02 3.324e+02, threshold=3.854e+02, percent-clipped=0.0 2023-10-14 17:18:16,950 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1761928.0, ans=0.125 2023-10-14 17:18:45,972 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1762068.0, ans=0.0 2023-10-14 17:19:01,669 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.75 vs. limit=5.0 2023-10-14 17:19:28,510 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1762208.0, ans=0.1 2023-10-14 17:19:44,944 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 17:19:53,234 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 17:19:57,856 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.21 vs. limit=15.0 2023-10-14 17:19:58,823 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1762348.0, ans=0.125 2023-10-14 17:20:02,654 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.619e+02 1.823e+02 2.044e+02 2.248e+02 3.008e+02, threshold=4.088e+02, percent-clipped=0.0 2023-10-14 17:20:03,286 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-10-14 17:20:12,384 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.80 vs. limit=15.0 2023-10-14 17:20:34,113 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1762488.0, ans=0.09899494936611666 2023-10-14 17:20:36,230 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1762488.0, ans=0.0 2023-10-14 17:20:39,427 INFO [train.py:1031] (0/4) Epoch 28, batch 9000, loss[loss=0.1864, simple_loss=0.2889, pruned_loss=0.04195, over 16869.00 frames. ], tot_loss[loss=0.1848, simple_loss=0.2776, pruned_loss=0.046, over 32499295.83 frames. 
], batch size: 87, lr: 1.22e-03, grad_scale: 32.0 2023-10-14 17:20:39,772 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1762534.6666666667, ans=0.1 2023-10-14 17:20:47,445 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1762534.6666666667, ans=0.125 2023-10-14 17:21:40,368 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1762768.0, ans=0.2 2023-10-14 17:21:46,192 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1762814.6666666667, ans=0.125 2023-10-14 17:21:48,088 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1762814.6666666667, ans=0.0 2023-10-14 17:21:49,547 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.49 vs. limit=15.0 2023-10-14 17:21:56,935 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.436e+02 1.774e+02 1.875e+02 2.108e+02 2.976e+02, threshold=3.749e+02, percent-clipped=0.0 2023-10-14 17:22:11,054 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1762908.0, ans=0.0 2023-10-14 17:22:17,807 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1762908.0, ans=0.125 2023-10-14 17:22:44,347 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1763048.0, ans=0.0 2023-10-14 17:23:05,791 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.86 vs. limit=15.0 2023-10-14 17:23:17,498 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1763188.0, ans=0.0 2023-10-14 17:23:25,335 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1763188.0, ans=0.125 2023-10-14 17:23:27,209 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1763234.6666666667, ans=0.5 2023-10-14 17:23:30,286 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.38 vs. limit=15.0 2023-10-14 17:23:46,315 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1763281.3333333333, ans=0.1 2023-10-14 17:23:48,681 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.553e+02 1.844e+02 1.987e+02 2.202e+02 2.868e+02, threshold=3.975e+02, percent-clipped=0.0 2023-10-14 17:23:52,258 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.01 vs. 
limit=15.0 2023-10-14 17:23:55,769 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1763328.0, ans=0.0 2023-10-14 17:23:58,077 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1763328.0, ans=0.0 2023-10-14 17:24:55,015 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1763608.0, ans=0.125 2023-10-14 17:25:07,573 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=11.89 vs. limit=22.5 2023-10-14 17:25:35,149 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.47 vs. limit=22.5 2023-10-14 17:25:38,924 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.546e+02 1.901e+02 2.045e+02 2.291e+02 2.824e+02, threshold=4.090e+02, percent-clipped=0.0 2023-10-14 17:25:39,373 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1763794.6666666667, ans=0.0 2023-10-14 17:25:45,882 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1763794.6666666667, ans=0.0 2023-10-14 17:25:48,858 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1763794.6666666667, ans=0.0 2023-10-14 17:25:51,399 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1763841.3333333333, ans=0.125 2023-10-14 17:26:14,561 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=4.46 vs. 
limit=15.0 2023-10-14 17:26:27,156 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1763981.3333333333, ans=0.1 2023-10-14 17:26:31,387 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1763981.3333333333, ans=15.0 2023-10-14 17:26:58,563 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1764074.6666666667, ans=0.125 2023-10-14 17:27:09,070 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1764121.3333333333, ans=0.125 2023-10-14 17:27:11,099 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1764121.3333333333, ans=0.125 2023-10-14 17:27:11,955 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1764168.0, ans=0.0 2023-10-14 17:27:12,651 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff2.min_abs, batch_count=1764168.0, ans=0.1 2023-10-14 17:27:15,139 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1764168.0, ans=0.05 2023-10-14 17:27:18,859 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1764168.0, ans=0.05 2023-10-14 17:27:35,734 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1764214.6666666667, ans=0.125 2023-10-14 17:27:39,282 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.654e+02 1.893e+02 2.064e+02 2.271e+02 2.785e+02, threshold=4.128e+02, percent-clipped=0.0 2023-10-14 17:27:43,153 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1764261.3333333333, ans=0.0 2023-10-14 17:28:00,215 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.26 vs. limit=6.0 2023-10-14 17:28:08,166 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 17:28:15,746 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.26 vs. 
limit=15.0 2023-10-14 17:28:20,127 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1764401.3333333333, ans=0.2 2023-10-14 17:28:30,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1764448.0, ans=0.125 2023-10-14 17:28:39,323 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1764448.0, ans=0.0 2023-10-14 17:28:46,716 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1764494.6666666667, ans=0.125 2023-10-14 17:29:02,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1764541.3333333333, ans=0.125 2023-10-14 17:29:02,526 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1764541.3333333333, ans=0.125 2023-10-14 17:29:02,591 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1764541.3333333333, ans=0.0 2023-10-14 17:29:13,684 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1764588.0, ans=0.125 2023-10-14 17:29:28,740 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1764634.6666666667, ans=0.2 2023-10-14 17:29:38,906 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1764681.3333333333, ans=0.025 2023-10-14 17:29:38,975 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.531e-03 2023-10-14 17:29:44,913 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.605e+02 1.823e+02 1.983e+02 2.205e+02 3.512e+02, threshold=3.965e+02, percent-clipped=0.0 2023-10-14 17:30:06,833 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=6.66 vs. limit=15.0 2023-10-14 17:30:23,024 INFO [train.py:1031] (0/4) Epoch 28, batch 9500, loss[loss=0.1856, simple_loss=0.2783, pruned_loss=0.0464, over 16039.00 frames. ], tot_loss[loss=0.1856, simple_loss=0.2785, pruned_loss=0.04633, over 32583978.20 frames. ], batch size: 296, lr: 1.22e-03, grad_scale: 16.0 2023-10-14 17:30:35,428 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.38 vs. 
limit=15.0 2023-10-14 17:30:57,893 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1765008.0, ans=0.0 2023-10-14 17:31:29,525 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1765101.3333333333, ans=0.2 2023-10-14 17:31:38,889 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1765148.0, ans=0.05 2023-10-14 17:31:40,748 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1765148.0, ans=0.125 2023-10-14 17:31:40,874 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1765148.0, ans=0.0 2023-10-14 17:31:42,461 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1765194.6666666667, ans=0.0 2023-10-14 17:31:43,702 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.559e+02 1.868e+02 2.072e+02 2.289e+02 2.956e+02, threshold=4.143e+02, percent-clipped=0.0 2023-10-14 17:31:52,449 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1765194.6666666667, ans=0.0 2023-10-14 17:32:00,259 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.38 vs. limit=15.0 2023-10-14 17:32:01,133 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1765241.3333333333, ans=0.125 2023-10-14 17:32:15,312 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.92 vs. limit=6.0 2023-10-14 17:32:34,901 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1765381.3333333333, ans=0.125 2023-10-14 17:32:59,092 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1765474.6666666667, ans=0.2 2023-10-14 17:33:42,836 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=1765614.6666666667, ans=0.5 2023-10-14 17:33:48,949 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.454e+02 1.827e+02 1.965e+02 2.176e+02 3.278e+02, threshold=3.930e+02, percent-clipped=0.0 2023-10-14 17:33:50,389 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1765661.3333333333, ans=0.0 2023-10-14 17:33:52,327 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.81 vs. limit=6.0 2023-10-14 17:33:57,425 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1765661.3333333333, ans=0.125 2023-10-14 17:34:03,769 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.97 vs. limit=22.5 2023-10-14 17:34:14,924 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.57 vs. 
limit=15.0 2023-10-14 17:34:23,286 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.38 vs. limit=15.0 2023-10-14 17:34:23,314 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.79 vs. limit=6.0 2023-10-14 17:34:31,624 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1765801.3333333333, ans=0.125 2023-10-14 17:34:51,636 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1765894.6666666667, ans=0.1 2023-10-14 17:35:04,260 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1765941.3333333333, ans=0.1 2023-10-14 17:35:04,601 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=4.89 vs. limit=12.0 2023-10-14 17:35:16,386 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.82 vs. limit=10.0 2023-10-14 17:35:20,206 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1766034.6666666667, ans=0.0 2023-10-14 17:35:23,239 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.88 vs. limit=22.5 2023-10-14 17:35:35,763 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.52 vs. limit=22.5 2023-10-14 17:35:46,608 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.608e+02 1.876e+02 2.076e+02 2.280e+02 3.008e+02, threshold=4.152e+02, percent-clipped=0.0 2023-10-14 17:35:49,930 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.22 vs. limit=15.0 2023-10-14 17:35:51,044 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1766128.0, ans=0.125 2023-10-14 17:36:17,407 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1766221.3333333333, ans=0.0 2023-10-14 17:36:35,992 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.73 vs. limit=15.0 2023-10-14 17:36:41,357 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 17:36:43,664 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1766314.6666666667, ans=0.0 2023-10-14 17:36:46,072 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.71 vs. 
limit=22.5 2023-10-14 17:36:58,665 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1766361.3333333333, ans=0.05 2023-10-14 17:37:08,239 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1766408.0, ans=0.1 2023-10-14 17:37:09,865 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten.whitening_limit, batch_count=1766408.0, ans=15.0 2023-10-14 17:37:17,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1766454.6666666667, ans=0.125 2023-10-14 17:37:21,416 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1766454.6666666667, ans=0.0 2023-10-14 17:37:28,577 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.70 vs. limit=6.0 2023-10-14 17:37:32,392 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.13 vs. limit=15.0 2023-10-14 17:37:55,185 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.576e+02 1.814e+02 1.929e+02 2.085e+02 3.132e+02, threshold=3.857e+02, percent-clipped=0.0 2023-10-14 17:37:56,959 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1766594.6666666667, ans=0.125 2023-10-14 17:38:01,688 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.46 vs. limit=15.0 2023-10-14 17:38:02,309 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1766594.6666666667, ans=0.125 2023-10-14 17:38:06,979 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1766641.3333333333, ans=0.125 2023-10-14 17:38:07,032 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1766641.3333333333, ans=0.125 2023-10-14 17:38:37,370 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1766734.6666666667, ans=0.125 2023-10-14 17:38:46,982 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.74 vs. limit=15.0 2023-10-14 17:38:47,047 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.15 vs. limit=22.5 2023-10-14 17:39:05,313 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1766874.6666666667, ans=0.0 2023-10-14 17:39:08,658 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.26 vs. 
limit=15.0 2023-10-14 17:39:19,310 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1766921.3333333333, ans=0.2 2023-10-14 17:39:34,134 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1766968.0, ans=0.125 2023-10-14 17:39:48,445 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.519e+02 1.786e+02 1.935e+02 2.199e+02 2.905e+02, threshold=3.870e+02, percent-clipped=0.0 2023-10-14 17:39:50,316 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1767061.3333333333, ans=0.0 2023-10-14 17:39:53,473 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1767061.3333333333, ans=0.2 2023-10-14 17:40:21,757 INFO [train.py:1031] (0/4) Epoch 28, batch 10000, loss[loss=0.1997, simple_loss=0.2895, pruned_loss=0.05496, over 16634.00 frames. ], tot_loss[loss=0.1849, simple_loss=0.2776, pruned_loss=0.04613, over 32621179.63 frames. ], batch size: 220, lr: 1.22e-03, grad_scale: 16.0 2023-10-14 17:40:35,214 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.17 vs. limit=15.0 2023-10-14 17:40:36,939 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1767248.0, ans=0.1 2023-10-14 17:41:02,499 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=1767341.3333333333, ans=0.05 2023-10-14 17:41:03,665 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1767341.3333333333, ans=0.125 2023-10-14 17:41:14,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1767388.0, ans=0.125 2023-10-14 17:41:16,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1767388.0, ans=0.125 2023-10-14 17:41:17,210 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.65 vs. limit=10.0 2023-10-14 17:41:35,977 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1767481.3333333333, ans=0.09899494936611666 2023-10-14 17:41:44,053 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.568e+02 1.874e+02 2.049e+02 2.321e+02 2.850e+02, threshold=4.099e+02, percent-clipped=0.0 2023-10-14 17:41:50,535 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.79 vs. 
limit=15.0 2023-10-14 17:41:52,043 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1767528.0, ans=0.125 2023-10-14 17:41:57,367 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1767574.6666666667, ans=0.125 2023-10-14 17:41:57,493 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1767574.6666666667, ans=0.125 2023-10-14 17:42:05,521 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.64 vs. limit=22.5 2023-10-14 17:42:09,665 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1767621.3333333333, ans=0.125 2023-10-14 17:43:03,489 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1767808.0, ans=0.125 2023-10-14 17:43:03,512 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1767808.0, ans=0.0 2023-10-14 17:43:30,143 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1767901.3333333333, ans=0.125 2023-10-14 17:43:37,961 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1767948.0, ans=0.1 2023-10-14 17:43:45,802 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1767948.0, ans=0.04949747468305833 2023-10-14 17:43:46,815 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1767994.6666666667, ans=0.0 2023-10-14 17:43:50,219 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.541e+02 1.870e+02 2.097e+02 2.339e+02 3.225e+02, threshold=4.193e+02, percent-clipped=0.0 2023-10-14 17:44:25,500 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1768134.6666666667, ans=10.0 2023-10-14 17:44:50,904 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1768181.3333333333, ans=0.0 2023-10-14 17:44:59,693 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1768228.0, ans=0.0 2023-10-14 17:45:00,688 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1768228.0, ans=0.125 2023-10-14 17:45:13,390 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.04 vs. 
limit=15.0 2023-10-14 17:45:58,253 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.615e+02 1.880e+02 2.025e+02 2.235e+02 3.706e+02, threshold=4.049e+02, percent-clipped=0.0 2023-10-14 17:46:37,667 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1768601.3333333333, ans=0.0 2023-10-14 17:46:39,327 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1768601.3333333333, ans=0.2 2023-10-14 17:46:50,098 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1768648.0, ans=0.5 2023-10-14 17:46:51,086 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1768648.0, ans=0.0 2023-10-14 17:46:51,976 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1768648.0, ans=0.1 2023-10-14 17:47:02,947 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1768694.6666666667, ans=0.0 2023-10-14 17:47:04,845 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1768694.6666666667, ans=0.125 2023-10-14 17:47:08,712 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1768741.3333333333, ans=0.0 2023-10-14 17:47:42,970 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1768834.6666666667, ans=0.125 2023-10-14 17:47:46,803 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1768834.6666666667, ans=0.0 2023-10-14 17:47:49,285 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.52 vs. 
limit=15.0 2023-10-14 17:47:52,811 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1768881.3333333333, ans=0.1 2023-10-14 17:48:02,639 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 1.783e+02 1.969e+02 2.150e+02 2.986e+02, threshold=3.939e+02, percent-clipped=0.0 2023-10-14 17:48:37,065 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1769068.0, ans=0.125 2023-10-14 17:48:42,107 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1769068.0, ans=0.0 2023-10-14 17:48:51,934 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1769114.6666666667, ans=0.125 2023-10-14 17:48:58,699 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1769114.6666666667, ans=0.07 2023-10-14 17:49:42,279 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1769301.3333333333, ans=0.0 2023-10-14 17:49:47,775 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1769301.3333333333, ans=0.05 2023-10-14 17:50:02,707 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.67 vs. limit=15.0 2023-10-14 17:50:10,059 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.393e+02 1.860e+02 2.015e+02 2.256e+02 3.187e+02, threshold=4.030e+02, percent-clipped=0.0 2023-10-14 17:50:17,356 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1769394.6666666667, ans=0.1 2023-10-14 17:50:19,405 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1769441.3333333333, ans=0.0 2023-10-14 17:50:43,848 INFO [train.py:1031] (0/4) Epoch 28, batch 10500, loss[loss=0.1756, simple_loss=0.2695, pruned_loss=0.04088, over 16935.00 frames. ], tot_loss[loss=0.1854, simple_loss=0.2782, pruned_loss=0.04631, over 32676542.82 frames. 
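NOTE: the recurring optim.py lines ("Clipping_scale=2.0, grad-norm quartiles ... threshold=..., percent-clipped=...") summarize the distribution of recent gradient norms. The five printed values read as min/25%/median/75%/max, and in every such line the threshold equals Clipping_scale times the median (e.g. 2.0 * 2.064e+02 = 4.128e+02 in the 17:27:39 entry); percent-clipped is the fraction of recent batches whose norm exceeded that threshold. A stand-alone sketch of this logic follows; the window size and reporting cadence are assumptions, not icefall's actual implementation.

    # Sketch: median-based gradient-norm clipping with quartile logging.
    from collections import deque

    import numpy as np

    CLIPPING_SCALE = 2.0

    class GradNormClipper:
        def __init__(self, window: int = 200) -> None:
            self.norms = deque(maxlen=window)  # recent per-batch grad norms
            self.num_clipped = 0
            self.num_seen = 0

        def clip_factor(self, grad_norm: float) -> float:
            """Return the factor (<= 1.0) to scale this batch's gradients by."""
            self.norms.append(grad_norm)
            self.num_seen += 1
            median = float(np.percentile(self.norms, 50))
            threshold = CLIPPING_SCALE * median
            if grad_norm > threshold:
                self.num_clipped += 1
                return threshold / grad_norm
            return 1.0

        def report(self) -> str:
            q = np.percentile(self.norms, [0, 25, 50, 75, 100])
            pct = 100.0 * self.num_clipped / max(self.num_seen, 1)
            return ("grad-norm quartiles "
                    + " ".join(f"{v:.3e}" for v in q)
                    + f", threshold={CLIPPING_SCALE * q[2]:.3e}"
                    + f", percent-clipped={pct:.1f}")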
], batch size: 72, lr: 1.22e-03, grad_scale: 32.0 2023-10-14 17:50:44,286 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1769534.6666666667, ans=0.125 2023-10-14 17:50:58,891 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1769581.3333333333, ans=0.125 2023-10-14 17:50:59,009 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1769581.3333333333, ans=0.04949747468305833 2023-10-14 17:51:18,374 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1769628.0, ans=0.1 2023-10-14 17:51:20,689 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1769628.0, ans=0.125 2023-10-14 17:51:20,716 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1769628.0, ans=0.09899494936611666 2023-10-14 17:51:26,419 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1769674.6666666667, ans=0.125 2023-10-14 17:52:02,759 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1769814.6666666667, ans=0.5 2023-10-14 17:52:20,227 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.537e+02 1.838e+02 1.981e+02 2.104e+02 2.965e+02, threshold=3.961e+02, percent-clipped=0.0 2023-10-14 17:52:50,708 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1769954.6666666667, ans=0.0 2023-10-14 17:52:51,087 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.21 vs. limit=15.0 2023-10-14 17:52:54,573 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1769954.6666666667, ans=0.0 2023-10-14 17:52:55,048 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.72 vs. limit=15.0 2023-10-14 17:53:07,600 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1770001.3333333333, ans=0.125 2023-10-14 17:53:34,818 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1770094.6666666667, ans=0.1 2023-10-14 17:54:09,878 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.69 vs. 
limit=15.0 2023-10-14 17:54:11,449 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1770234.6666666667, ans=0.0 2023-10-14 17:54:30,989 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.785e+02 1.996e+02 2.167e+02 2.714e+02, threshold=3.992e+02, percent-clipped=0.0 2023-10-14 17:54:53,282 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1770374.6666666667, ans=0.125 2023-10-14 17:55:02,298 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1770421.3333333333, ans=0.125 2023-10-14 17:55:15,689 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1770468.0, ans=0.125 2023-10-14 17:55:15,897 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.75 vs. limit=15.0 2023-10-14 17:55:16,654 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1770468.0, ans=0.07 2023-10-14 17:55:17,609 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1770514.6666666667, ans=0.125 2023-10-14 17:55:34,108 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1770561.3333333333, ans=0.0 2023-10-14 17:55:53,427 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1770608.0, ans=0.2 2023-10-14 17:56:07,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1770654.6666666667, ans=0.125 2023-10-14 17:56:18,811 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1770701.3333333333, ans=0.0 2023-10-14 17:56:32,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1770748.0, ans=0.125 2023-10-14 17:56:39,515 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_ff3.min_abs, batch_count=1770794.6666666667, ans=0.2 2023-10-14 17:56:39,597 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1770794.6666666667, ans=0.125 2023-10-14 17:56:41,329 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.580e+02 1.860e+02 2.084e+02 2.369e+02 3.337e+02, threshold=4.167e+02, percent-clipped=0.0 2023-10-14 17:57:05,100 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1770888.0, ans=0.2 2023-10-14 17:57:21,038 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.07 vs. 
limit=15.0 2023-10-14 17:57:28,806 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1770934.6666666667, ans=0.0 2023-10-14 17:57:45,082 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1771028.0, ans=0.125 2023-10-14 17:57:58,421 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1771074.6666666667, ans=0.015 2023-10-14 17:58:05,798 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1771121.3333333333, ans=0.1 2023-10-14 17:58:05,834 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1771121.3333333333, ans=0.09899494936611666 2023-10-14 17:58:22,560 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1771168.0, ans=0.2 2023-10-14 17:58:47,245 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1771261.3333333333, ans=0.0 2023-10-14 17:58:50,480 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.553e+02 1.828e+02 2.003e+02 2.303e+02 3.052e+02, threshold=4.006e+02, percent-clipped=0.0 2023-10-14 17:58:55,588 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1771261.3333333333, ans=0.125 2023-10-14 17:59:10,034 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1771354.6666666667, ans=0.125 2023-10-14 17:59:38,198 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1771448.0, ans=0.125 2023-10-14 17:59:58,551 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1771541.3333333333, ans=0.125 2023-10-14 18:00:11,168 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1771588.0, ans=0.2 2023-10-14 18:00:18,411 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1771588.0, ans=0.125 2023-10-14 18:00:19,303 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1771634.6666666667, ans=0.125 2023-10-14 18:00:50,022 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.539e+02 1.859e+02 2.052e+02 2.288e+02 2.934e+02, threshold=4.103e+02, percent-clipped=0.0 2023-10-14 18:00:52,336 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1771728.0, ans=0.125 2023-10-14 18:01:09,407 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1771821.3333333333, ans=0.125 2023-10-14 18:01:18,673 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1771821.3333333333, ans=0.0 2023-10-14 18:01:22,079 INFO [train.py:1031] (0/4) Epoch 28, batch 11000, loss[loss=0.1744, simple_loss=0.2687, pruned_loss=0.04011, over 16837.00 frames. ], tot_loss[loss=0.1855, simple_loss=0.2782, pruned_loss=0.04642, over 32695317.03 frames. 
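NOTE: the bulk of this log is scaling.py ScheduledFloat entries. Each one names a module hyperparameter (dropout rates, skip rates, balancer probabilities, bypass scales, whitening limits), the current batch_count, and ans, the value the schedule yields at that point; many schedules decay to a constant as training progresses, which is why so many late-training entries read ans=0.0 or a fixed value such as 0.125. Below is a minimal stand-in, assuming piecewise-linear interpolation between (batch_count, value) breakpoints; icefall's actual ScheduledFloat in scaling.py carries more machinery than this, and the breakpoints shown are illustrative.

    # Sketch: a float whose value is a piecewise-linear function of batch_count.
    from typing import Tuple

    class ScheduledFloatSketch:
        def __init__(self, *points: Tuple[float, float]) -> None:
            # points: (batch_count, value) pairs
            self.points = sorted(points)

        def value(self, batch_count: float) -> float:
            pts = self.points
            if batch_count <= pts[0][0]:
                return pts[0][1]
            if batch_count >= pts[-1][0]:
                return pts[-1][1]
            for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
                if x0 <= batch_count <= x1:
                    t = (batch_count - x0) / (x1 - x0)
                    return y0 + t * (y1 - y0)
            raise AssertionError("unreachable")

    # e.g. a skip rate that decays from 0.2 to 0.0 over the first 4000 batches,
    # the kind of schedule that produces 'ans=0.0' entries late in training:
    skip_rate = ScheduledFloatSketch((0.0, 0.2), (4000.0, 0.0))
    print(skip_rate.value(2000.0))  # 0.1
    print(skip_rate.value(1.77e6))  # 0.0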
], batch size: 72, lr: 1.22e-03, grad_scale: 16.0 2023-10-14 18:01:23,364 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1771868.0, ans=0.125 2023-10-14 18:01:30,094 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1771868.0, ans=0.0 2023-10-14 18:01:32,230 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1771914.6666666667, ans=0.125 2023-10-14 18:01:41,722 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1771914.6666666667, ans=0.0 2023-10-14 18:02:36,633 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten.whitening_limit, batch_count=1772148.0, ans=22.5 2023-10-14 18:02:41,975 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1772148.0, ans=0.09899494936611666 2023-10-14 18:02:53,229 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1772194.6666666667, ans=0.0 2023-10-14 18:02:56,536 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.603e+02 1.846e+02 2.026e+02 2.213e+02 2.856e+02, threshold=4.052e+02, percent-clipped=0.0 2023-10-14 18:03:10,093 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1772241.3333333333, ans=0.125 2023-10-14 18:03:16,924 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1772288.0, ans=0.2 2023-10-14 18:03:29,621 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1772334.6666666667, ans=0.1 2023-10-14 18:03:44,475 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1772381.3333333333, ans=0.0 2023-10-14 18:04:26,083 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.90 vs. 
limit=15.0 2023-10-14 18:04:52,098 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1772568.0, ans=0.125 2023-10-14 18:05:14,203 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.582e+02 1.767e+02 1.964e+02 2.270e+02 3.947e+02, threshold=3.928e+02, percent-clipped=0.0 2023-10-14 18:05:14,555 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1772661.3333333333, ans=0.1 2023-10-14 18:05:14,572 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1772661.3333333333, ans=0.2 2023-10-14 18:05:36,753 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1772754.6666666667, ans=0.0 2023-10-14 18:06:13,421 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_na.min_abs, batch_count=1772848.0, ans=0.02 2023-10-14 18:06:24,480 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1772894.6666666667, ans=0.0 2023-10-14 18:06:31,843 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1772941.3333333333, ans=0.125 2023-10-14 18:06:48,669 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1772988.0, ans=0.125 2023-10-14 18:07:28,019 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.447e+02 1.830e+02 2.022e+02 2.274e+02 3.400e+02, threshold=4.044e+02, percent-clipped=0.0 2023-10-14 18:07:47,339 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1773221.3333333333, ans=0.0 2023-10-14 18:08:21,641 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1773314.6666666667, ans=0.0 2023-10-14 18:08:55,871 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1773408.0, ans=0.2 2023-10-14 18:08:56,805 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1773454.6666666667, ans=0.0 2023-10-14 18:09:13,536 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.37 vs. limit=15.0 2023-10-14 18:09:39,398 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-14 18:09:42,927 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.19 vs. 
limit=15.0 2023-10-14 18:09:48,490 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.628e+02 1.861e+02 2.016e+02 2.275e+02 3.565e+02, threshold=4.032e+02, percent-clipped=0.0 2023-10-14 18:09:51,739 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1773594.6666666667, ans=0.2 2023-10-14 18:10:01,002 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1773641.3333333333, ans=0.125 2023-10-14 18:10:19,400 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1773688.0, ans=0.125 2023-10-14 18:10:19,570 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.60 vs. limit=10.0 2023-10-14 18:10:30,317 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=11.84 vs. limit=22.5 2023-10-14 18:10:34,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1773734.6666666667, ans=0.125 2023-10-14 18:10:38,571 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1773781.3333333333, ans=0.0 2023-10-14 18:10:42,947 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.47 vs. limit=8.0 2023-10-14 18:11:34,531 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 18:12:04,703 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1774014.6666666667, ans=0.1 2023-10-14 18:12:25,780 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.524e+02 1.999e+02 2.173e+02 2.454e+02 4.125e+02, threshold=4.347e+02, percent-clipped=1.0 2023-10-14 18:12:33,288 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.60 vs. limit=22.5 2023-10-14 18:12:59,583 INFO [train.py:1031] (0/4) Epoch 28, batch 11500, loss[loss=0.1881, simple_loss=0.2824, pruned_loss=0.04689, over 16645.00 frames. ], tot_loss[loss=0.1851, simple_loss=0.2778, pruned_loss=0.04618, over 32715627.66 frames. 
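NOTE: the scaling.py Whitening lines fire when a layer's activations are measured as insufficiently "white": metric is a scale-invariant statistic of the per-group feature covariance that equals 1.0 when the covariance is a multiple of the identity and grows as it becomes ill-conditioned, and an entry is printed when metric exceeds limit. The sketch below computes one plausible such metric, the mean diagonal of C @ C divided by the squared mean diagonal of C; this is believed to mirror icefall's _whitening_metric, but the exact form is an assumption.

    # Sketch: covariance-whiteness metric per channel group.
    # Returns 1.0 when the covariance is a multiple of the identity;
    # larger values mean the features are "less white".
    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int) -> float:
        # x: (num_frames, num_channels); channels split into equal groups
        num_frames, num_channels = x.shape
        assert num_channels % num_groups == 0
        cpg = num_channels // num_groups
        x = x.reshape(num_frames, num_groups, cpg).transpose(0, 1)
        x = x - x.mean(dim=1, keepdim=True)       # centered covariance
        cov = torch.matmul(x.transpose(1, 2), x)  # (num_groups, cpg, cpg)
        mean_diag = cov.diagonal(dim1=1, dim2=2).mean()
        # mean diagonal of cov @ cov, via sum of squared entries (cov is symmetric)
        mean_diag_sq = (cov ** 2).sum() / (num_groups * cpg)
        return (mean_diag_sq / (mean_diag ** 2 + 1e-20)).item()

    white = torch.randn(1000, 256)
    print(whitening_metric(white, num_groups=1))       # close to 1
    correlated = white @ torch.randn(256, 256)
    print(whitening_metric(correlated, num_groups=1))  # far above 1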
], batch size: 66, lr: 1.22e-03, grad_scale: 32.0 2023-10-14 18:13:26,112 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1774248.0, ans=0.05 2023-10-14 18:13:41,510 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1774294.6666666667, ans=0.125 2023-10-14 18:13:47,376 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1774341.3333333333, ans=0.1 2023-10-14 18:14:05,930 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1774388.0, ans=0.1 2023-10-14 18:14:21,500 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 18:14:38,021 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1774481.3333333333, ans=0.125 2023-10-14 18:14:47,623 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.570e+02 1.807e+02 2.027e+02 2.215e+02 3.004e+02, threshold=4.055e+02, percent-clipped=0.0 2023-10-14 18:15:04,694 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1774574.6666666667, ans=0.125 2023-10-14 18:15:36,618 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 18:15:45,365 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1774714.6666666667, ans=0.0 2023-10-14 18:16:00,001 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1774761.3333333333, ans=0.2 2023-10-14 18:16:00,007 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1774761.3333333333, ans=0.04949747468305833 2023-10-14 18:16:00,100 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-14 18:16:10,979 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.46 vs. 
limit=15.0 2023-10-14 18:16:21,654 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1774808.0, ans=0.0 2023-10-14 18:16:39,730 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1774854.6666666667, ans=0.07 2023-10-14 18:17:17,809 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1774948.0, ans=0.125 2023-10-14 18:17:24,225 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1774994.6666666667, ans=0.2 2023-10-14 18:17:33,443 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.525e+02 1.760e+02 1.954e+02 2.146e+02 2.843e+02, threshold=3.907e+02, percent-clipped=0.0 2023-10-14 18:17:46,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1775041.3333333333, ans=0.0 2023-10-14 18:18:00,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1775088.0, ans=0.0 2023-10-14 18:18:13,701 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1775134.6666666667, ans=0.125 2023-10-14 18:18:18,997 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.90 vs. limit=6.0 2023-10-14 18:18:25,663 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1775181.3333333333, ans=0.0 2023-10-14 18:18:32,664 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1775181.3333333333, ans=0.125 2023-10-14 18:18:36,429 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1775228.0, ans=0.125 2023-10-14 18:18:50,668 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.03 vs. limit=22.5 2023-10-14 18:18:53,914 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1775274.6666666667, ans=0.0 2023-10-14 18:19:44,565 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 18:20:00,326 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1775461.3333333333, ans=0.0 2023-10-14 18:20:05,230 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.521e+02 1.845e+02 1.997e+02 2.188e+02 2.742e+02, threshold=3.995e+02, percent-clipped=0.0 2023-10-14 18:20:28,973 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1775508.0, ans=0.125 2023-10-14 18:20:36,639 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.90 vs. 
limit=15.0 2023-10-14 18:20:48,128 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1775554.6666666667, ans=0.125 2023-10-14 18:20:58,499 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1775601.3333333333, ans=0.125 2023-10-14 18:21:22,360 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1775648.0, ans=0.125 2023-10-14 18:21:34,685 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=1775694.6666666667, ans=22.5 2023-10-14 18:22:00,820 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1775788.0, ans=0.1 2023-10-14 18:22:13,712 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1775834.6666666667, ans=0.0 2023-10-14 18:22:25,366 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1775881.3333333333, ans=0.125 2023-10-14 18:22:29,531 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.63 vs. limit=10.0 2023-10-14 18:22:30,323 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1775881.3333333333, ans=0.125 2023-10-14 18:22:49,325 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.573e+02 1.849e+02 2.030e+02 2.259e+02 3.116e+02, threshold=4.059e+02, percent-clipped=0.0 2023-10-14 18:22:50,808 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1775928.0, ans=0.0 2023-10-14 18:23:11,340 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.28 vs. limit=22.5 2023-10-14 18:23:51,438 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_ff3.min_abs, batch_count=1776114.6666666667, ans=0.2 2023-10-14 18:23:52,481 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1776114.6666666667, ans=0.0 2023-10-14 18:23:58,646 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1776161.3333333333, ans=0.0 2023-10-14 18:23:58,971 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.73 vs. 
limit=10.0 2023-10-14 18:23:59,685 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1776161.3333333333, ans=0.2 2023-10-14 18:24:44,495 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1776301.3333333333, ans=0.125 2023-10-14 18:24:45,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1776301.3333333333, ans=0.125 2023-10-14 18:25:16,362 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1776394.6666666667, ans=0.2 2023-10-14 18:25:17,282 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1776394.6666666667, ans=0.0 2023-10-14 18:25:24,359 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.422e+02 1.844e+02 2.117e+02 2.402e+02 3.426e+02, threshold=4.235e+02, percent-clipped=0.0 2023-10-14 18:25:28,046 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.73 vs. limit=10.0 2023-10-14 18:25:33,279 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1776441.3333333333, ans=0.0 2023-10-14 18:25:42,383 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1776441.3333333333, ans=0.2 2023-10-14 18:25:59,529 INFO [train.py:1031] (0/4) Epoch 28, batch 12000, loss[loss=0.1927, simple_loss=0.2854, pruned_loss=0.04994, over 16860.00 frames. ], tot_loss[loss=0.1851, simple_loss=0.278, pruned_loss=0.04607, over 32741524.00 frames. ], batch size: 130, lr: 1.22e-03, grad_scale: 16.0 2023-10-14 18:26:00,453 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.32 vs. limit=22.5 2023-10-14 18:26:26,288 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-10-14 18:26:39,936 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1776628.0, ans=0.125 2023-10-14 18:27:03,304 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1776721.3333333333, ans=0.0 2023-10-14 18:27:09,762 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1776721.3333333333, ans=0.2 2023-10-14 18:27:16,223 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.68 vs. 
limit=15.0 2023-10-14 18:27:51,256 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1776861.3333333333, ans=0.125 2023-10-14 18:27:52,167 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.517e+02 1.796e+02 1.974e+02 2.184e+02 3.243e+02, threshold=3.948e+02, percent-clipped=0.0 2023-10-14 18:27:56,799 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1776908.0, ans=0.1 2023-10-14 18:28:09,210 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1776908.0, ans=0.125 2023-10-14 18:28:12,076 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1776908.0, ans=0.0 2023-10-14 18:28:16,529 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1776954.6666666667, ans=0.125 2023-10-14 18:28:40,495 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0 2023-10-14 18:29:17,931 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1777141.3333333333, ans=0.0 2023-10-14 18:29:26,270 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1777141.3333333333, ans=0.025 2023-10-14 18:29:51,355 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1777234.6666666667, ans=0.125 2023-10-14 18:29:51,465 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1777234.6666666667, ans=0.07 2023-10-14 18:30:21,099 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.868e+02 2.112e+02 2.473e+02 3.637e+02, threshold=4.224e+02, percent-clipped=0.0 2023-10-14 18:30:25,548 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1777374.6666666667, ans=0.1 2023-10-14 18:30:32,958 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.80 vs. limit=22.5 2023-10-14 18:30:33,627 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1777374.6666666667, ans=0.0 2023-10-14 18:31:08,236 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1777468.0, ans=0.1 2023-10-14 18:32:01,615 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1777654.6666666667, ans=0.1 2023-10-14 18:32:08,756 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.16 vs. 
limit=10.0 2023-10-14 18:32:13,330 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1777701.3333333333, ans=0.0 2023-10-14 18:32:29,150 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 18:32:54,574 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1777794.6666666667, ans=0.0 2023-10-14 18:32:56,323 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.629e+02 1.824e+02 1.979e+02 2.184e+02 2.986e+02, threshold=3.959e+02, percent-clipped=0.0 2023-10-14 18:33:14,672 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys.whitening_limit, batch_count=1777841.3333333333, ans=6.0 2023-10-14 18:33:24,986 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.02 vs. limit=15.0 2023-10-14 18:33:54,296 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1777981.3333333333, ans=0.125 2023-10-14 18:34:01,344 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1777981.3333333333, ans=0.125 2023-10-14 18:34:08,774 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 18:34:53,912 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1778121.3333333333, ans=0.125 2023-10-14 18:35:00,117 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1778168.0, ans=0.0 2023-10-14 18:35:02,582 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.87 vs. limit=15.0 2023-10-14 18:35:03,580 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1778168.0, ans=0.125 2023-10-14 18:35:29,940 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1778261.3333333333, ans=0.125 2023-10-14 18:35:32,652 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.864e+02 2.012e+02 2.236e+02 3.219e+02, threshold=4.023e+02, percent-clipped=0.0 2023-10-14 18:35:43,944 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1778308.0, ans=0.1 2023-10-14 18:35:43,981 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1778308.0, ans=0.2 2023-10-14 18:35:57,823 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1778354.6666666667, ans=0.0 2023-10-14 18:35:58,005 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.41 vs. 
limit=15.0 2023-10-14 18:36:31,419 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1.whitening_limit, batch_count=1778494.6666666667, ans=10.0 2023-10-14 18:36:57,075 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.46 vs. limit=15.0 2023-10-14 18:36:57,721 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1778588.0, ans=0.125 2023-10-14 18:37:05,362 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1778634.6666666667, ans=0.125 2023-10-14 18:37:28,820 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1778728.0, ans=0.035 2023-10-14 18:37:32,325 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.662e+02 1.919e+02 2.160e+02 2.359e+02 3.188e+02, threshold=4.320e+02, percent-clipped=0.0 2023-10-14 18:37:59,668 INFO [train.py:1031] (0/4) Epoch 28, batch 12500, loss[loss=0.1925, simple_loss=0.2628, pruned_loss=0.06111, over 12726.00 frames. ], tot_loss[loss=0.1848, simple_loss=0.2776, pruned_loss=0.04601, over 32764702.62 frames. ], batch size: 440, lr: 1.21e-03, grad_scale: 16.0 2023-10-14 18:38:05,408 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1778868.0, ans=0.125 2023-10-14 18:38:06,333 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1778868.0, ans=0.1 2023-10-14 18:38:10,784 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1778914.6666666667, ans=0.125 2023-10-14 18:38:14,866 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1778914.6666666667, ans=0.125 2023-10-14 18:38:19,912 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1778914.6666666667, ans=0.0 2023-10-14 18:38:21,906 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1778961.3333333333, ans=0.0 2023-10-14 18:38:42,429 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1779054.6666666667, ans=0.125 2023-10-14 18:38:45,330 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1779054.6666666667, ans=0.0 2023-10-14 18:39:07,764 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=1779148.0, ans=0.0 2023-10-14 18:39:11,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1779148.0, ans=0.125 2023-10-14 18:39:13,573 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1779148.0, ans=0.1 2023-10-14 18:39:17,194 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1779194.6666666667, ans=0.0 2023-10-14 18:39:23,077 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.616e+02 1.850e+02 2.015e+02 
2.266e+02 3.062e+02, threshold=4.030e+02, percent-clipped=0.0 2023-10-14 18:39:24,296 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.82 vs. limit=22.5 2023-10-14 18:39:29,391 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1779241.3333333333, ans=0.0 2023-10-14 18:39:31,980 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.78 vs. limit=15.0 2023-10-14 18:39:39,800 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1779288.0, ans=0.1 2023-10-14 18:39:53,754 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1779334.6666666667, ans=0.5 2023-10-14 18:39:56,739 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1779334.6666666667, ans=0.2 2023-10-14 18:40:05,340 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1779381.3333333333, ans=0.1 2023-10-14 18:40:22,200 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1779428.0, ans=0.0 2023-10-14 18:40:35,497 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1779521.3333333333, ans=0.125 2023-10-14 18:40:38,994 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1779521.3333333333, ans=0.125 2023-10-14 18:41:18,016 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.11 vs. limit=15.0 2023-10-14 18:41:20,180 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.542e+02 1.844e+02 1.985e+02 2.171e+02 2.842e+02, threshold=3.971e+02, percent-clipped=0.0 2023-10-14 18:41:20,810 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.51 vs. limit=12.0 2023-10-14 18:41:28,608 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1779708.0, ans=0.0 2023-10-14 18:41:30,378 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1779708.0, ans=0.125 2023-10-14 18:41:41,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1779754.6666666667, ans=0.0 2023-10-14 18:41:50,493 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1779801.3333333333, ans=0.125 2023-10-14 18:42:10,750 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.42 vs. 
limit=15.0 2023-10-14 18:42:15,850 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1779894.6666666667, ans=0.035 2023-10-14 18:42:41,552 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1779988.0, ans=0.125 2023-10-14 18:42:52,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1780034.6666666667, ans=0.125 2023-10-14 18:43:16,472 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.494e+02 1.819e+02 1.998e+02 2.182e+02 3.079e+02, threshold=3.996e+02, percent-clipped=0.0 2023-10-14 18:43:25,043 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1780174.6666666667, ans=0.0 2023-10-14 18:43:30,418 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1780221.3333333333, ans=0.2 2023-10-14 18:43:52,091 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.67 vs. limit=15.0 2023-10-14 18:43:55,136 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1780314.6666666667, ans=0.0 2023-10-14 18:44:23,859 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1780408.0, ans=0.125 2023-10-14 18:44:27,743 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1780408.0, ans=0.1 2023-10-14 18:44:39,899 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1780454.6666666667, ans=0.125 2023-10-14 18:45:05,806 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1780548.0, ans=0.07 2023-10-14 18:45:15,414 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.570e+02 1.826e+02 1.980e+02 2.187e+02 3.955e+02, threshold=3.960e+02, percent-clipped=0.0 2023-10-14 18:45:20,767 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1780641.3333333333, ans=0.125 2023-10-14 18:46:40,041 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.51 vs. 
limit=12.0 2023-10-14 18:46:57,895 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1781014.6666666667, ans=0.0 2023-10-14 18:47:03,641 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1781014.6666666667, ans=0.125 2023-10-14 18:47:11,847 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.773e+02 1.943e+02 2.150e+02 2.610e+02, threshold=3.886e+02, percent-clipped=0.0 2023-10-14 18:47:27,275 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1781154.6666666667, ans=0.125 2023-10-14 18:47:30,754 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 18:47:39,956 INFO [train.py:1031] (0/4) Epoch 28, batch 13000, loss[loss=0.1914, simple_loss=0.2832, pruned_loss=0.04983, over 16585.00 frames. ], tot_loss[loss=0.1852, simple_loss=0.2783, pruned_loss=0.04608, over 32781197.74 frames. ], batch size: 241, lr: 1.21e-03, grad_scale: 32.0 2023-10-14 18:47:54,034 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1781248.0, ans=0.2 2023-10-14 18:48:24,991 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1781341.3333333333, ans=0.0 2023-10-14 18:48:27,025 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1781341.3333333333, ans=0.1 2023-10-14 18:48:38,037 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1781388.0, ans=0.125 2023-10-14 18:48:54,363 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=13.51 vs. limit=15.0 2023-10-14 18:49:05,434 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.72 vs. limit=15.0 2023-10-14 18:49:16,160 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1781481.3333333333, ans=0.2 2023-10-14 18:49:19,605 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.12 vs. limit=15.0 2023-10-14 18:49:19,684 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.83 vs. limit=15.0 2023-10-14 18:49:20,649 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1781528.0, ans=0.0 2023-10-14 18:49:31,541 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.595e+02 1.855e+02 2.068e+02 2.346e+02 3.226e+02, threshold=4.136e+02, percent-clipped=0.0 2023-10-14 18:49:34,144 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.24 vs. limit=12.0 2023-10-14 18:49:52,510 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.15 vs. 
limit=22.5 2023-10-14 18:50:02,806 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1781668.0, ans=0.125 2023-10-14 18:50:23,376 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1781761.3333333333, ans=0.2 2023-10-14 18:50:41,902 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1781808.0, ans=0.125 2023-10-14 18:51:19,341 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1781948.0, ans=0.025 2023-10-14 18:51:37,380 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.513e+02 1.859e+02 2.059e+02 2.320e+02 3.388e+02, threshold=4.118e+02, percent-clipped=0.0 2023-10-14 18:51:42,416 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1782041.3333333333, ans=0.2 2023-10-14 18:51:46,768 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.34 vs. limit=6.0 2023-10-14 18:52:05,795 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1782088.0, ans=0.5 2023-10-14 18:52:15,391 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1782134.6666666667, ans=0.125 2023-10-14 18:52:25,863 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1782181.3333333333, ans=0.0 2023-10-14 18:53:08,287 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1782321.3333333333, ans=0.1 2023-10-14 18:53:08,294 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1782321.3333333333, ans=0.0 2023-10-14 18:53:10,343 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1782368.0, ans=0.125 2023-10-14 18:53:13,558 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-10-14 18:53:23,792 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1782414.6666666667, ans=0.125 2023-10-14 18:53:39,972 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1782461.3333333333, ans=0.125 2023-10-14 18:53:42,656 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1782461.3333333333, ans=0.1 2023-10-14 18:53:42,996 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.93 vs. 
limit=15.0 2023-10-14 18:53:47,469 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.502e+02 1.790e+02 1.901e+02 2.120e+02 3.380e+02, threshold=3.803e+02, percent-clipped=0.0 2023-10-14 18:53:53,507 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1782508.0, ans=0.125 2023-10-14 18:54:12,293 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1782554.6666666667, ans=0.125 2023-10-14 18:54:30,242 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1782648.0, ans=0.125 2023-10-14 18:54:36,820 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1782648.0, ans=0.125 2023-10-14 18:54:42,335 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1782694.6666666667, ans=0.2 2023-10-14 18:54:44,231 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1782694.6666666667, ans=0.1 2023-10-14 18:55:16,769 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.62 vs. limit=15.0 2023-10-14 18:55:22,912 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.42 vs. limit=6.0 2023-10-14 18:55:30,804 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.15 vs. limit=10.0 2023-10-14 18:55:49,116 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1782928.0, ans=0.1 2023-10-14 18:55:50,332 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.833e+02 2.038e+02 2.248e+02 3.459e+02, threshold=4.075e+02, percent-clipped=0.0 2023-10-14 18:56:25,394 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.79 vs. limit=15.0 2023-10-14 18:56:43,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1783114.6666666667, ans=0.2 2023-10-14 18:56:52,698 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1783161.3333333333, ans=0.0 2023-10-14 18:57:21,142 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 18:57:26,433 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1783301.3333333333, ans=0.125 2023-10-14 18:57:35,784 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1783348.0, ans=0.1 2023-10-14 18:57:57,892 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.553e+02 1.838e+02 2.011e+02 2.192e+02 2.929e+02, threshold=4.023e+02, percent-clipped=0.0 2023-10-14 18:58:23,896 INFO [train.py:1031] (0/4) Epoch 28, batch 13500, loss[loss=0.1725, simple_loss=0.2685, pruned_loss=0.03824, over 16785.00 frames. 
], tot_loss[loss=0.1848, simple_loss=0.2776, pruned_loss=0.04597, over 32779838.21 frames. ], batch size: 81, lr: 1.21e-03, grad_scale: 16.0 2023-10-14 18:58:27,369 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1783534.6666666667, ans=0.1 2023-10-14 18:58:27,442 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1783534.6666666667, ans=0.125 2023-10-14 18:59:07,135 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 18:59:16,511 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.63 vs. limit=22.5 2023-10-14 18:59:16,668 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.64 vs. limit=8.0 2023-10-14 18:59:31,634 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1783768.0, ans=0.04949747468305833 2023-10-14 18:59:49,856 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1783861.3333333333, ans=0.125 2023-10-14 19:00:03,132 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.436e+02 1.891e+02 2.072e+02 2.277e+02 3.149e+02, threshold=4.144e+02, percent-clipped=0.0 2023-10-14 19:00:04,988 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1783908.0, ans=0.125 2023-10-14 19:00:07,655 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.66 vs. limit=15.0 2023-10-14 19:00:17,866 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1783954.6666666667, ans=0.2 2023-10-14 19:00:19,895 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1783954.6666666667, ans=0.5 2023-10-14 19:00:31,690 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1784001.3333333333, ans=0.125 2023-10-14 19:00:32,822 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1784001.3333333333, ans=0.07 2023-10-14 19:00:42,593 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.72 vs. limit=15.0 2023-10-14 19:00:48,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1784048.0, ans=0.2 2023-10-14 19:01:00,712 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.47 vs. limit=6.0 2023-10-14 19:01:30,793 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/epoch-28.pt 2023-10-14 19:02:05,199 INFO [train.py:1031] (0/4) Epoch 29, batch 0, loss[loss=0.1769, simple_loss=0.2658, pruned_loss=0.04398, over 16650.00 frames. ], tot_loss[loss=0.1769, simple_loss=0.2658, pruned_loss=0.04398, over 16650.00 frames. 
], batch size: 268, lr: 1.19e-03, grad_scale: 32.0 2023-10-14 19:02:05,201 INFO [train.py:1054] (0/4) Computing validation loss 2023-10-14 19:02:11,209 INFO [zipformer.py:1853] (0/4) name=encoder.encoders.5.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([2.6428, 2.8650, 1.9682, 4.4763], device='cuda:0') 2023-10-14 19:02:11,945 INFO [zipformer.py:1853] (0/4) name=encoder.encoders.3.encoder.layers.3.self_attn_weights, attn_weights_entropy = tensor([2.3186, 3.7828, 3.2897, 3.5218, 2.9891, 2.7668, 3.8460, 3.0776], device='cuda:0') 2023-10-14 19:02:13,483 INFO [train.py:1063] (0/4) Epoch 29, validation: loss=0.2131, simple_loss=0.2995, pruned_loss=0.06338, over 1020973.00 frames. 2023-10-14 19:02:13,483 INFO [train.py:1064] (0/4) Maximum memory allocated so far is 17165MB 2023-10-14 19:02:45,428 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.545e+02 1.901e+02 2.123e+02 2.345e+02 3.738e+02, threshold=4.247e+02, percent-clipped=0.0 2023-10-14 19:02:51,162 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1784351.3333333333, ans=0.0 2023-10-14 19:03:37,624 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1784538.0, ans=0.025 2023-10-14 19:03:47,169 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1784538.0, ans=0.125 2023-10-14 19:04:58,072 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.817e+02 1.909e+02 2.080e+02 2.817e+02, threshold=3.818e+02, percent-clipped=0.0 2023-10-14 19:05:19,650 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1784911.3333333333, ans=0.125 2023-10-14 19:05:19,812 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.84 vs. 
limit=12.0 2023-10-14 19:05:34,470 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1784958.0, ans=0.0 2023-10-14 19:05:53,779 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1785004.6666666667, ans=0.1 2023-10-14 19:06:40,058 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1785191.3333333333, ans=0.2 2023-10-14 19:06:40,096 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1785191.3333333333, ans=0.2 2023-10-14 19:06:56,509 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1785284.6666666667, ans=0.015 2023-10-14 19:07:02,650 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1785284.6666666667, ans=0.125 2023-10-14 19:07:03,236 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.473e+02 1.822e+02 2.063e+02 2.222e+02 3.116e+02, threshold=4.126e+02, percent-clipped=0.0 2023-10-14 19:07:17,301 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1785331.3333333333, ans=0.0 2023-10-14 19:07:17,608 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.40 vs. limit=15.0 2023-10-14 19:07:21,377 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1785378.0, ans=0.125 2023-10-14 19:07:38,539 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1785424.6666666667, ans=0.125 2023-10-14 19:07:50,054 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.29 vs. limit=22.5 2023-10-14 19:07:53,825 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1785471.3333333333, ans=0.0 2023-10-14 19:07:54,795 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1785471.3333333333, ans=0.125 2023-10-14 19:08:43,632 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1785658.0, ans=0.125 2023-10-14 19:09:01,113 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1785704.6666666667, ans=0.1 2023-10-14 19:09:02,127 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1785704.6666666667, ans=0.95 2023-10-14 19:09:02,553 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.22 vs. 
limit=15.0 2023-10-14 19:09:18,915 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.477e+02 1.795e+02 1.962e+02 2.085e+02 4.412e+02, threshold=3.924e+02, percent-clipped=1.0 2023-10-14 19:09:52,055 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1785891.3333333333, ans=0.0 2023-10-14 19:09:52,296 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.39 vs. limit=12.0 2023-10-14 19:10:11,572 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1785938.0, ans=0.125 2023-10-14 19:10:15,963 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1785984.6666666667, ans=0.0 2023-10-14 19:10:39,181 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1786078.0, ans=0.0 2023-10-14 19:10:40,208 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1786078.0, ans=0.125 2023-10-14 19:10:45,729 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.91 vs. limit=15.0 2023-10-14 19:10:52,924 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.32 vs. limit=15.0 2023-10-14 19:11:03,483 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=1786171.3333333333, ans=22.5 2023-10-14 19:11:07,345 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1786171.3333333333, ans=0.1 2023-10-14 19:11:15,995 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1786171.3333333333, ans=0.1 2023-10-14 19:11:25,934 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.602e+02 1.935e+02 2.088e+02 2.377e+02 3.287e+02, threshold=4.176e+02, percent-clipped=0.0 2023-10-14 19:11:39,783 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1786264.6666666667, ans=0.125 2023-10-14 19:12:23,849 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1786404.6666666667, ans=0.1 2023-10-14 19:12:55,586 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1786544.6666666667, ans=0.125 2023-10-14 19:13:08,909 INFO [train.py:1031] (0/4) Epoch 29, batch 500, loss[loss=0.1794, simple_loss=0.2664, pruned_loss=0.04617, over 16020.00 frames. ], tot_loss[loss=0.1846, simple_loss=0.277, pruned_loss=0.04607, over 7271926.63 frames. 
], batch size: 296, lr: 1.19e-03, grad_scale: 32.0 2023-10-14 19:13:18,408 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1786591.3333333333, ans=0.2 2023-10-14 19:13:18,519 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1786591.3333333333, ans=0.09899494936611666 2023-10-14 19:13:22,233 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1786638.0, ans=0.0 2023-10-14 19:13:41,152 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.548e+02 1.826e+02 2.008e+02 2.215e+02 3.537e+02, threshold=4.016e+02, percent-clipped=0.0 2023-10-14 19:14:02,181 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=1786778.0, ans=0.05 2023-10-14 19:14:08,097 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.57 vs. limit=12.0 2023-10-14 19:14:10,792 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.72 vs. limit=15.0 2023-10-14 19:14:12,584 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.39 vs. limit=15.0 2023-10-14 19:14:19,527 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1786824.6666666667, ans=0.125 2023-10-14 19:14:28,239 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1786871.3333333333, ans=0.0 2023-10-14 19:14:53,500 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1786964.6666666667, ans=0.0 2023-10-14 19:15:01,149 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1786964.6666666667, ans=0.125 2023-10-14 19:15:43,987 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1787151.3333333333, ans=0.2 2023-10-14 19:15:48,074 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.524e+02 1.877e+02 2.016e+02 2.278e+02 3.167e+02, threshold=4.031e+02, percent-clipped=0.0 2023-10-14 19:15:55,152 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1787198.0, ans=0.125 2023-10-14 19:16:25,430 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.25 vs. 
limit=15.0 2023-10-14 19:16:37,748 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1787338.0, ans=0.0 2023-10-14 19:16:47,496 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 19:16:47,661 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1787384.6666666667, ans=0.125 2023-10-14 19:16:58,459 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1787431.3333333333, ans=0.125 2023-10-14 19:17:20,143 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1787478.0, ans=0.1 2023-10-14 19:17:21,867 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1787524.6666666667, ans=0.0 2023-10-14 19:17:22,726 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1787524.6666666667, ans=0.125 2023-10-14 19:17:49,600 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.074e-02 2023-10-14 19:17:52,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1787618.0, ans=0.1 2023-10-14 19:17:53,305 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.545e+02 1.911e+02 2.014e+02 2.243e+02 3.292e+02, threshold=4.028e+02, percent-clipped=0.0 2023-10-14 19:19:22,417 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.73 vs. limit=15.0 2023-10-14 19:19:33,422 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.77 vs. limit=15.0 2023-10-14 19:19:48,317 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.62 vs. limit=22.5 2023-10-14 19:20:20,259 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.607e+02 1.945e+02 2.097e+02 2.352e+02 3.649e+02, threshold=4.194e+02, percent-clipped=0.0 2023-10-14 19:20:35,962 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.15 vs. limit=6.0 2023-10-14 19:20:52,037 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.42 vs. limit=15.0 2023-10-14 19:21:40,566 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1788364.6666666667, ans=0.1 2023-10-14 19:22:00,524 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.20 vs. limit=12.0 2023-10-14 19:22:23,405 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1788504.6666666667, ans=0.125 2023-10-14 19:22:36,019 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.29 vs. 
limit=15.0 2023-10-14 19:22:37,950 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.634e+02 1.825e+02 2.024e+02 2.199e+02 2.831e+02, threshold=4.048e+02, percent-clipped=0.0 2023-10-14 19:22:38,428 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1788551.3333333333, ans=0.125 2023-10-14 19:22:48,531 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.82 vs. limit=15.0 2023-10-14 19:22:50,839 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1788598.0, ans=0.0 2023-10-14 19:23:06,791 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.92 vs. limit=15.0 2023-10-14 19:23:31,243 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1788738.0, ans=0.125 2023-10-14 19:23:34,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1788738.0, ans=0.0 2023-10-14 19:23:40,427 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1788784.6666666667, ans=0.125 2023-10-14 19:23:40,437 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1788784.6666666667, ans=0.125 2023-10-14 19:24:04,100 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1788878.0, ans=0.0 2023-10-14 19:24:19,153 INFO [train.py:1031] (0/4) Epoch 29, batch 1000, loss[loss=0.1724, simple_loss=0.27, pruned_loss=0.03736, over 16403.00 frames. ], tot_loss[loss=0.1856, simple_loss=0.2781, pruned_loss=0.04659, over 12913391.47 frames. ], batch size: 50, lr: 1.19e-03, grad_scale: 32.0 2023-10-14 19:24:19,447 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1788924.6666666667, ans=0.125 2023-10-14 19:24:32,624 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.92 vs. limit=10.0 2023-10-14 19:24:49,279 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.04 vs. 
limit=22.5 2023-10-14 19:24:49,615 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.544e+02 1.765e+02 1.924e+02 2.151e+02 2.640e+02, threshold=3.849e+02, percent-clipped=0.0 2023-10-14 19:24:59,113 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1789064.6666666667, ans=0.0 2023-10-14 19:25:20,895 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1789158.0, ans=0.125 2023-10-14 19:25:21,695 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1789158.0, ans=10.0 2023-10-14 19:25:24,067 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1789158.0, ans=0.125 2023-10-14 19:25:25,847 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1789158.0, ans=0.2 2023-10-14 19:25:32,995 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1789204.6666666667, ans=0.125 2023-10-14 19:25:38,227 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1789204.6666666667, ans=0.125 2023-10-14 19:25:51,419 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1789251.3333333333, ans=0.1 2023-10-14 19:26:03,039 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1789298.0, ans=0.125 2023-10-14 19:26:07,219 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1789298.0, ans=0.125 2023-10-14 19:26:08,253 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1789298.0, ans=0.2 2023-10-14 19:26:09,217 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1789344.6666666667, ans=0.125 2023-10-14 19:26:13,883 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.71 vs. 
limit=15.0 2023-10-14 19:26:27,581 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1789391.3333333333, ans=0.0 2023-10-14 19:26:34,975 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1789391.3333333333, ans=0.1 2023-10-14 19:26:39,581 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1789438.0, ans=0.035 2023-10-14 19:26:51,886 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1789484.6666666667, ans=0.1 2023-10-14 19:26:58,023 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1789484.6666666667, ans=0.2 2023-10-14 19:26:59,087 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1789484.6666666667, ans=0.125 2023-10-14 19:26:59,841 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 1.851e+02 2.013e+02 2.294e+02 3.227e+02, threshold=4.026e+02, percent-clipped=0.0 2023-10-14 19:27:29,535 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1789578.0, ans=0.1 2023-10-14 19:27:31,472 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1789624.6666666667, ans=0.1 2023-10-14 19:28:07,508 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1789718.0, ans=0.125 2023-10-14 19:28:28,076 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1789764.6666666667, ans=0.125 2023-10-14 19:28:44,005 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1789811.3333333333, ans=0.125 2023-10-14 19:29:04,030 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1789904.6666666667, ans=0.0 2023-10-14 19:29:25,287 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 1.745e+02 1.905e+02 2.132e+02 2.871e+02, threshold=3.810e+02, percent-clipped=0.0 2023-10-14 19:30:25,315 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1790184.6666666667, ans=0.0 2023-10-14 19:31:16,102 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1790371.3333333333, ans=0.0 2023-10-14 19:31:30,022 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1790418.0, ans=0.0 2023-10-14 19:31:33,602 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1790418.0, ans=0.0 2023-10-14 19:31:35,188 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.97 vs. 
limit=22.5 2023-10-14 19:31:36,532 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.766e+02 1.912e+02 2.084e+02 3.000e+02, threshold=3.824e+02, percent-clipped=0.0 2023-10-14 19:31:52,685 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1790511.3333333333, ans=0.125 2023-10-14 19:32:07,521 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_na.min_abs, batch_count=1790558.0, ans=0.02 2023-10-14 19:32:13,205 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1790558.0, ans=0.125 2023-10-14 19:32:18,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1790604.6666666667, ans=0.05 2023-10-14 19:32:30,748 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1790651.3333333333, ans=0.2 2023-10-14 19:32:38,991 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1790651.3333333333, ans=0.0 2023-10-14 19:32:40,947 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1790698.0, ans=0.2 2023-10-14 19:32:46,642 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1790698.0, ans=0.125 2023-10-14 19:32:58,211 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1790744.6666666667, ans=0.125 2023-10-14 19:33:30,171 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1790838.0, ans=0.2 2023-10-14 19:33:40,324 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.395e+02 1.783e+02 1.989e+02 2.163e+02 2.964e+02, threshold=3.978e+02, percent-clipped=0.0 2023-10-14 19:33:49,361 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-14 19:33:57,029 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1790931.3333333333, ans=0.0 2023-10-14 19:35:00,169 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1791164.6666666667, ans=0.0 2023-10-14 19:35:27,050 INFO [train.py:1031] (0/4) Epoch 29, batch 1500, loss[loss=0.2036, simple_loss=0.2918, pruned_loss=0.05775, over 16883.00 frames. ], tot_loss[loss=0.1844, simple_loss=0.2768, pruned_loss=0.04593, over 17304777.02 frames. 
], batch size: 116, lr: 1.19e-03, grad_scale: 16.0 2023-10-14 19:35:42,521 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 19:35:50,378 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1791304.6666666667, ans=10.0 2023-10-14 19:36:14,331 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.551e+02 1.832e+02 2.009e+02 2.326e+02 3.322e+02, threshold=4.019e+02, percent-clipped=0.0 2023-10-14 19:36:29,302 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1791398.0, ans=0.125 2023-10-14 19:36:33,091 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.23 vs. limit=22.5 2023-10-14 19:36:44,825 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1791444.6666666667, ans=0.125 2023-10-14 19:37:02,389 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1791538.0, ans=0.125 2023-10-14 19:37:21,837 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1791584.6666666667, ans=0.125 2023-10-14 19:37:22,629 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1791584.6666666667, ans=0.1 2023-10-14 19:37:25,990 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1791584.6666666667, ans=0.05 2023-10-14 19:37:41,251 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1791631.3333333333, ans=0.0 2023-10-14 19:37:42,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1791631.3333333333, ans=0.2 2023-10-14 19:37:51,956 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1791678.0, ans=0.125 2023-10-14 19:38:15,557 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 19:38:19,465 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1791724.6666666667, ans=0.0 2023-10-14 19:38:50,425 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.62 vs. 
limit=22.5 2023-10-14 19:38:50,658 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.582e+02 1.903e+02 2.166e+02 2.383e+02 3.382e+02, threshold=4.333e+02, percent-clipped=0.0 2023-10-14 19:39:02,631 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1791864.6666666667, ans=0.0 2023-10-14 19:39:05,930 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1791864.6666666667, ans=0.1 2023-10-14 19:39:48,349 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-384000.pt 2023-10-14 19:40:05,233 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1792051.3333333333, ans=0.125 2023-10-14 19:40:15,088 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1792051.3333333333, ans=0.125 2023-10-14 19:40:51,367 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1792144.6666666667, ans=0.125 2023-10-14 19:40:57,044 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1792191.3333333333, ans=0.1 2023-10-14 19:41:14,097 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1792238.0, ans=0.125 2023-10-14 19:41:37,562 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1792284.6666666667, ans=0.125 2023-10-14 19:41:37,860 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.75 vs. limit=10.0
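
The checkpoint.py:75 entry above writes checkpoint-384000.pt, a batch-indexed checkpoint saved on a fixed batch interval, independently of the per-epoch saves. A hedged sketch of that pattern; the interval, the filename pattern, and the function name are assumptions, not icefall's exact code.

```python
# Sketch of batch-interval checkpointing matching the 'Saving checkpoint to
# zipformer/exp_XL_bpe/checkpoint-384000.pt' entry; details are assumptions.
from pathlib import Path

import torch

def maybe_save_batch_checkpoint(model: torch.nn.Module,
                                optimizer: torch.optim.Optimizer,
                                exp_dir: Path,
                                batch_idx_train: int,
                                save_every_n: int = 8000) -> None:
    """Write checkpoint-<batch>.pt every `save_every_n` training batches."""
    if batch_idx_train == 0 or batch_idx_train % save_every_n != 0:
        return
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "batch_idx_train": batch_idx_train,
        },
        exp_dir / f"checkpoint-{batch_idx_train}.pt",
    )
```
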
2023-10-14 19:41:38,212 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.589e+02 1.817e+02 1.967e+02 2.158e+02 2.658e+02, threshold=3.933e+02, percent-clipped=0.0 2023-10-14 19:41:46,329 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1792331.3333333333, ans=0.125 2023-10-14 19:41:51,294 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1792331.3333333333, ans=0.1 2023-10-14 19:42:08,404 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-14 19:42:29,150 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1792471.3333333333, ans=0.1 2023-10-14 19:42:34,925 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 19:42:39,059 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1792471.3333333333, ans=0.0 2023-10-14 19:43:13,864 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1792564.6666666667, ans=0.1 2023-10-14 19:43:16,786 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1792564.6666666667, ans=0.0 2023-10-14 19:43:27,689 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1792611.3333333333, ans=0.125 2023-10-14 19:43:27,696 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1792611.3333333333, ans=0.0 2023-10-14 19:44:36,862 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1792751.3333333333, ans=0.125 2023-10-14 19:44:45,471 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.560e+02 1.852e+02 2.057e+02 2.251e+02 3.007e+02, threshold=4.114e+02, percent-clipped=0.0 2023-10-14 19:44:51,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1792798.0, ans=0.125 2023-10-14 19:45:19,180 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.87 vs. limit=22.5 2023-10-14 19:45:38,392 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1792891.3333333333, ans=0.0 2023-10-14 19:45:42,407 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1792891.3333333333, ans=0.125 2023-10-14 19:45:52,712 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.84 vs. limit=22.5
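
The scaling.py:979 Whitening entries report a per-module statistic against a scheduled limit (metric=15.84 vs. limit=22.5 just above): a measure of how far the activation covariance is from a scaled identity, used to keep feature directions from collapsing. One standard such metric is the ratio of the mean squared eigenvalue to the squared mean eigenvalue of the covariance, which equals 1 for a perfectly "white" signal and grows as energy concentrates in a few directions; the sketch below is an illustrative reconstruction along those lines, not scaling.py's exact code.

```python
# Illustrative eigenvalue-dispersion whitening metric of the kind the
# 'Whitening: ... metric=M vs. limit=L' entries appear to report.
import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
    """x: (num_frames, num_channels). Returns a value >= 1 that equals 1
    iff each channel group's covariance is a multiple of the identity."""
    n, c = x.shape
    assert c % num_groups == 0
    x = x.reshape(n, num_groups, c // num_groups).transpose(0, 1)
    x = x - x.mean(dim=1, keepdim=True)
    cov = torch.matmul(x.transpose(1, 2), x) / n  # per-group covariance
    d = cov.shape[-1]
    # mean(eig^2) / mean(eig)^2 via traces: tr(C @ C) = sum(eig^2).
    eig_sq_mean = (cov * cov).sum(dim=(1, 2)) / d
    eig_mean = torch.diagonal(cov, dim1=1, dim2=2).sum(dim=1) / d
    return (eig_sq_mean / eig_mean.clamp(min=1e-20) ** 2).mean()

# A near-white signal scores ~1, far below limits like 22.5:
print(whitening_metric(torch.randn(10000, 384)))
```
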
2023-10-14 19:46:32,670 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1792984.6666666667, ans=0.0 2023-10-14 19:47:07,985 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1793078.0, ans=0.2 2023-10-14 19:47:24,637 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 19:47:50,027 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1793171.3333333333, ans=0.1 2023-10-14 19:47:50,818 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1793171.3333333333, ans=0.0 2023-10-14 19:47:51,276 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=10.84 vs. limit=15.0 2023-10-14 19:48:08,528 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1793218.0, ans=0.2 2023-10-14 19:48:18,259 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.91 vs. limit=12.0 2023-10-14 19:48:18,678 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.512e+02 1.855e+02 2.035e+02 2.252e+02 3.047e+02, threshold=4.071e+02, percent-clipped=0.0 2023-10-14 19:48:26,970 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1793264.6666666667, ans=0.0 2023-10-14 19:48:36,947 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1793264.6666666667, ans=0.125 2023-10-14 19:49:11,630 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=12.0 2023-10-14 19:49:45,445 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1793404.6666666667, ans=0.05 2023-10-14 19:49:56,487 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.03 vs. limit=15.0 2023-10-14 19:49:57,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1793451.3333333333, ans=0.2 2023-10-14 19:50:44,042 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1793544.6666666667, ans=0.0 2023-10-14 19:51:14,719 INFO [train.py:1031] (0/4) Epoch 29, batch 2000, loss[loss=0.1803, simple_loss=0.279, pruned_loss=0.04079, over 16966.00 frames. ], tot_loss[loss=0.1849, simple_loss=0.2776, pruned_loss=0.04608, over 20752885.65 frames. ], batch size: 117, lr: 1.19e-03, grad_scale: 32.0
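
The scaling.py:199 entries that dominate this log each report a ScheduledFloat: a scalar regularization knob (dropout probabilities, skip rates, balancer bounds) whose value ans is a function of batch_count, so constraints can be strong early in training and relax later; by batch_count ≈ 1.79e6 most have settled at their final values (0.125, 0.2, 0.0, ...). A minimal piecewise-linear reconstruction follows; the interpolation form and the example breakpoints are assumptions.

```python
# Sketch of a batch-count-driven scheduled scalar like the ones reported by
# the 'ScheduledFloat: name=..., batch_count=..., ans=...' entries.
from bisect import bisect_right

class ScheduledFloatSketch:
    """Interpolates linearly between (batch_count, value) breakpoints and
    stays constant outside them."""

    def __init__(self, *points: tuple) -> None:
        self.xs = [p[0] for p in points]
        self.ys = [p[1] for p in points]

    def value(self, batch_count: float) -> float:
        i = bisect_right(self.xs, batch_count)
        if i == 0:
            return self.ys[0]
        if i == len(self.xs):
            return self.ys[-1]
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

# A skip-rate decaying 0.2 -> 0.0 early in training would long since sit at
# its final value by batch_count ~ 1.79e6:
skip_rate = ScheduledFloatSketch((0.0, 0.2), (4000.0, 0.05), (16000.0, 0.0))
print(skip_rate.value(1_793_000.0))  # -> 0.0
```
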
2023-10-14 19:51:24,056 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1793591.3333333333, ans=0.0 2023-10-14 19:52:22,310 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.584e+02 1.833e+02 2.026e+02 2.265e+02 2.973e+02, threshold=4.051e+02, percent-clipped=0.0 2023-10-14 19:52:41,171 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1793731.3333333333, ans=0.0 2023-10-14 19:53:35,426 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1793824.6666666667, ans=0.2 2023-10-14 19:53:39,459 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1793871.3333333333, ans=0.125 2023-10-14 19:54:13,220 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1793918.0, ans=0.125 2023-10-14 19:54:17,269 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1793918.0, ans=0.125 2023-10-14 19:54:21,103 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1793918.0, ans=0.125 2023-10-14 19:54:36,876 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1793964.6666666667, ans=0.125 2023-10-14 19:54:47,446 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1794011.3333333333, ans=0.1 2023-10-14 19:54:55,827 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.15 vs.
limit=15.0 2023-10-14 19:55:01,161 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1794011.3333333333, ans=0.0 2023-10-14 19:55:11,264 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1794058.0, ans=0.0 2023-10-14 19:55:13,845 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1794058.0, ans=0.2 2023-10-14 19:55:16,085 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1794058.0, ans=0.0 2023-10-14 19:55:56,934 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1794104.6666666667, ans=0.125 2023-10-14 19:56:36,602 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.483e+02 1.753e+02 2.006e+02 2.175e+02 3.076e+02, threshold=4.012e+02, percent-clipped=0.0 2023-10-14 19:56:58,135 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1794198.0, ans=0.0 2023-10-14 19:57:11,239 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1794244.6666666667, ans=0.125 2023-10-14 19:57:12,481 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1794244.6666666667, ans=0.2 2023-10-14 19:57:54,081 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1794338.0, ans=0.125 2023-10-14 19:59:03,419 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1794571.3333333333, ans=0.2 2023-10-14 19:59:12,625 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1794618.0, ans=0.125 2023-10-14 19:59:18,480 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1794618.0, ans=0.0 2023-10-14 19:59:18,981 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.519e+02 1.884e+02 2.106e+02 2.348e+02 3.386e+02, threshold=4.211e+02, percent-clipped=0.0 2023-10-14 19:59:50,117 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1794758.0, ans=0.0 2023-10-14 20:00:03,975 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1794851.3333333333, ans=0.0 2023-10-14 20:00:11,460 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1794851.3333333333, ans=0.0 2023-10-14 20:00:38,772 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1794991.3333333333, ans=0.2 2023-10-14 20:00:39,757 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1794991.3333333333, ans=0.0 2023-10-14 20:00:52,365 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1795038.0, ans=0.125 2023-10-14 20:01:08,180 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.641e+02 1.861e+02 2.071e+02 2.265e+02 3.523e+02, threshold=4.142e+02, percent-clipped=0.0
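
Each optim.py:471 entry summarizes recent gradient norms as five order statistics (min, lower quartile, median, upper quartile, max); in every entry the clipping threshold equals Clipping_scale times the logged median (2.0 * 2.071e+02 = 4.142e+02 just above), and percent-clipped is the fraction of batches whose norm exceeded it, 0.0 throughout this span, so clipping is effectively inactive this late in training. A sketch of that diagnostic; the window size and bookkeeping details are assumptions.

```python
# Sketch of median-based gradient-clipping diagnostics matching the
# 'Clipping_scale=2.0, grad-norm quartiles ... threshold=...' entries.
from collections import deque
import statistics

class GradNormClipper:
    def __init__(self, clipping_scale: float = 2.0, window: int = 128) -> None:
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)
        self.num_batches = 0
        self.num_clipped = 0

    def step(self, grad_norm: float) -> float:
        """Record one batch's grad norm; return the scale (<= 1.0) to apply."""
        self.norms.append(grad_norm)
        self.num_batches += 1
        threshold = self.clipping_scale * statistics.median(self.norms)
        if grad_norm > threshold:
            self.num_clipped += 1
            return threshold / grad_norm
        return 1.0

    def summary(self) -> str:
        q1, med, q3 = statistics.quantiles(self.norms, n=4)
        pct = 100.0 * self.num_clipped / max(self.num_batches, 1)
        return (f"grad-norm quartiles {min(self.norms):.3e} {q1:.3e} "
                f"{med:.3e} {q3:.3e} {max(self.norms):.3e}, "
                f"threshold={self.clipping_scale * med:.3e}, "
                f"percent-clipped={pct}")
```
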
2023-10-14 20:01:22,372 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1795178.0, ans=0.5 2023-10-14 20:01:28,536 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1795178.0, ans=0.125 2023-10-14 20:01:41,372 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.15 vs. limit=22.5 2023-10-14 20:01:49,802 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=3.098e-02 2023-10-14 20:02:03,294 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.63 vs. limit=10.0 2023-10-14 20:02:05,036 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1795318.0, ans=0.125 2023-10-14 20:02:21,117 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 20:02:28,812 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1795458.0, ans=0.0 2023-10-14 20:02:33,459 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1795458.0, ans=0.0 2023-10-14 20:02:38,715 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1795504.6666666667, ans=0.125 2023-10-14 20:02:55,226 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.20 vs. limit=15.0 2023-10-14 20:02:55,442 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.46 vs. limit=22.5 2023-10-14 20:02:58,113 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.608e+02 1.926e+02 2.110e+02 2.307e+02 3.229e+02, threshold=4.219e+02, percent-clipped=0.0 2023-10-14 20:02:59,592 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.96 vs. limit=15.0 2023-10-14 20:03:04,610 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1795598.0, ans=0.125 2023-10-14 20:03:33,072 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1795738.0, ans=0.0 2023-10-14 20:03:33,144 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1795738.0, ans=0.125 2023-10-14 20:03:51,179 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1795784.6666666667, ans=0.1 2023-10-14 20:04:10,035 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1795878.0, ans=0.125 2023-10-14 20:04:14,519 INFO [train.py:1031] (0/4) Epoch 29, batch 2500, loss[loss=0.2017, simple_loss=0.2986, pruned_loss=0.05239, over 16926.00 frames. ], tot_loss[loss=0.1849, simple_loss=0.2777, pruned_loss=0.04607, over 23436468.76 frames. ], batch size: 77, lr: 1.19e-03, grad_scale: 32.0
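
The scaling.py:1069 WithLoss entries track an auxiliary penalty attached to each self_attn_weights module. The logged loss-sum is almost always 0.000e+00, with occasional small excursions (loss-sum=3.098e-02 above), i.e. the constraint only produces gradient when the attention weights drift out of bounds. A hedged illustration of such a hinge-style auxiliary loss; the penalty form, the limit, and all names here are assumptions, not scaling.py's exact code.

```python
# Illustrative hinge-style auxiliary loss of the kind the 'WithLoss:
# name=..., loss-sum=...' entries suggest: zero while the activation stays
# in bounds, so most logged sums are exactly 0.000e+00.
import torch

class AttentionWeightPenalty(torch.nn.Module):
    def __init__(self, name: str, limit: float = 1.0) -> None:
        super().__init__()
        self.name = name
        self.limit = limit

    def forward(self, attn_weights: torch.Tensor) -> torch.Tensor:
        if self.training:
            # Nonzero only when weights exceed the limit.
            aux = (attn_weights.abs() - self.limit).clamp(min=0.0).sum()
            print(f"WithLoss: name={self.name}, loss-sum={aux.item():.3e}")
            # A real implementation would feed `aux` back into the training
            # loss (e.g. via a gradient hook); here it is only logged.
        return attn_weights
```
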
2023-10-14 20:04:16,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1795924.6666666667, ans=0.2 2023-10-14 20:04:44,748 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.618e+02 1.853e+02 1.981e+02 2.229e+02 2.879e+02, threshold=3.962e+02, percent-clipped=0.0 2023-10-14 20:05:10,140 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1796158.0, ans=0.125 2023-10-14 20:05:12,190 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.17 vs. limit=12.0 2023-10-14 20:05:13,985 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.38 vs. limit=15.0 2023-10-14 20:05:16,280 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 20:05:18,420 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.57 vs. limit=6.0 2023-10-14 20:05:23,473 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1796204.6666666667, ans=0.0 2023-10-14 20:05:23,743 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=7.98 vs. limit=12.0 2023-10-14 20:05:31,561 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1796251.3333333333, ans=0.2 2023-10-14 20:05:40,681 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1796298.0, ans=0.0 2023-10-14 20:05:45,822 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=11.94 vs. limit=22.5 2023-10-14 20:06:10,644 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1796438.0, ans=0.09899494936611666 2023-10-14 20:06:14,953 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1796438.0, ans=0.125 2023-10-14 20:06:19,125 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1796438.0, ans=0.125 2023-10-14 20:06:28,215 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.65 vs.
limit=15.0 2023-10-14 20:06:33,451 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.541e+02 1.894e+02 2.084e+02 2.345e+02 3.546e+02, threshold=4.169e+02, percent-clipped=0.0 2023-10-14 20:06:36,434 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1796531.3333333333, ans=0.125 2023-10-14 20:06:47,540 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1796578.0, ans=0.125 2023-10-14 20:06:54,412 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1796578.0, ans=0.0 2023-10-14 20:07:07,929 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1796671.3333333333, ans=0.0 2023-10-14 20:07:28,219 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.63 vs. limit=15.0 2023-10-14 20:07:29,541 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.21 vs. limit=6.0 2023-10-14 20:07:42,593 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1796811.3333333333, ans=0.0 2023-10-14 20:08:26,223 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.566e+02 1.844e+02 1.986e+02 2.209e+02 3.111e+02, threshold=3.971e+02, percent-clipped=0.0 2023-10-14 20:08:26,496 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1796998.0, ans=0.125 2023-10-14 20:08:31,566 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1796998.0, ans=0.025 2023-10-14 20:09:12,134 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1797138.0, ans=0.0 2023-10-14 20:09:12,462 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.03 vs. limit=15.0 2023-10-14 20:09:38,507 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1797231.3333333333, ans=0.125 2023-10-14 20:09:48,643 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1797278.0, ans=0.0 2023-10-14 20:09:55,302 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.73 vs. limit=15.0 2023-10-14 20:09:55,366 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.75 vs. limit=10.0 2023-10-14 20:09:59,055 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1797324.6666666667, ans=0.125 2023-10-14 20:10:15,217 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.13 vs. limit=12.0 2023-10-14 20:10:24,467 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.27 vs. 
limit=15.0 2023-10-14 20:10:24,751 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.468e+02 1.825e+02 2.008e+02 2.155e+02 3.095e+02, threshold=4.016e+02, percent-clipped=0.0 2023-10-14 20:10:30,794 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1797464.6666666667, ans=0.0 2023-10-14 20:10:43,406 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1797511.3333333333, ans=0.0 2023-10-14 20:11:01,868 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1797604.6666666667, ans=0.2 2023-10-14 20:11:13,080 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1797604.6666666667, ans=0.0 2023-10-14 20:11:24,389 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1797651.3333333333, ans=0.0 2023-10-14 20:11:25,226 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1797698.0, ans=0.125 2023-10-14 20:11:30,841 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1797698.0, ans=0.0 2023-10-14 20:11:40,050 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1797744.6666666667, ans=0.125 2023-10-14 20:11:59,047 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.50 vs. limit=22.5 2023-10-14 20:12:02,681 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1797838.0, ans=0.1 2023-10-14 20:12:25,662 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.524e+02 1.888e+02 2.037e+02 2.237e+02 3.185e+02, threshold=4.073e+02, percent-clipped=0.0 2023-10-14 20:12:45,759 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.76 vs. limit=15.0 2023-10-14 20:12:49,099 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1798024.6666666667, ans=0.0 2023-10-14 20:12:54,909 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1798024.6666666667, ans=0.125 2023-10-14 20:13:11,152 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.55 vs. limit=15.0 2023-10-14 20:13:13,908 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1798118.0, ans=0.125 2023-10-14 20:13:41,068 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1798258.0, ans=0.125 2023-10-14 20:13:41,641 INFO [train.py:1031] (0/4) Epoch 29, batch 3000, loss[loss=0.1907, simple_loss=0.2835, pruned_loss=0.04895, over 16941.00 frames. ], tot_loss[loss=0.1847, simple_loss=0.2771, pruned_loss=0.04613, over 25518502.12 frames. ], batch size: 123, lr: 1.19e-03, grad_scale: 16.0
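
grad_scale in the progress entries moves between 8.0, 16.0 and 32.0: the dynamic loss scale of fp16 mixed-precision training, which is halved when scaled gradients overflow and periodically doubled back. A sketch using PyTorch's standard GradScaler follows; the hyperparameter values shown are illustrative, not this recipe's exact settings.

```python
# Sketch of the fp16 dynamic loss scaling behind the 'grad_scale:
# 8.0/16.0/32.0' fields, using PyTorch's standard GradScaler.
import torch

scaler = torch.cuda.amp.GradScaler(
    growth_factor=2.0,     # doubling produces the observed 8 -> 16 -> 32 steps
    backoff_factor=0.5,    # halve the scale when scaled grads overflow
    growth_interval=2000,  # batches without overflow before growing again
)

def train_step(model, optimizer, features, loss_fn):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(features))
    scaler.scale(loss).backward()  # backward on the scaled loss
    scaler.step(optimizer)         # unscales grads; skips the step on overflow
    scaler.update()                # grow or back off the scale
    return loss.detach(), scaler.get_scale()
```
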
2023-10-14 20:14:15,684 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.517e+02 1.809e+02 1.925e+02 2.139e+02 2.753e+02, threshold=3.849e+02, percent-clipped=0.0 2023-10-14 20:14:35,905 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 20:14:44,391 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1798491.3333333333, ans=0.125 2023-10-14 20:14:46,768 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1798491.3333333333, ans=0.1 2023-10-14 20:14:49,598 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_abs, batch_count=1798538.0, ans=0.5 2023-10-14 20:14:51,710 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.34 vs. limit=15.0 2023-10-14 20:15:16,410 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1798631.3333333333, ans=0.125 2023-10-14 20:15:18,298 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1798631.3333333333, ans=0.0 2023-10-14 20:15:32,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1798678.0, ans=0.0 2023-10-14 20:15:42,110 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1798724.6666666667, ans=0.1 2023-10-14 20:15:45,169 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.20 vs.
limit=12.0 2023-10-14 20:15:54,600 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1798771.3333333333, ans=0.0 2023-10-14 20:16:10,690 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.548e+02 1.845e+02 2.043e+02 2.309e+02 3.390e+02, threshold=4.087e+02, percent-clipped=0.0 2023-10-14 20:16:15,396 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1798864.6666666667, ans=0.125 2023-10-14 20:16:23,853 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1798911.3333333333, ans=0.2 2023-10-14 20:16:23,897 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1798911.3333333333, ans=0.0 2023-10-14 20:16:36,886 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1798958.0, ans=0.1 2023-10-14 20:16:42,755 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1799004.6666666667, ans=0.0 2023-10-14 20:16:48,486 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1799004.6666666667, ans=0.2 2023-10-14 20:17:19,895 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1799144.6666666667, ans=0.0 2023-10-14 20:17:22,523 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1799144.6666666667, ans=0.09899494936611666 2023-10-14 20:17:24,425 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.86 vs. limit=8.0 2023-10-14 20:17:29,828 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=1799191.3333333333, ans=0.02 2023-10-14 20:17:33,435 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1799191.3333333333, ans=0.1 2023-10-14 20:18:07,235 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.541e+02 1.861e+02 2.022e+02 2.187e+02 2.766e+02, threshold=4.044e+02, percent-clipped=0.0 2023-10-14 20:18:31,000 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1799424.6666666667, ans=0.125 2023-10-14 20:18:33,892 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1799424.6666666667, ans=0.125 2023-10-14 20:18:35,701 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.37 vs. limit=15.0 2023-10-14 20:18:55,211 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1799518.0, ans=0.125 2023-10-14 20:18:57,876 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1799518.0, ans=0.125 2023-10-14 20:19:05,631 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.62 vs. 
limit=15.0 2023-10-14 20:19:22,138 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1799611.3333333333, ans=0.125 2023-10-14 20:19:25,255 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1799611.3333333333, ans=0.1 2023-10-14 20:19:38,487 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1799704.6666666667, ans=0.125 2023-10-14 20:19:43,304 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1799704.6666666667, ans=0.2 2023-10-14 20:19:43,352 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1799704.6666666667, ans=0.1 2023-10-14 20:19:50,557 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1799751.3333333333, ans=0.125 2023-10-14 20:19:52,829 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1799751.3333333333, ans=0.125 2023-10-14 20:19:58,413 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1799751.3333333333, ans=0.0 2023-10-14 20:19:58,428 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1799751.3333333333, ans=0.125 2023-10-14 20:20:01,447 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.600e+02 1.869e+02 2.018e+02 2.263e+02 3.119e+02, threshold=4.035e+02, percent-clipped=0.0 2023-10-14 20:20:06,888 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.31 vs. limit=15.0 2023-10-14 20:20:16,543 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.73 vs. 
limit=10.0 2023-10-14 20:20:29,266 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1799891.3333333333, ans=0.07 2023-10-14 20:20:29,369 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1799891.3333333333, ans=0.125 2023-10-14 20:20:32,076 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1799938.0, ans=0.125 2023-10-14 20:20:45,898 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1799984.6666666667, ans=0.0 2023-10-14 20:21:01,392 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1800031.3333333333, ans=0.2 2023-10-14 20:21:06,685 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1800031.3333333333, ans=0.07 2023-10-14 20:21:24,472 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1800124.6666666667, ans=0.0 2023-10-14 20:21:40,446 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1800171.3333333333, ans=0.0 2023-10-14 20:21:42,296 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1800218.0, ans=0.0 2023-10-14 20:21:49,106 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1800218.0, ans=0.125 2023-10-14 20:21:53,756 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.31 vs. limit=6.0 2023-10-14 20:21:56,771 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.599e+02 1.848e+02 2.086e+02 2.308e+02 3.288e+02, threshold=4.173e+02, percent-clipped=0.0 2023-10-14 20:21:57,167 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1800264.6666666667, ans=0.0 2023-10-14 20:22:10,321 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.31 vs. limit=12.0 2023-10-14 20:22:33,162 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1800404.6666666667, ans=0.125 2023-10-14 20:22:34,428 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.70 vs. limit=10.0 2023-10-14 20:22:35,162 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1800404.6666666667, ans=0.1 2023-10-14 20:22:47,640 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1800451.3333333333, ans=0.125 2023-10-14 20:22:53,829 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1800498.0, ans=0.125 2023-10-14 20:23:12,066 INFO [train.py:1031] (0/4) Epoch 29, batch 3500, loss[loss=0.1918, simple_loss=0.2881, pruned_loss=0.04777, over 16922.00 frames. ], tot_loss[loss=0.1847, simple_loss=0.277, pruned_loss=0.04623, over 27114708.74 frames. 
], batch size: 138, lr: 1.19e-03, grad_scale: 8.0 2023-10-14 20:23:28,413 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1800638.0, ans=0.125 2023-10-14 20:23:46,798 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.94 vs. limit=22.5 2023-10-14 20:23:47,075 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.663e+02 1.863e+02 2.043e+02 2.264e+02 3.629e+02, threshold=4.086e+02, percent-clipped=0.0 2023-10-14 20:23:59,858 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.03 vs. limit=15.0 2023-10-14 20:24:22,571 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1800871.3333333333, ans=0.125 2023-10-14 20:24:28,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1800871.3333333333, ans=0.125 2023-10-14 20:24:38,082 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1800918.0, ans=0.125 2023-10-14 20:24:51,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1800964.6666666667, ans=0.0 2023-10-14 20:24:55,310 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1801011.3333333333, ans=0.1 2023-10-14 20:25:10,019 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1801058.0, ans=0.0 2023-10-14 20:25:21,299 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1801104.6666666667, ans=0.125 2023-10-14 20:25:22,791 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1801104.6666666667, ans=0.2 2023-10-14 20:25:46,546 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.562e+02 1.888e+02 2.064e+02 2.407e+02 3.450e+02, threshold=4.128e+02, percent-clipped=0.0 2023-10-14 20:25:52,132 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1801198.0, ans=0.04949747468305833 2023-10-14 20:25:55,732 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1801244.6666666667, ans=0.07 2023-10-14 20:25:55,918 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.67 vs. limit=22.5 2023-10-14 20:26:17,346 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.65 vs. limit=12.0 2023-10-14 20:26:20,271 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.86 vs. 
limit=15.0 2023-10-14 20:26:25,665 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1801384.6666666667, ans=0.1 2023-10-14 20:27:08,221 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.97 vs. limit=15.0 2023-10-14 20:27:13,396 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1801571.3333333333, ans=0.1 2023-10-14 20:27:41,784 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.538e+02 1.780e+02 1.942e+02 2.129e+02 3.439e+02, threshold=3.884e+02, percent-clipped=0.0 2023-10-14 20:27:42,903 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1801664.6666666667, ans=0.0 2023-10-14 20:27:45,260 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1801664.6666666667, ans=0.1 2023-10-14 20:27:49,991 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1801711.3333333333, ans=0.125 2023-10-14 20:27:51,340 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1801711.3333333333, ans=15.0 2023-10-14 20:28:22,332 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1801851.3333333333, ans=0.125 2023-10-14 20:28:32,122 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=11.91 vs. limit=12.0 2023-10-14 20:28:40,600 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1801898.0, ans=10.0 2023-10-14 20:28:49,017 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1801944.6666666667, ans=0.09899494936611666 2023-10-14 20:28:55,405 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1801944.6666666667, ans=0.2 2023-10-14 20:28:57,643 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1801991.3333333333, ans=0.125 2023-10-14 20:29:18,331 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1802038.0, ans=0.0 2023-10-14 20:29:25,921 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.35 vs. 
limit=15.0 2023-10-14 20:29:31,634 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1802131.3333333333, ans=0.125 2023-10-14 20:29:34,738 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.542e+02 1.799e+02 1.973e+02 2.142e+02 3.160e+02, threshold=3.947e+02, percent-clipped=0.0 2023-10-14 20:29:44,981 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1802178.0, ans=0.2 2023-10-14 20:29:55,703 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1802224.6666666667, ans=0.2 2023-10-14 20:29:56,668 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.35 vs. limit=15.0 2023-10-14 20:29:59,239 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1802224.6666666667, ans=0.125 2023-10-14 20:30:05,377 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1802271.3333333333, ans=0.2 2023-10-14 20:30:54,110 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1802458.0, ans=0.1 2023-10-14 20:31:17,836 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1802551.3333333333, ans=0.125 2023-10-14 20:31:19,618 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1802598.0, ans=0.125 2023-10-14 20:31:22,860 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.610e+02 1.775e+02 1.963e+02 2.284e+02 3.123e+02, threshold=3.926e+02, percent-clipped=0.0 2023-10-14 20:31:27,584 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.29 vs. limit=15.0 2023-10-14 20:31:29,238 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1802644.6666666667, ans=0.2 2023-10-14 20:31:41,320 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1802691.3333333333, ans=0.2 2023-10-14 20:32:07,440 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1802784.6666666667, ans=0.125 2023-10-14 20:32:11,050 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1802784.6666666667, ans=0.125 2023-10-14 20:32:37,532 INFO [train.py:1031] (0/4) Epoch 29, batch 4000, loss[loss=0.1998, simple_loss=0.2956, pruned_loss=0.05198, over 16613.00 frames. ], tot_loss[loss=0.1846, simple_loss=0.2765, pruned_loss=0.04631, over 28357315.41 frames. ], batch size: 219, lr: 1.18e-03, grad_scale: 32.0
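
The learning rate decays smoothly with both batch index and epoch (1.19e-03 earlier in the epoch, 1.18e-03 by batch 4000 just above), consistent with the Eden-style schedule used by Zipformer recipes. A hedged sketch follows; treat the formula and the constants as assumptions of this note, chosen only to be consistent with the slow decay seen here.

```python
# Hedged sketch of an Eden-style learning-rate schedule; the formula and
# the default constants below are assumptions of this note.
def eden_lr(base_lr: float, batch: int, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 1.0) -> float:
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor

# Late in training the curve is nearly flat, matching the tiny change in the
# logged lr across a few thousand batches:
print(eden_lr(0.045, 384_000, 29.0))  # ~1.2e-03
print(eden_lr(0.045, 388_000, 29.0))  # marginally smaller
```
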
2023-10-14 20:32:51,976 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1802971.3333333333, ans=0.0 2023-10-14 20:33:04,613 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1803018.0, ans=0.125 2023-10-14 20:33:15,562 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.536e+02 1.884e+02 2.058e+02 2.269e+02 3.422e+02, threshold=4.116e+02, percent-clipped=0.0 2023-10-14 20:33:16,069 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1803064.6666666667, ans=0.0 2023-10-14 20:33:17,483 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1803064.6666666667, ans=0.125 2023-10-14 20:33:34,483 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1803158.0, ans=0.125 2023-10-14 20:33:36,462 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1803158.0, ans=0.0 2023-10-14 20:33:39,105 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1803158.0, ans=0.125 2023-10-14 20:34:40,533 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1803438.0, ans=0.125 2023-10-14 20:34:47,185 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1803438.0, ans=0.1 2023-10-14 20:34:48,490 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=1803438.0, ans=10.0 2023-10-14 20:34:48,510 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1803438.0, ans=0.0 2023-10-14 20:34:50,905 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1803438.0, ans=0.2 2023-10-14 20:34:50,949 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1803438.0, ans=0.125 2023-10-14 20:34:55,940 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1803484.6666666667, ans=0.125 2023-10-14 20:35:09,391 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.501e+02 1.932e+02 2.098e+02 2.393e+02 3.008e+02, threshold=4.196e+02, percent-clipped=0.0 2023-10-14 20:35:21,776 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1803578.0, ans=0.125 2023-10-14 20:35:32,269 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1803624.6666666667, ans=0.0 2023-10-14 20:35:50,781 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1803671.3333333333, ans=0.125 2023-10-14 20:36:01,920 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.53 vs.
limit=12.0 2023-10-14 20:36:12,888 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1803718.0, ans=0.07 2023-10-14 20:36:13,007 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=12.0 2023-10-14 20:36:21,722 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1803764.6666666667, ans=0.125 2023-10-14 20:36:22,877 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1803764.6666666667, ans=0.04949747468305833 2023-10-14 20:36:45,250 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1803858.0, ans=0.125 2023-10-14 20:36:52,856 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.28 vs. limit=15.0 2023-10-14 20:37:17,951 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1803998.0, ans=0.125 2023-10-14 20:37:18,569 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.546e+02 1.845e+02 2.013e+02 2.174e+02 3.241e+02, threshold=4.025e+02, percent-clipped=0.0 2023-10-14 20:37:26,200 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 20:37:35,097 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1804091.3333333333, ans=0.0 2023-10-14 20:37:38,452 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1804091.3333333333, ans=0.0 2023-10-14 20:37:39,223 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1804091.3333333333, ans=0.125 2023-10-14 20:37:40,303 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1804091.3333333333, ans=0.125 2023-10-14 20:37:58,830 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1804184.6666666667, ans=0.0 2023-10-14 20:38:03,940 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.51 vs. limit=22.5 2023-10-14 20:38:08,746 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1804231.3333333333, ans=0.0 2023-10-14 20:38:20,784 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1804278.0, ans=0.125 2023-10-14 20:38:23,985 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1804278.0, ans=0.125 2023-10-14 20:38:42,270 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.32 vs. limit=10.0 2023-10-14 20:38:53,519 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.63 vs. 
limit=6.0 2023-10-14 20:39:06,173 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.831e+02 1.976e+02 2.287e+02 2.865e+02, threshold=3.951e+02, percent-clipped=0.0 2023-10-14 20:39:22,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1804558.0, ans=0.125 2023-10-14 20:39:30,702 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1804558.0, ans=0.2 2023-10-14 20:40:42,129 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.49 vs. limit=10.0 2023-10-14 20:40:47,767 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1804884.6666666667, ans=0.05 2023-10-14 20:40:53,781 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.91 vs. limit=15.0 2023-10-14 20:41:05,236 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.651e+02 1.972e+02 2.127e+02 2.320e+02 3.128e+02, threshold=4.253e+02, percent-clipped=0.0 2023-10-14 20:41:31,528 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.90 vs. limit=12.0 2023-10-14 20:41:36,015 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1805024.6666666667, ans=0.1 2023-10-14 20:41:38,730 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.06 vs. limit=15.0 2023-10-14 20:41:53,513 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1805118.0, ans=0.125 2023-10-14 20:42:02,966 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.16 vs. limit=22.5 2023-10-14 20:42:07,093 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1805164.6666666667, ans=0.125 2023-10-14 20:42:21,694 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1805211.3333333333, ans=0.2 2023-10-14 20:42:24,034 INFO [train.py:1031] (0/4) Epoch 29, batch 4500, loss[loss=0.1738, simple_loss=0.2424, pruned_loss=0.05258, over 12501.00 frames. ], tot_loss[loss=0.1849, simple_loss=0.2772, pruned_loss=0.04624, over 29365288.86 frames. ], batch size: 440, lr: 1.18e-03, grad_scale: 32.0 2023-10-14 20:43:01,113 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.825e+02 1.986e+02 2.219e+02 2.817e+02, threshold=3.972e+02, percent-clipped=0.0 2023-10-14 20:43:10,130 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1805444.6666666667, ans=0.125 2023-10-14 20:43:18,664 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.78 vs. 
limit=15.0 2023-10-14 20:43:19,274 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1805491.3333333333, ans=0.125 2023-10-14 20:43:32,909 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1805538.0, ans=0.125 2023-10-14 20:43:35,215 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1805538.0, ans=0.125 2023-10-14 20:43:39,872 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1805584.6666666667, ans=0.1 2023-10-14 20:43:42,237 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.45 vs. limit=12.0 2023-10-14 20:43:46,649 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1805584.6666666667, ans=0.125 2023-10-14 20:43:58,361 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1805678.0, ans=0.125 2023-10-14 20:44:01,622 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1805678.0, ans=0.0 2023-10-14 20:44:13,193 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1805724.6666666667, ans=0.07 2023-10-14 20:44:22,605 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1805771.3333333333, ans=0.0 2023-10-14 20:44:44,835 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.590e+02 1.816e+02 1.994e+02 2.220e+02 3.314e+02, threshold=3.987e+02, percent-clipped=0.0 2023-10-14 20:45:18,293 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.98 vs. limit=12.0 2023-10-14 20:45:34,951 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1806098.0, ans=0.2 2023-10-14 20:45:51,376 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1806144.6666666667, ans=0.125 2023-10-14 20:45:54,638 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=6.39 vs. limit=15.0 2023-10-14 20:46:02,205 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.70 vs. limit=22.5 2023-10-14 20:46:20,272 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1806284.6666666667, ans=0.125 2023-10-14 20:46:31,177 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.574e+02 1.845e+02 2.040e+02 2.261e+02 3.021e+02, threshold=4.080e+02, percent-clipped=0.0 2023-10-14 20:46:46,374 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1806424.6666666667, ans=0.1 2023-10-14 20:46:56,328 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.07 vs. 
limit=15.0 2023-10-14 20:47:18,159 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.30 vs. limit=12.0 2023-10-14 20:47:58,467 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.01 vs. limit=15.0 2023-10-14 20:48:21,713 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.910e+02 2.032e+02 2.255e+02 3.187e+02, threshold=4.064e+02, percent-clipped=0.0 2023-10-14 20:49:19,306 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1807031.3333333333, ans=0.125 2023-10-14 20:49:23,377 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1807078.0, ans=0.0 2023-10-14 20:49:23,545 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.78 vs. limit=6.0 2023-10-14 20:49:24,613 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1807078.0, ans=0.125 2023-10-14 20:49:34,574 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1807124.6666666667, ans=0.125 2023-10-14 20:49:34,682 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1807124.6666666667, ans=0.125 2023-10-14 20:49:39,508 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.56 vs. limit=15.0 2023-10-14 20:49:44,624 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1807171.3333333333, ans=0.125 2023-10-14 20:50:05,204 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1807218.0, ans=0.2 2023-10-14 20:50:17,354 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.573e+02 1.848e+02 2.133e+02 2.322e+02 2.866e+02, threshold=4.266e+02, percent-clipped=0.0 2023-10-14 20:50:17,699 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1807264.6666666667, ans=0.0 2023-10-14 20:50:27,354 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1807311.3333333333, ans=0.1 2023-10-14 20:50:35,896 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1807358.0, ans=0.0 2023-10-14 20:50:36,142 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.23 vs. limit=15.0 2023-10-14 20:50:41,024 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_na.min_abs, batch_count=1807358.0, ans=0.02 2023-10-14 20:51:08,591 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.98 vs. limit=15.0 2023-10-14 20:51:23,727 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.07 vs. 
limit=15.0 2023-10-14 20:51:25,909 INFO [train.py:1031] (0/4) Epoch 29, batch 5000, loss[loss=0.2004, simple_loss=0.291, pruned_loss=0.05487, over 16944.00 frames. ], tot_loss[loss=0.1847, simple_loss=0.2769, pruned_loss=0.04629, over 30127058.77 frames. ], batch size: 72, lr: 1.18e-03, grad_scale: 32.0 2023-10-14 20:51:41,262 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1807638.0, ans=0.125 2023-10-14 20:52:07,505 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.96 vs. limit=6.0 2023-10-14 20:52:07,802 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.501e+02 1.847e+02 2.026e+02 2.209e+02 3.172e+02, threshold=4.053e+02, percent-clipped=0.0 2023-10-14 20:52:09,123 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1807731.3333333333, ans=0.0 2023-10-14 20:52:13,243 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 20:52:44,123 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1807871.3333333333, ans=0.125 2023-10-14 20:52:54,169 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1807918.0, ans=0.1 2023-10-14 20:53:21,210 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1808011.3333333333, ans=0.125 2023-10-14 20:53:24,482 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1808058.0, ans=0.125 2023-10-14 20:53:25,577 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1808058.0, ans=0.125 2023-10-14 20:53:29,973 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1808058.0, ans=0.125 2023-10-14 20:54:00,550 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.547e+02 1.819e+02 1.968e+02 2.171e+02 3.032e+02, threshold=3.935e+02, percent-clipped=0.0 2023-10-14 20:54:04,585 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.16 vs. limit=22.5 2023-10-14 20:54:13,889 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1808244.6666666667, ans=0.2 2023-10-14 20:54:17,919 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1808291.3333333333, ans=0.0 2023-10-14 20:54:30,830 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.69 vs. 
limit=15.0 2023-10-14 20:54:38,814 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1808384.6666666667, ans=0.07 2023-10-14 20:54:51,914 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1808431.3333333333, ans=0.1 2023-10-14 20:55:02,154 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1808478.0, ans=0.1 2023-10-14 20:55:08,105 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1808478.0, ans=0.125 2023-10-14 20:55:12,595 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1808524.6666666667, ans=0.125 2023-10-14 20:55:12,668 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1808524.6666666667, ans=0.0 2023-10-14 20:55:32,778 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.64 vs. limit=15.0 2023-10-14 20:55:39,740 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1808618.0, ans=0.125 2023-10-14 20:55:48,015 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.579e+02 1.841e+02 2.072e+02 2.300e+02 3.120e+02, threshold=4.145e+02, percent-clipped=0.0 2023-10-14 20:55:53,515 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.72 vs. limit=15.0 2023-10-14 20:56:08,642 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1808758.0, ans=0.1 2023-10-14 20:56:35,086 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1808851.3333333333, ans=0.125 2023-10-14 20:56:35,092 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1808851.3333333333, ans=0.125 2023-10-14 20:56:37,583 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.31 vs. limit=15.0 2023-10-14 20:56:49,473 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1808944.6666666667, ans=0.125 2023-10-14 20:57:16,758 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1809038.0, ans=0.125 2023-10-14 20:57:24,425 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.01 vs. 
limit=6.0 2023-10-14 20:57:36,432 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1809131.3333333333, ans=0.09899494936611666 2023-10-14 20:57:45,846 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.550e+02 1.802e+02 1.981e+02 2.214e+02 2.648e+02, threshold=3.962e+02, percent-clipped=0.0 2023-10-14 20:57:56,435 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1809178.0, ans=0.125 2023-10-14 20:58:10,804 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1809224.6666666667, ans=0.125 2023-10-14 20:58:25,254 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1809318.0, ans=0.09899494936611666 2023-10-14 20:58:26,489 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.18 vs. limit=15.0 2023-10-14 20:58:31,103 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1809318.0, ans=0.0 2023-10-14 20:58:43,451 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1809411.3333333333, ans=0.1 2023-10-14 20:58:49,832 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.08 vs. limit=15.0 2023-10-14 20:59:00,448 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1809458.0, ans=0.125 2023-10-14 20:59:20,743 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1809551.3333333333, ans=0.1 2023-10-14 20:59:20,754 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1809551.3333333333, ans=0.0 2023-10-14 20:59:25,716 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1809598.0, ans=0.125 2023-10-14 20:59:28,654 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1809598.0, ans=0.125 2023-10-14 20:59:32,719 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.551e+02 1.747e+02 1.887e+02 2.056e+02 3.223e+02, threshold=3.775e+02, percent-clipped=0.0 2023-10-14 20:59:38,249 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1809644.6666666667, ans=0.125 2023-10-14 20:59:50,180 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1809691.3333333333, ans=0.0 2023-10-14 20:59:51,163 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1809691.3333333333, ans=0.125 2023-10-14 21:00:06,754 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=6.59 vs. 
limit=15.0 2023-10-14 21:00:26,670 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1809831.3333333333, ans=0.05 2023-10-14 21:00:32,665 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1809878.0, ans=0.1 2023-10-14 21:00:33,790 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.81 vs. limit=15.0 2023-10-14 21:00:35,343 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1809878.0, ans=0.0 2023-10-14 21:00:40,716 INFO [train.py:1031] (0/4) Epoch 29, batch 5500, loss[loss=0.191, simple_loss=0.2795, pruned_loss=0.05129, over 16590.00 frames. ], tot_loss[loss=0.1845, simple_loss=0.2768, pruned_loss=0.04612, over 30748118.10 frames. ], batch size: 61, lr: 1.18e-03, grad_scale: 16.0 2023-10-14 21:01:08,575 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1810018.0, ans=0.125 2023-10-14 21:01:17,516 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.533e+02 1.834e+02 1.953e+02 2.151e+02 2.674e+02, threshold=3.906e+02, percent-clipped=0.0 2023-10-14 21:01:22,026 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1810111.3333333333, ans=15.0 2023-10-14 21:01:41,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1810158.0, ans=0.125 2023-10-14 21:01:53,474 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1810251.3333333333, ans=0.125 2023-10-14 21:01:54,376 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1810251.3333333333, ans=0.1 2023-10-14 21:02:00,680 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1810251.3333333333, ans=0.0 2023-10-14 21:02:15,680 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 21:02:20,216 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1810344.6666666667, ans=0.2 2023-10-14 21:02:33,254 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1810391.3333333333, ans=0.2 2023-10-14 21:03:00,576 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1810531.3333333333, ans=0.0 2023-10-14 21:03:02,382 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1810531.3333333333, ans=0.2 2023-10-14 21:03:03,102 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.635e+02 1.810e+02 1.992e+02 2.250e+02 3.650e+02, threshold=3.984e+02, percent-clipped=0.0 2023-10-14 21:03:03,422 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1810531.3333333333, ans=0.125 2023-10-14 21:03:22,095 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1810624.6666666667, 
ans=0.04949747468305833 2023-10-14 21:03:39,892 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.89 vs. limit=15.0 2023-10-14 21:04:04,568 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.97 vs. limit=15.0 2023-10-14 21:04:43,721 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1810951.3333333333, ans=0.0 2023-10-14 21:04:52,067 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1810998.0, ans=0.125 2023-10-14 21:04:57,839 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.602e+02 1.873e+02 2.032e+02 2.299e+02 3.087e+02, threshold=4.064e+02, percent-clipped=0.0 2023-10-14 21:05:00,104 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1811044.6666666667, ans=0.125 2023-10-14 21:05:02,512 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1811044.6666666667, ans=0.125 2023-10-14 21:05:05,708 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.83 vs. limit=15.0 2023-10-14 21:05:08,254 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1811044.6666666667, ans=0.125 2023-10-14 21:05:09,229 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1811044.6666666667, ans=0.125 2023-10-14 21:05:13,771 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1811091.3333333333, ans=0.125 2023-10-14 21:05:13,899 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1811091.3333333333, ans=0.025 2023-10-14 21:05:28,187 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1.whitening_limit, batch_count=1811138.0, ans=10.0 2023-10-14 21:05:29,863 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1811138.0, ans=0.05 2023-10-14 21:05:34,356 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1811184.6666666667, ans=0.125 2023-10-14 21:05:38,323 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1811184.6666666667, ans=0.2 2023-10-14 21:05:39,726 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.16 vs. limit=15.0 2023-10-14 21:05:40,799 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=15.09 vs. 
limit=22.5 2023-10-14 21:05:52,242 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1811231.3333333333, ans=0.1 2023-10-14 21:05:55,050 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.31 vs. limit=10.0 2023-10-14 21:06:33,941 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1811418.0, ans=0.0 2023-10-14 21:06:49,169 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.801e+02 1.999e+02 2.250e+02 3.629e+02, threshold=3.999e+02, percent-clipped=0.0 2023-10-14 21:06:56,553 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1811511.3333333333, ans=0.125 2023-10-14 21:07:05,423 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1811558.0, ans=0.05 2023-10-14 21:07:14,295 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1811558.0, ans=0.125 2023-10-14 21:07:21,087 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1811604.6666666667, ans=0.0 2023-10-14 21:07:37,582 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1811698.0, ans=0.125 2023-10-14 21:07:43,469 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1811698.0, ans=0.1 2023-10-14 21:07:46,251 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1811698.0, ans=0.1 2023-10-14 21:08:00,077 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.87 vs. limit=6.0 2023-10-14 21:08:18,505 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1811838.0, ans=0.125 2023-10-14 21:08:23,367 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1811884.6666666667, ans=0.0 2023-10-14 21:08:26,294 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.51 vs. limit=6.0 2023-10-14 21:08:32,138 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1811884.6666666667, ans=0.125 2023-10-14 21:08:36,478 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.90 vs. 
limit=10.0 2023-10-14 21:08:42,088 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.490e+02 1.866e+02 2.032e+02 2.352e+02 4.379e+02, threshold=4.064e+02, percent-clipped=1.0 2023-10-14 21:08:45,061 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1811978.0, ans=0.125 2023-10-14 21:09:08,810 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1812071.3333333333, ans=0.1 2023-10-14 21:09:12,273 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.77 vs. limit=15.0 2023-10-14 21:09:35,906 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1812164.6666666667, ans=0.2 2023-10-14 21:09:45,976 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1812211.3333333333, ans=0.125 2023-10-14 21:09:49,434 INFO [train.py:1031] (0/4) Epoch 29, batch 6000, loss[loss=0.1703, simple_loss=0.2715, pruned_loss=0.0345, over 16938.00 frames. ], tot_loss[loss=0.1849, simple_loss=0.2771, pruned_loss=0.04637, over 31193147.12 frames. ], batch size: 104, lr: 1.18e-03, grad_scale: 32.0 2023-10-14 21:10:06,265 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1812304.6666666667, ans=0.0 2023-10-14 21:10:21,018 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1812398.0, ans=0.125 2023-10-14 21:10:29,669 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.585e+02 1.877e+02 1.996e+02 2.206e+02 2.771e+02, threshold=3.992e+02, percent-clipped=0.0 2023-10-14 21:10:29,961 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1812398.0, ans=0.125 2023-10-14 21:10:54,063 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.61 vs. limit=15.0 2023-10-14 21:11:03,582 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1812584.6666666667, ans=0.125 2023-10-14 21:11:10,845 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1812584.6666666667, ans=0.125 2023-10-14 21:11:15,517 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=4.97 vs. limit=15.0 2023-10-14 21:11:32,327 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1812678.0, ans=0.1 2023-10-14 21:11:37,746 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.34 vs. limit=15.0 2023-10-14 21:12:06,985 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.65 vs. 
limit=15.0 2023-10-14 21:12:14,034 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1812864.6666666667, ans=0.125 2023-10-14 21:12:15,893 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.571e+02 1.819e+02 1.970e+02 2.194e+02 2.752e+02, threshold=3.939e+02, percent-clipped=0.0 2023-10-14 21:13:09,628 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1813098.0, ans=0.2 2023-10-14 21:13:09,685 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1813098.0, ans=0.125 2023-10-14 21:13:31,808 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1813191.3333333333, ans=0.125 2023-10-14 21:13:39,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1813238.0, ans=0.125 2023-10-14 21:13:40,613 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1813238.0, ans=0.125 2023-10-14 21:14:03,499 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1813331.3333333333, ans=0.0 2023-10-14 21:14:04,110 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.598e+02 1.968e+02 2.138e+02 2.510e+02 4.092e+02, threshold=4.276e+02, percent-clipped=1.0 2023-10-14 21:14:05,788 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1813331.3333333333, ans=0.0 2023-10-14 21:14:11,199 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1813378.0, ans=0.125 2023-10-14 21:14:13,105 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1813378.0, ans=0.125 2023-10-14 21:14:21,657 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1813424.6666666667, ans=0.1 2023-10-14 21:14:25,833 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1813424.6666666667, ans=0.0 2023-10-14 21:14:31,301 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1813471.3333333333, ans=0.0 2023-10-14 21:14:34,604 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.74 vs. limit=15.0 2023-10-14 21:14:37,936 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.01 vs. 
limit=10.0 2023-10-14 21:14:41,736 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1813518.0, ans=10.0 2023-10-14 21:14:56,631 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1813564.6666666667, ans=10.0 2023-10-14 21:15:23,825 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1813658.0, ans=0.125 2023-10-14 21:15:34,775 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1813704.6666666667, ans=0.1 2023-10-14 21:15:50,280 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1813798.0, ans=0.0 2023-10-14 21:15:56,981 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.656e+02 1.852e+02 2.035e+02 2.232e+02 3.095e+02, threshold=4.070e+02, percent-clipped=0.0 2023-10-14 21:15:58,521 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1813798.0, ans=0.125 2023-10-14 21:16:02,222 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1813844.6666666667, ans=0.125 2023-10-14 21:16:10,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1813891.3333333333, ans=0.1 2023-10-14 21:16:18,289 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1813891.3333333333, ans=0.0 2023-10-14 21:16:18,315 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1813891.3333333333, ans=0.0 2023-10-14 21:16:27,870 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.66 vs. limit=15.0 2023-10-14 21:16:28,970 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1813938.0, ans=0.07 2023-10-14 21:16:46,917 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0 2023-10-14 21:16:53,149 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1814031.3333333333, ans=0.125 2023-10-14 21:17:27,314 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 21:17:57,066 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.595e+02 1.851e+02 2.053e+02 2.310e+02 3.189e+02, threshold=4.107e+02, percent-clipped=0.0 2023-10-14 21:18:04,209 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.41 vs. limit=15.0 2023-10-14 21:18:27,123 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 21:19:08,662 INFO [train.py:1031] (0/4) Epoch 29, batch 6500, loss[loss=0.195, simple_loss=0.2796, pruned_loss=0.05515, over 15984.00 frames. 
], tot_loss[loss=0.1853, simple_loss=0.2775, pruned_loss=0.04657, over 31528267.95 frames. ], batch size: 43, lr: 1.18e-03, grad_scale: 32.0 2023-10-14 21:19:34,284 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1814684.6666666667, ans=0.0 2023-10-14 21:19:35,301 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1814684.6666666667, ans=0.125 2023-10-14 21:20:01,152 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.902e+02 2.090e+02 2.254e+02 4.148e+02, threshold=4.179e+02, percent-clipped=1.0 2023-10-14 21:20:06,005 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=1.552e-02 2023-10-14 21:20:25,912 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1814871.3333333333, ans=0.2 2023-10-14 21:20:36,673 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1814918.0, ans=0.2 2023-10-14 21:21:01,356 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1815011.3333333333, ans=0.1 2023-10-14 21:21:19,084 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1815058.0, ans=0.125 2023-10-14 21:21:20,748 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.70 vs. limit=15.0 2023-10-14 21:21:43,384 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.99 vs. limit=15.0 2023-10-14 21:21:51,840 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.545e+02 1.881e+02 2.032e+02 2.279e+02 3.221e+02, threshold=4.065e+02, percent-clipped=0.0 2023-10-14 21:22:05,395 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1815291.3333333333, ans=0.1 2023-10-14 21:22:11,421 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1815291.3333333333, ans=0.0 2023-10-14 21:22:14,673 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1815338.0, ans=0.0 2023-10-14 21:22:16,570 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1815338.0, ans=0.0 2023-10-14 21:23:07,424 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1815571.3333333333, ans=0.125 2023-10-14 21:23:40,233 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.70 vs. limit=6.0 2023-10-14 21:23:41,723 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.521e+02 1.810e+02 1.936e+02 2.218e+02 2.692e+02, threshold=3.873e+02, percent-clipped=0.0 2023-10-14 21:23:43,663 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.37 vs. 
limit=15.0 2023-10-14 21:23:47,823 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1815711.3333333333, ans=0.125 2023-10-14 21:24:00,748 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1815758.0, ans=0.0 2023-10-14 21:24:10,341 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1815804.6666666667, ans=0.125 2023-10-14 21:24:17,374 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1815851.3333333333, ans=0.2 2023-10-14 21:24:24,791 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1815851.3333333333, ans=0.0 2023-10-14 21:24:34,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1815898.0, ans=0.1 2023-10-14 21:25:44,908 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1816131.3333333333, ans=6.0 2023-10-14 21:25:47,180 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.555e+02 1.791e+02 1.994e+02 2.227e+02 4.307e+02, threshold=3.988e+02, percent-clipped=1.0 2023-10-14 21:25:47,476 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1816178.0, ans=0.1 2023-10-14 21:25:59,136 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1816178.0, ans=0.1 2023-10-14 21:26:02,405 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1816224.6666666667, ans=0.125 2023-10-14 21:26:02,618 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.44 vs. limit=22.5 2023-10-14 21:26:09,085 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.37 vs. limit=6.0 2023-10-14 21:26:09,604 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1816224.6666666667, ans=0.0 2023-10-14 21:26:10,583 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1816224.6666666667, ans=0.125 2023-10-14 21:26:18,129 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1816271.3333333333, ans=0.125 2023-10-14 21:26:26,303 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1816318.0, ans=0.1 2023-10-14 21:26:33,580 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1816318.0, ans=0.0 2023-10-14 21:26:44,574 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.08 vs. 
limit=15.0 2023-10-14 21:27:06,399 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1816458.0, ans=0.0 2023-10-14 21:27:09,693 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1816504.6666666667, ans=0.125 2023-10-14 21:27:16,172 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=1816504.6666666667, ans=22.5 2023-10-14 21:27:21,309 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1816551.3333333333, ans=0.125 2023-10-14 21:27:22,172 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1816551.3333333333, ans=0.125 2023-10-14 21:27:22,609 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1816551.3333333333, ans=15.0 2023-10-14 21:27:23,256 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1816551.3333333333, ans=0.125 2023-10-14 21:27:33,707 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.59 vs. limit=22.5 2023-10-14 21:27:35,537 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.77 vs. limit=10.0 2023-10-14 21:27:39,202 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.542e+02 1.845e+02 2.016e+02 2.288e+02 2.929e+02, threshold=4.033e+02, percent-clipped=0.0 2023-10-14 21:27:55,797 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1816691.3333333333, ans=0.125 2023-10-14 21:27:56,159 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=3.79 vs. limit=15.0 2023-10-14 21:28:00,193 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1816738.0, ans=0.0 2023-10-14 21:28:01,859 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1816738.0, ans=0.125 2023-10-14 21:28:02,909 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.40 vs. limit=15.0 2023-10-14 21:28:10,741 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1816784.6666666667, ans=0.1 2023-10-14 21:28:42,097 INFO [train.py:1031] (0/4) Epoch 29, batch 7000, loss[loss=0.1772, simple_loss=0.275, pruned_loss=0.03973, over 16275.00 frames. ], tot_loss[loss=0.1853, simple_loss=0.2778, pruned_loss=0.04637, over 31828684.47 frames. ], batch size: 50, lr: 1.18e-03, grad_scale: 16.0 2023-10-14 21:29:19,194 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1817064.6666666667, ans=0.125 2023-10-14 21:29:21,038 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.63 vs. 
limit=15.0 2023-10-14 21:29:24,889 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.91 vs. limit=22.5 2023-10-14 21:29:27,524 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.611e+02 1.869e+02 2.054e+02 2.300e+02 3.218e+02, threshold=4.108e+02, percent-clipped=0.0 2023-10-14 21:29:29,346 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1817111.3333333333, ans=0.1 2023-10-14 21:29:31,072 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1817111.3333333333, ans=0.0 2023-10-14 21:29:32,354 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.14 vs. limit=12.0 2023-10-14 21:29:36,510 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1817111.3333333333, ans=0.1 2023-10-14 21:29:51,274 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1817204.6666666667, ans=0.1 2023-10-14 21:29:59,163 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1817251.3333333333, ans=0.2 2023-10-14 21:30:27,778 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1817344.6666666667, ans=0.0 2023-10-14 21:30:47,573 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1817438.0, ans=0.0 2023-10-14 21:30:49,737 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1817438.0, ans=0.125 2023-10-14 21:30:59,184 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys.whitening_limit, batch_count=1817484.6666666667, ans=6.0 2023-10-14 21:31:05,861 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1817531.3333333333, ans=10.0 2023-10-14 21:31:11,681 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1817531.3333333333, ans=0.125 2023-10-14 21:31:15,853 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.632e+02 1.872e+02 2.048e+02 2.366e+02 3.478e+02, threshold=4.095e+02, percent-clipped=0.0 2023-10-14 21:31:38,178 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1817671.3333333333, ans=0.2 2023-10-14 21:31:41,830 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1817671.3333333333, ans=0.95 2023-10-14 21:31:42,087 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.99 vs. 
limit=15.0 2023-10-14 21:31:49,488 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1817718.0, ans=0.1 2023-10-14 21:32:17,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1817858.0, ans=0.1 2023-10-14 21:32:51,610 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1817951.3333333333, ans=0.125 2023-10-14 21:32:58,446 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=1817951.3333333333, ans=0.1 2023-10-14 21:33:03,069 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.21 vs. limit=6.0 2023-10-14 21:33:04,705 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1817998.0, ans=0.025 2023-10-14 21:33:09,415 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1817998.0, ans=0.125 2023-10-14 21:33:17,381 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.593e+02 1.779e+02 1.932e+02 2.101e+02 2.782e+02, threshold=3.864e+02, percent-clipped=0.0 2023-10-14 21:33:25,106 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.28 vs. limit=15.0 2023-10-14 21:33:34,506 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.75 vs. limit=15.0 2023-10-14 21:34:00,788 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1818184.6666666667, ans=0.1 2023-10-14 21:34:07,770 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1818231.3333333333, ans=0.0 2023-10-14 21:34:29,177 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1818324.6666666667, ans=0.125 2023-10-14 21:34:56,045 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1818418.0, ans=0.0 2023-10-14 21:35:12,049 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.524e+02 1.797e+02 1.929e+02 2.183e+02 2.836e+02, threshold=3.858e+02, percent-clipped=0.0 2023-10-14 21:35:12,721 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1818511.3333333333, ans=0.1 2023-10-14 21:35:18,016 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1818511.3333333333, ans=0.1 2023-10-14 21:35:50,191 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1818651.3333333333, ans=0.125 2023-10-14 21:35:54,619 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.25 vs. 
limit=22.5 2023-10-14 21:36:11,278 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1818744.6666666667, ans=0.125 2023-10-14 21:36:15,887 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1818744.6666666667, ans=0.2 2023-10-14 21:36:16,287 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.24 vs. limit=22.5 2023-10-14 21:36:35,081 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1818838.0, ans=0.125 2023-10-14 21:36:36,420 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=1818838.0, ans=0.05 2023-10-14 21:36:40,265 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1818838.0, ans=0.0 2023-10-14 21:36:41,156 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1818884.6666666667, ans=0.07 2023-10-14 21:36:41,454 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.11 vs. limit=15.0 2023-10-14 21:36:45,361 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1818884.6666666667, ans=0.0 2023-10-14 21:37:04,906 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.878e+02 2.109e+02 2.518e+02 3.727e+02, threshold=4.218e+02, percent-clipped=0.0 2023-10-14 21:37:15,683 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1819024.6666666667, ans=0.125 2023-10-14 21:37:25,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1819071.3333333333, ans=0.0 2023-10-14 21:37:40,090 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1819118.0, ans=0.2 2023-10-14 21:37:43,859 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1819118.0, ans=0.0 2023-10-14 21:37:54,301 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1819164.6666666667, ans=0.2 2023-10-14 21:38:09,187 INFO [train.py:1031] (0/4) Epoch 29, batch 7500, loss[loss=0.1713, simple_loss=0.2692, pruned_loss=0.03676, over 16871.00 frames. ], tot_loss[loss=0.1854, simple_loss=0.2778, pruned_loss=0.04654, over 32037727.61 frames. ], batch size: 104, lr: 1.18e-03, grad_scale: 16.0 2023-10-14 21:38:10,712 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.95 vs. limit=6.0 2023-10-14 21:38:16,664 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.40 vs. 
limit=22.5 2023-10-14 21:38:43,944 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1819398.0, ans=0.125 2023-10-14 21:38:53,429 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.878e+02 2.070e+02 2.311e+02 3.269e+02, threshold=4.140e+02, percent-clipped=0.0 2023-10-14 21:38:59,769 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.35 vs. limit=10.0 2023-10-14 21:39:01,207 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1819444.6666666667, ans=0.125 2023-10-14 21:39:04,460 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1819444.6666666667, ans=0.125 2023-10-14 21:39:16,344 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1819538.0, ans=0.125 2023-10-14 21:39:46,298 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.95 vs. limit=12.0 2023-10-14 21:39:50,177 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1819678.0, ans=0.04949747468305833 2023-10-14 21:39:59,564 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=15.0 2023-10-14 21:40:09,868 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 21:40:34,679 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.90 vs. limit=15.0 2023-10-14 21:40:41,729 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1819864.6666666667, ans=0.125 2023-10-14 21:40:57,176 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.488e+02 1.819e+02 1.912e+02 2.113e+02 3.176e+02, threshold=3.823e+02, percent-clipped=0.0 2023-10-14 21:41:22,678 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.08 vs. 
limit=15.0 2023-10-14 21:41:30,346 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1820051.3333333333, ans=0.125 2023-10-14 21:41:30,386 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1820051.3333333333, ans=0.125 2023-10-14 21:41:33,268 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1820051.3333333333, ans=0.125 2023-10-14 21:41:33,362 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1820051.3333333333, ans=0.1 2023-10-14 21:41:38,688 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1820051.3333333333, ans=0.125 2023-10-14 21:41:47,715 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1820098.0, ans=0.125 2023-10-14 21:42:07,560 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1820191.3333333333, ans=0.125 2023-10-14 21:42:07,562 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1820191.3333333333, ans=0.125 2023-10-14 21:42:07,910 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.19 vs. limit=15.0 2023-10-14 21:42:22,428 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.06 vs. limit=15.0 2023-10-14 21:42:44,646 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1820331.3333333333, ans=0.0 2023-10-14 21:42:45,660 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1820331.3333333333, ans=0.0 2023-10-14 21:42:49,017 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.819e+02 2.007e+02 2.226e+02 3.209e+02, threshold=4.014e+02, percent-clipped=0.0 2023-10-14 21:43:12,568 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1820471.3333333333, ans=0.0 2023-10-14 21:43:15,811 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1820471.3333333333, ans=0.1 2023-10-14 21:43:18,649 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1820471.3333333333, ans=0.125 2023-10-14 21:43:37,002 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.09 vs. 
limit=6.0 2023-10-14 21:43:37,576 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 21:43:45,750 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1820611.3333333333, ans=0.0 2023-10-14 21:43:51,862 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=1820658.0, ans=0.05 2023-10-14 21:43:56,887 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1820658.0, ans=0.125 2023-10-14 21:44:15,358 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1820751.3333333333, ans=0.5 2023-10-14 21:44:21,563 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.18 vs. limit=15.0 2023-10-14 21:44:43,003 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1820844.6666666667, ans=10.0 2023-10-14 21:44:43,589 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.626e+02 1.923e+02 2.075e+02 2.342e+02 3.631e+02, threshold=4.150e+02, percent-clipped=0.0 2023-10-14 21:44:46,055 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1820844.6666666667, ans=0.125 2023-10-14 21:44:52,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1820891.3333333333, ans=0.125 2023-10-14 21:44:57,708 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1820891.3333333333, ans=0.125 2023-10-14 21:45:33,211 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1821031.3333333333, ans=0.125 2023-10-14 21:45:37,389 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1821078.0, ans=0.125 2023-10-14 21:46:07,389 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1821171.3333333333, ans=0.1 2023-10-14 21:46:14,184 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1821218.0, ans=0.125 2023-10-14 21:46:24,182 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1821264.6666666667, ans=0.125 2023-10-14 21:46:32,534 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0 2023-10-14 21:46:37,806 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.530e+02 1.775e+02 1.901e+02 2.129e+02 2.591e+02, threshold=3.801e+02, percent-clipped=0.0 2023-10-14 21:47:04,849 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1821404.6666666667, ans=0.1 2023-10-14 21:47:43,709 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.30 vs. 
limit=15.0 2023-10-14 21:47:45,750 INFO [train.py:1031] (0/4) Epoch 29, batch 8000, loss[loss=0.1827, simple_loss=0.2794, pruned_loss=0.04303, over 16614.00 frames. ], tot_loss[loss=0.1846, simple_loss=0.2773, pruned_loss=0.04594, over 32229065.97 frames. ], batch size: 66, lr: 1.18e-03, grad_scale: 16.0 2023-10-14 21:48:29,460 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1821778.0, ans=0.125 2023-10-14 21:48:29,931 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.513e+02 1.760e+02 1.932e+02 2.230e+02 3.383e+02, threshold=3.864e+02, percent-clipped=0.0 2023-10-14 21:48:44,443 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1821824.6666666667, ans=0.125 2023-10-14 21:48:53,994 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1821871.3333333333, ans=0.125 2023-10-14 21:49:03,700 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.66 vs. limit=6.0 2023-10-14 21:49:27,642 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1822011.3333333333, ans=0.05 2023-10-14 21:49:55,041 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1822151.3333333333, ans=0.125 2023-10-14 21:50:05,626 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1822198.0, ans=0.125 2023-10-14 21:50:07,427 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1822198.0, ans=0.2 2023-10-14 21:50:07,435 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1822198.0, ans=0.0 2023-10-14 21:50:07,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1822198.0, ans=0.125 2023-10-14 21:50:14,470 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1822244.6666666667, ans=0.125 2023-10-14 21:50:16,683 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.610e+02 1.810e+02 1.954e+02 2.172e+02 2.963e+02, threshold=3.908e+02, percent-clipped=0.0 2023-10-14 21:50:17,053 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1822244.6666666667, ans=0.95 2023-10-14 21:50:39,981 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.45 vs. 
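The train.py:1031 summaries report three numbers per batch, and across this section they satisfy loss = 0.5 x simple_loss + pruned_loss: the displayed loss is a weighted combination of the simple (unpruned) and pruned transducer losses. A quick check against the batch 8000 totals above:

# Check the loss composition against the logged Epoch 29, batch 8000 totals:
# tot_loss[loss=0.1846, simple_loss=0.2773, pruned_loss=0.04594]
simple_loss_scale = 0.5   # inferred from the logged numbers
pruned_loss_scale = 1.0   # inferred: the pruned loss enters at full weight here
loss = simple_loss_scale * 0.2773 + pruned_loss_scale * 0.04594
print(round(loss, 4))     # -> 0.1846, matching the log
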
limit=6.0 2023-10-14 21:50:55,048 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1822338.0, ans=0.125 2023-10-14 21:51:01,597 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1822384.6666666667, ans=0.125 2023-10-14 21:51:04,704 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1822384.6666666667, ans=0.2 2023-10-14 21:51:09,317 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.09 vs. limit=15.0 2023-10-14 21:51:21,727 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1822431.3333333333, ans=0.0 2023-10-14 21:51:34,227 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.82 vs. limit=15.0 2023-10-14 21:51:40,381 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1822524.6666666667, ans=0.5 2023-10-14 21:51:42,372 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1822524.6666666667, ans=0.125 2023-10-14 21:51:44,370 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1822524.6666666667, ans=0.125 2023-10-14 21:51:50,944 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1822571.3333333333, ans=0.125 2023-10-14 21:52:00,070 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1822618.0, ans=0.125 2023-10-14 21:52:12,451 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-14 21:52:23,037 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.560e+02 1.833e+02 1.986e+02 2.159e+02 3.578e+02, threshold=3.973e+02, percent-clipped=0.0 2023-10-14 21:52:29,081 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=5.21 vs. limit=15.0 2023-10-14 21:52:51,545 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1822804.6666666667, ans=0.125 2023-10-14 21:52:51,641 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1822804.6666666667, ans=0.1 2023-10-14 21:52:52,883 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=5.57 vs. 
limit=15.0 2023-10-14 21:53:15,392 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1822944.6666666667, ans=0.125 2023-10-14 21:53:20,948 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1822944.6666666667, ans=0.125 2023-10-14 21:53:39,957 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1823038.0, ans=0.0 2023-10-14 21:54:01,376 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1823084.6666666667, ans=0.125 2023-10-14 21:54:17,331 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.500e+02 1.910e+02 2.081e+02 2.413e+02 3.419e+02, threshold=4.162e+02, percent-clipped=0.0 2023-10-14 21:54:27,089 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1823224.6666666667, ans=0.125 2023-10-14 21:54:50,047 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1823318.0, ans=0.125 2023-10-14 21:55:02,048 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1823364.6666666667, ans=0.1 2023-10-14 21:55:02,878 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1823364.6666666667, ans=0.125 2023-10-14 21:55:13,450 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.16 vs. limit=15.0 2023-10-14 21:55:28,600 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1823458.0, ans=0.0 2023-10-14 21:55:50,046 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1823551.3333333333, ans=0.0 2023-10-14 21:56:13,122 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.538e+02 1.876e+02 2.016e+02 2.193e+02 2.736e+02, threshold=4.032e+02, percent-clipped=0.0 2023-10-14 21:56:20,180 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1823691.3333333333, ans=0.125 2023-10-14 21:56:21,444 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1823691.3333333333, ans=0.125 2023-10-14 21:56:33,170 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.09 vs. limit=12.0 2023-10-14 21:56:35,457 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.18 vs. limit=10.0 2023-10-14 21:57:12,937 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.51 vs. limit=15.0 2023-10-14 21:57:18,855 INFO [train.py:1031] (0/4) Epoch 29, batch 8500, loss[loss=0.1807, simple_loss=0.2726, pruned_loss=0.04445, over 16673.00 frames. ], tot_loss[loss=0.1846, simple_loss=0.2776, pruned_loss=0.04581, over 32366809.95 frames. 
], batch size: 61, lr: 1.18e-03, grad_scale: 32.0 2023-10-14 21:57:31,177 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.38 vs. limit=12.0 2023-10-14 21:57:32,577 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1823971.3333333333, ans=0.125 2023-10-14 21:57:41,462 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1824018.0, ans=0.0 2023-10-14 21:57:53,832 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1824064.6666666667, ans=10.0 2023-10-14 21:57:55,682 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1824064.6666666667, ans=0.125 2023-10-14 21:58:04,681 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.627e+02 1.898e+02 2.013e+02 2.180e+02 2.810e+02, threshold=4.026e+02, percent-clipped=0.0 2023-10-14 21:58:08,470 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1824111.3333333333, ans=0.0 2023-10-14 21:58:41,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1824251.3333333333, ans=0.0 2023-10-14 21:59:02,770 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.06 vs. limit=15.0 2023-10-14 21:59:12,207 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.min_positive, batch_count=1824391.3333333333, ans=0.025 2023-10-14 21:59:26,968 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1824391.3333333333, ans=0.2 2023-10-14 21:59:33,537 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1824438.0, ans=0.05 2023-10-14 21:59:50,102 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1824531.3333333333, ans=0.125 2023-10-14 21:59:55,359 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1824531.3333333333, ans=0.125 2023-10-14 21:59:59,893 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1824531.3333333333, ans=0.1 2023-10-14 22:00:04,242 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.474e+02 1.851e+02 2.037e+02 2.320e+02 3.265e+02, threshold=4.075e+02, percent-clipped=0.0 2023-10-14 22:00:05,952 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1824578.0, ans=0.1 2023-10-14 22:00:11,036 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.96 vs. limit=15.0 2023-10-14 22:00:12,089 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.91 vs. 
limit=10.0 2023-10-14 22:00:13,594 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.27 vs. limit=22.5 2023-10-14 22:00:49,733 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1824764.6666666667, ans=0.125 2023-10-14 22:01:30,338 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1824904.6666666667, ans=0.125 2023-10-14 22:01:34,358 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1824904.6666666667, ans=0.0 2023-10-14 22:01:46,126 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 22:01:57,406 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1824998.0, ans=0.0 2023-10-14 22:02:04,911 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.803e+02 1.976e+02 2.193e+02 2.931e+02, threshold=3.953e+02, percent-clipped=0.0 2023-10-14 22:02:05,155 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1825044.6666666667, ans=0.0 2023-10-14 22:02:38,971 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1825138.0, ans=0.125 2023-10-14 22:02:52,950 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1825231.3333333333, ans=0.125 2023-10-14 22:02:55,930 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1825231.3333333333, ans=0.1 2023-10-14 22:03:10,750 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.44 vs. limit=15.0 2023-10-14 22:03:14,753 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.16 vs. limit=15.0 2023-10-14 22:03:16,429 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1825324.6666666667, ans=0.125 2023-10-14 22:03:21,105 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1825324.6666666667, ans=0.125 2023-10-14 22:03:57,387 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.478e+02 1.767e+02 2.019e+02 2.182e+02 2.934e+02, threshold=4.037e+02, percent-clipped=0.0 2023-10-14 22:03:57,753 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1825511.3333333333, ans=0.0 2023-10-14 22:04:08,521 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1825558.0, ans=0.0 2023-10-14 22:04:12,230 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.36 vs. 
limit=15.0 2023-10-14 22:04:20,389 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1825604.6666666667, ans=0.025 2023-10-14 22:04:22,308 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1825604.6666666667, ans=0.0 2023-10-14 22:05:23,672 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.09 vs. limit=15.0 2023-10-14 22:05:28,162 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.29 vs. limit=15.0 2023-10-14 22:05:40,648 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=7.647e-03 2023-10-14 22:05:41,819 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1825931.3333333333, ans=0.125 2023-10-14 22:05:43,798 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1825978.0, ans=0.125 2023-10-14 22:05:46,299 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.642e+02 1.912e+02 2.083e+02 2.341e+02 3.545e+02, threshold=4.166e+02, percent-clipped=0.0 2023-10-14 22:05:57,814 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 22:05:58,847 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1826024.6666666667, ans=0.07 2023-10-14 22:05:59,891 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.90 vs. limit=6.0 2023-10-14 22:06:02,557 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1826024.6666666667, ans=0.0 2023-10-14 22:06:18,042 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1826118.0, ans=0.0 2023-10-14 22:06:32,117 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1826164.6666666667, ans=0.1 2023-10-14 22:06:47,473 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1826211.3333333333, ans=0.125 2023-10-14 22:06:49,874 INFO [train.py:1031] (0/4) Epoch 29, batch 9000, loss[loss=0.222, simple_loss=0.3134, pruned_loss=0.06527, over 16622.00 frames. ], tot_loss[loss=0.1842, simple_loss=0.277, pruned_loss=0.04568, over 32474593.94 frames. 
], batch size: 66, lr: 1.18e-03, grad_scale: 32.0 2023-10-14 22:06:53,471 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1826258.0, ans=0.0 2023-10-14 22:06:59,220 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1826304.6666666667, ans=0.125 2023-10-14 22:07:01,585 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1826304.6666666667, ans=0.1 2023-10-14 22:07:21,361 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1826398.0, ans=0.2 2023-10-14 22:07:27,467 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1826398.0, ans=0.2 2023-10-14 22:07:33,103 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1826444.6666666667, ans=0.125 2023-10-14 22:07:33,626 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 1.819e+02 2.009e+02 2.214e+02 3.178e+02, threshold=4.018e+02, percent-clipped=0.0 2023-10-14 22:07:37,341 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1826444.6666666667, ans=0.125 2023-10-14 22:08:10,310 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=1826584.6666666667, ans=0.02 2023-10-14 22:08:22,215 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1826631.3333333333, ans=0.1 2023-10-14 22:08:23,187 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1826678.0, ans=0.125 2023-10-14 22:08:23,226 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1826678.0, ans=0.1 2023-10-14 22:08:31,170 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1826678.0, ans=0.0 2023-10-14 22:09:00,236 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.17 vs. limit=22.5 2023-10-14 22:09:09,339 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1826864.6666666667, ans=0.0 2023-10-14 22:09:19,700 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.805e+02 2.014e+02 2.223e+02 2.672e+02, threshold=4.029e+02, percent-clipped=0.0 2023-10-14 22:09:23,056 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.03 vs. limit=6.0 2023-10-14 22:09:44,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1827004.6666666667, ans=0.125 2023-10-14 22:09:45,001 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=4.61 vs. limit=15.0 2023-10-14 22:09:51,089 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.60 vs. 
limit=15.0 2023-10-14 22:10:06,232 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1827098.0, ans=0.07 2023-10-14 22:10:09,229 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.22 vs. limit=15.0 2023-10-14 22:10:47,873 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1827284.6666666667, ans=0.125 2023-10-14 22:11:02,803 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1827378.0, ans=0.1 2023-10-14 22:11:04,310 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.556e+02 1.920e+02 2.063e+02 2.269e+02 4.021e+02, threshold=4.126e+02, percent-clipped=0.0 2023-10-14 22:11:21,277 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1827424.6666666667, ans=0.125 2023-10-14 22:11:26,274 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1827471.3333333333, ans=0.0 2023-10-14 22:11:39,774 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1827518.0, ans=0.05 2023-10-14 22:11:43,303 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1827518.0, ans=0.125 2023-10-14 22:11:53,257 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.71 vs. limit=15.0 2023-10-14 22:12:18,154 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.77 vs. limit=15.0 2023-10-14 22:12:33,554 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1827751.3333333333, ans=0.0 2023-10-14 22:12:41,973 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1827798.0, ans=0.2 2023-10-14 22:12:58,747 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.560e+02 1.986e+02 2.202e+02 2.442e+02 3.242e+02, threshold=4.403e+02, percent-clipped=0.0 2023-10-14 22:13:01,081 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1827844.6666666667, ans=0.125 2023-10-14 22:13:21,972 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_abs, batch_count=1827938.0, ans=0.5 2023-10-14 22:13:37,946 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1827984.6666666667, ans=0.0 2023-10-14 22:13:44,788 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.25 vs. 
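The scaling.py:979 Whitening entries compare a per-module "metric" against a scheduled limit. One plausible reading, consistent with the logged values hovering near 1 for well-behaved modules and spiking far above the limit early in training, is a whiteness measure of the feature covariance that equals 1.0 when the covariance is proportional to the identity and grows with the eigenvalue spread. A sketch of such a metric (not necessarily icefall's exact formula):

# Illustrative whiteness metric: ~1.0 when the per-group feature covariance is
# proportional to the identity, larger as the eigenvalue spread grows.
import torch

def whitening_metric(x: torch.Tensor, num_groups: int) -> float:
    # x: (num_frames, num_channels); channels split into equal-sized groups.
    num_frames, num_channels = x.shape
    c = num_channels // num_groups
    x = x.reshape(num_frames, num_groups, c).transpose(0, 1)  # (g, frames, c)
    covar = torch.matmul(x.transpose(1, 2), x) / num_frames   # (g, c, c)
    trace = covar.diagonal(dim1=1, dim2=2).sum(dim=1)         # (g,)
    metric = (covar ** 2).sum(dim=(1, 2)) * c / trace ** 2    # 1.0 if covar = a*I
    return float(metric.mean())

x = torch.randn(1000, 256)      # nearly white features
print(whitening_metric(x, 1))   # -> close to 1.0
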
limit=15.0 2023-10-14 22:13:45,392 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1828031.3333333333, ans=0.0 2023-10-14 22:13:56,372 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1828078.0, ans=0.0 2023-10-14 22:13:56,521 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1828078.0, ans=0.0 2023-10-14 22:14:11,451 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1828124.6666666667, ans=0.0 2023-10-14 22:14:26,023 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1828171.3333333333, ans=0.07 2023-10-14 22:14:45,628 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 22:14:53,440 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1828311.3333333333, ans=0.0 2023-10-14 22:14:54,639 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1828311.3333333333, ans=0.2 2023-10-14 22:14:58,714 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.525e+02 1.889e+02 2.072e+02 2.295e+02 3.905e+02, threshold=4.144e+02, percent-clipped=0.0 2023-10-14 22:15:05,654 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1828358.0, ans=0.125 2023-10-14 22:15:10,237 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1828358.0, ans=0.04949747468305833 2023-10-14 22:15:19,015 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.12 vs. limit=15.0 2023-10-14 22:15:19,799 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.73 vs. limit=22.5 2023-10-14 22:15:20,677 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1828404.6666666667, ans=0.0 2023-10-14 22:15:23,098 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.19 vs. limit=15.0 2023-10-14 22:15:28,918 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=12.30 vs. limit=15.0 2023-10-14 22:15:49,450 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.78 vs. limit=15.0 2023-10-14 22:15:56,078 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1828544.6666666667, ans=0.0 2023-10-14 22:16:01,437 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.68 vs. limit=15.0 2023-10-14 22:16:02,880 INFO [train.py:1031] (0/4) Epoch 29, batch 9500, loss[loss=0.1976, simple_loss=0.281, pruned_loss=0.05713, over 15812.00 frames. ], tot_loss[loss=0.1848, simple_loss=0.2776, pruned_loss=0.04597, over 32512834.50 frames. 
], batch size: 35, lr: 1.18e-03, grad_scale: 16.0 2023-10-14 22:16:06,810 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_ff3.min_abs, batch_count=1828591.3333333333, ans=0.2 2023-10-14 22:16:09,632 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1828591.3333333333, ans=0.95 2023-10-14 22:16:12,493 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1828591.3333333333, ans=0.125 2023-10-14 22:16:12,565 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1828591.3333333333, ans=0.125 2023-10-14 22:16:29,585 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.91 vs. limit=15.0 2023-10-14 22:16:52,747 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.587e+02 1.854e+02 2.043e+02 2.264e+02 3.097e+02, threshold=4.086e+02, percent-clipped=0.0 2023-10-14 22:16:58,227 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1828824.6666666667, ans=0.2 2023-10-14 22:16:58,280 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1828824.6666666667, ans=0.125 2023-10-14 22:17:08,490 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1828824.6666666667, ans=0.0 2023-10-14 22:17:12,848 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1828871.3333333333, ans=0.125 2023-10-14 22:17:24,680 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1828918.0, ans=0.125 2023-10-14 22:17:40,627 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.55 vs. limit=10.0 2023-10-14 22:17:54,479 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=9.77 vs. 
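The grad_scale values in these batch summaries bounce between 16.0 and 32.0 (16.0 at batch 8000 and 9500, 32.0 at 8500, 9000 and 10000), the signature of dynamic fp16 loss scaling: the scale is halved when an overflow is detected in the gradients and doubled again after a stretch of clean steps. A minimal sketch of that policy (constants hypothetical, not the actual scaler used here):

# Illustrative dynamic loss scaling for fp16 training.
class LossScaler:
    def __init__(self, scale: float = 16.0, growth_interval: int = 1000):
        self.scale = scale
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, found_overflow: bool) -> None:
        if found_overflow:
            self.scale /= 2.0       # back off immediately on inf/nan grads
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps >= self.growth_interval:
                self.scale *= 2.0   # e.g. 16.0 -> 32.0, as seen in the log
                self._good_steps = 0
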
limit=22.5 2023-10-14 22:18:06,824 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1829104.6666666667, ans=0.125 2023-10-14 22:18:21,221 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1829151.3333333333, ans=0.125 2023-10-14 22:18:48,017 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.460e+02 1.848e+02 2.037e+02 2.256e+02 4.855e+02, threshold=4.073e+02, percent-clipped=1.0 2023-10-14 22:18:52,495 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1829291.3333333333, ans=0.0 2023-10-14 22:19:02,325 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-392000.pt 2023-10-14 22:19:09,816 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1829338.0, ans=0.0 2023-10-14 22:19:12,819 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1829338.0, ans=0.1 2023-10-14 22:19:33,146 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1829431.3333333333, ans=0.125 2023-10-14 22:20:03,043 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.96 vs. limit=22.5 2023-10-14 22:20:16,829 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1829618.0, ans=0.125 2023-10-14 22:20:20,905 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1829618.0, ans=0.125 2023-10-14 22:20:39,161 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.459e+02 1.822e+02 2.041e+02 2.250e+02 3.166e+02, threshold=4.082e+02, percent-clipped=0.0 2023-10-14 22:20:50,036 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1829758.0, ans=0.0 2023-10-14 22:21:12,194 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1829851.3333333333, ans=0.1 2023-10-14 22:21:43,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1829991.3333333333, ans=0.1 2023-10-14 22:21:50,820 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.36 vs. 
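The checkpoint.py:75 entry above writes checkpoint-392000.pt, named after the global batch index rather than the epoch, so saves land at a fixed batch cadence independent of epoch boundaries. A sketch of that policy, assuming a save-every-N rule keyed on batch_idx_train (function name and interval hypothetical):

# Illustrative periodic checkpointing keyed on the global batch index
# (the log shows zipformer/exp_XL_bpe/checkpoint-392000.pt).
import torch

def maybe_save_checkpoint(model, optimizer, batch_idx_train: int,
                          save_every_n: int, exp_dir: str) -> None:
    if batch_idx_train == 0 or batch_idx_train % save_every_n != 0:
        return
    path = f"{exp_dir}/checkpoint-{batch_idx_train}.pt"
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "batch_idx_train": batch_idx_train}, path)
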
limit=15.0 2023-10-14 22:22:12,378 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1830084.6666666667, ans=0.125 2023-10-14 22:22:18,412 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1830131.3333333333, ans=0.125 2023-10-14 22:22:25,071 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1830178.0, ans=0.5 2023-10-14 22:22:27,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1830178.0, ans=0.125 2023-10-14 22:22:31,113 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.538e+02 1.818e+02 2.024e+02 2.308e+02 3.372e+02, threshold=4.048e+02, percent-clipped=0.0 2023-10-14 22:22:48,916 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1830271.3333333333, ans=0.125 2023-10-14 22:22:53,144 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1830271.3333333333, ans=0.2 2023-10-14 22:23:02,161 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1830318.0, ans=0.125 2023-10-14 22:23:02,241 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1830318.0, ans=0.0 2023-10-14 22:23:14,151 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1830364.6666666667, ans=0.1 2023-10-14 22:23:16,750 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1830364.6666666667, ans=0.2 2023-10-14 22:23:34,023 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.59 vs. limit=22.5 2023-10-14 22:23:37,391 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1830458.0, ans=10.0 2023-10-14 22:24:10,399 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1830598.0, ans=0.1 2023-10-14 22:24:11,370 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1830598.0, ans=0.1 2023-10-14 22:24:19,689 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.535e+02 1.851e+02 1.996e+02 2.210e+02 3.664e+02, threshold=3.992e+02, percent-clipped=0.0 2023-10-14 22:24:47,617 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.89 vs. 
limit=10.0 2023-10-14 22:24:51,303 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1830784.6666666667, ans=0.0 2023-10-14 22:24:57,248 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1830831.3333333333, ans=0.0 2023-10-14 22:25:04,880 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1830831.3333333333, ans=0.125 2023-10-14 22:25:05,818 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1830831.3333333333, ans=0.0 2023-10-14 22:25:13,792 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.40 vs. limit=22.5 2023-10-14 22:25:17,781 INFO [train.py:1031] (0/4) Epoch 29, batch 10000, loss[loss=0.1813, simple_loss=0.268, pruned_loss=0.04725, over 16610.00 frames. ], tot_loss[loss=0.1842, simple_loss=0.2769, pruned_loss=0.04579, over 32572119.96 frames. ], batch size: 61, lr: 1.18e-03, grad_scale: 32.0 2023-10-14 22:25:19,462 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.19 vs. limit=15.0 2023-10-14 22:25:29,518 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1830971.3333333333, ans=0.125 2023-10-14 22:25:44,837 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0 2023-10-14 22:26:06,099 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.606e+02 1.874e+02 2.050e+02 2.392e+02 3.302e+02, threshold=4.100e+02, percent-clipped=0.0 2023-10-14 22:26:12,490 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1831158.0, ans=0.125 2023-10-14 22:26:28,050 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1831204.6666666667, ans=0.1 2023-10-14 22:27:09,939 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1831391.3333333333, ans=0.125 2023-10-14 22:27:37,693 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.58 vs. 
limit=22.5 2023-10-14 22:27:41,503 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1831531.3333333333, ans=0.125 2023-10-14 22:27:43,214 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1831531.3333333333, ans=0.0 2023-10-14 22:27:57,516 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 1.891e+02 2.046e+02 2.257e+02 2.931e+02, threshold=4.093e+02, percent-clipped=0.0 2023-10-14 22:28:41,564 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1831764.6666666667, ans=0.125 2023-10-14 22:29:08,710 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1831858.0, ans=0.2 2023-10-14 22:29:23,072 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1831904.6666666667, ans=0.1 2023-10-14 22:29:27,744 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1831951.3333333333, ans=0.1 2023-10-14 22:29:44,633 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1831998.0, ans=0.0 2023-10-14 22:29:51,380 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.566e+02 1.868e+02 2.041e+02 2.283e+02 3.467e+02, threshold=4.083e+02, percent-clipped=0.0 2023-10-14 22:29:59,685 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1832091.3333333333, ans=0.125 2023-10-14 22:30:06,357 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1832138.0, ans=0.2 2023-10-14 22:30:12,199 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1832138.0, ans=0.05 2023-10-14 22:30:45,764 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1832278.0, ans=0.125 2023-10-14 22:31:03,389 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1832324.6666666667, ans=0.125 2023-10-14 22:31:12,633 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1832371.3333333333, ans=0.125 2023-10-14 22:31:19,407 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1832418.0, ans=0.1 2023-10-14 22:31:40,566 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.03 vs. limit=22.5 2023-10-14 22:31:42,627 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1832511.3333333333, ans=0.125 2023-10-14 22:31:43,977 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.13 vs. 
limit=22.5 2023-10-14 22:31:46,215 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.478e+02 1.824e+02 1.973e+02 2.191e+02 2.952e+02, threshold=3.947e+02, percent-clipped=0.0 2023-10-14 22:32:09,570 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1832604.6666666667, ans=0.05 2023-10-14 22:32:48,015 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1832744.6666666667, ans=0.125 2023-10-14 22:32:49,057 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1832744.6666666667, ans=0.125 2023-10-14 22:32:50,945 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1832744.6666666667, ans=0.0 2023-10-14 22:32:53,503 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1832791.3333333333, ans=0.2 2023-10-14 22:32:56,615 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.47 vs. limit=22.5 2023-10-14 22:32:56,664 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.40 vs. limit=10.0 2023-10-14 22:33:04,272 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1832838.0, ans=0.125 2023-10-14 22:33:05,279 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1832838.0, ans=0.0 2023-10-14 22:33:46,415 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.573e+02 1.801e+02 1.923e+02 2.113e+02 2.761e+02, threshold=3.845e+02, percent-clipped=0.0 2023-10-14 22:34:03,934 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1833071.3333333333, ans=0.0 2023-10-14 22:34:24,299 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1833164.6666666667, ans=0.1 2023-10-14 22:34:29,028 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1833164.6666666667, ans=0.0 2023-10-14 22:34:34,335 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1833211.3333333333, ans=0.125 2023-10-14 22:34:37,687 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1833211.3333333333, ans=0.2 2023-10-14 22:34:45,898 INFO [train.py:1031] (0/4) Epoch 29, batch 10500, loss[loss=0.1821, simple_loss=0.279, pruned_loss=0.04253, over 16924.00 frames. ], tot_loss[loss=0.1847, simple_loss=0.2776, pruned_loss=0.04588, over 32646075.15 frames. ], batch size: 82, lr: 1.17e-03, grad_scale: 32.0 2023-10-14 22:34:46,371 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.64 vs. 
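The batch 10500 summary above shows the learning rate easing from 1.18e-03 to 1.17e-03, consistent with a smooth inverse-power decay in both the batch and epoch dimensions, as in the Eden-style schedules used by Zipformer recipes. A sketch under that assumption (the time constants below are hypothetical defaults, not read from this run):

# Illustrative Eden-style LR schedule: decays smoothly with both the global
# batch count and the epoch count (constants hypothetical).
def eden_lr(base_lr: float, batch: int, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 3.5) -> float:
    return (base_lr
            * ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
            * ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25)
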
limit=12.0 2023-10-14 22:35:02,336 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1833304.6666666667, ans=0.125 2023-10-14 22:35:09,801 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1833351.3333333333, ans=0.0 2023-10-14 22:35:31,282 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.796e+02 1.987e+02 2.179e+02 2.839e+02, threshold=3.973e+02, percent-clipped=0.0 2023-10-14 22:35:41,540 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.83 vs. limit=15.0 2023-10-14 22:35:50,598 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1833538.0, ans=0.125 2023-10-14 22:36:01,450 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1833584.6666666667, ans=0.0 2023-10-14 22:36:18,788 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 22:36:26,286 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1833631.3333333333, ans=0.125 2023-10-14 22:36:26,692 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=15.0 2023-10-14 22:36:36,792 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1833678.0, ans=0.1 2023-10-14 22:36:43,159 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1833724.6666666667, ans=0.2 2023-10-14 22:37:05,820 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.15 vs. limit=15.0 2023-10-14 22:37:10,522 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1833818.0, ans=0.125 2023-10-14 22:37:19,506 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1833864.6666666667, ans=0.0 2023-10-14 22:37:21,436 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1833864.6666666667, ans=0.0 2023-10-14 22:37:24,350 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1833911.3333333333, ans=0.0 2023-10-14 22:37:30,825 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.588e+02 1.875e+02 2.010e+02 2.189e+02 3.350e+02, threshold=4.021e+02, percent-clipped=0.0 2023-10-14 22:37:31,345 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 22:37:35,651 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1833958.0, ans=0.2 2023-10-14 22:38:11,425 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.85 vs. 
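The scaling.py:1069 WithLoss lines report the summed auxiliary penalty attached to a tensor (here the attention weights); a loss-sum of 0.000e+00 means the penalty was inactive on that batch. One way such a penalty could work is to charge values beyond a magnitude limit and fold that charge into the training loss; the sketch below is illustrative only (limit, scale, and function name hypothetical, not icefall's actual mechanism):

# Illustrative auxiliary penalty on a tensor: values beyond +/- limit
# contribute an extra loss term whose sum is logged, as in the
# "WithLoss: ... loss-sum=..." lines above.
import torch

def with_abs_penalty(x: torch.Tensor, limit: float = 25.0,
                     scale: float = 1.0e-04, name: str = "attn_weights"):
    penalty = scale * torch.relu(x.abs() - limit).sum()
    print(f"WithLoss: name={name}, loss-sum={float(penalty):.3e}")
    return x, penalty  # caller adds `penalty` into the training loss
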
limit=22.5 2023-10-14 22:38:16,348 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.57 vs. limit=6.0 2023-10-14 22:38:33,079 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1834191.3333333333, ans=0.125 2023-10-14 22:38:51,282 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.65 vs. limit=6.0 2023-10-14 22:39:08,442 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=1834331.3333333333, ans=10.0 2023-10-14 22:39:09,377 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1834331.3333333333, ans=0.125 2023-10-14 22:39:21,872 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.14 vs. limit=15.0 2023-10-14 22:39:25,253 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.567e+02 1.878e+02 1.982e+02 2.239e+02 3.297e+02, threshold=3.964e+02, percent-clipped=0.0 2023-10-14 22:39:29,150 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1834378.0, ans=0.125 2023-10-14 22:39:30,148 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1834424.6666666667, ans=0.125 2023-10-14 22:39:31,886 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1834424.6666666667, ans=0.0 2023-10-14 22:40:00,521 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1834518.0, ans=0.04949747468305833 2023-10-14 22:40:05,272 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1834564.6666666667, ans=0.1 2023-10-14 22:40:32,259 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1834658.0, ans=0.1 2023-10-14 22:40:33,243 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1834658.0, ans=0.0 2023-10-14 22:40:50,053 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1834751.3333333333, ans=0.0 2023-10-14 22:40:55,903 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1834751.3333333333, ans=0.125 2023-10-14 22:41:01,526 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 22:41:11,029 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1834844.6666666667, ans=0.0 2023-10-14 22:41:16,120 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.594e+02 1.981e+02 2.115e+02 2.438e+02 2.977e+02, threshold=4.230e+02, percent-clipped=0.0 2023-10-14 22:41:32,362 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1834938.0, ans=0.1 2023-10-14 22:42:00,179 INFO [scaling.py:979] (0/4) Whitening: 
name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.56 vs. limit=15.0 2023-10-14 22:42:02,639 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.91 vs. limit=15.0 2023-10-14 22:42:05,757 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.45 vs. limit=15.0 2023-10-14 22:42:08,934 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.67 vs. limit=6.0 2023-10-14 22:42:11,528 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.16 vs. limit=15.0 2023-10-14 22:42:31,121 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1835171.3333333333, ans=0.0 2023-10-14 22:42:33,768 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1835171.3333333333, ans=0.1 2023-10-14 22:42:34,843 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1835171.3333333333, ans=0.125 2023-10-14 22:42:37,520 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1835171.3333333333, ans=0.035 2023-10-14 22:42:41,064 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1835218.0, ans=0.125 2023-10-14 22:42:48,412 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1835218.0, ans=0.0 2023-10-14 22:42:51,033 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.33 vs. 
limit=15.0 2023-10-14 22:42:52,140 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1835264.6666666667, ans=0.1 2023-10-14 22:42:58,555 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1835264.6666666667, ans=0.1 2023-10-14 22:43:10,132 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.771e+02 1.861e+02 2.051e+02 3.057e+02, threshold=3.722e+02, percent-clipped=0.0 2023-10-14 22:43:17,202 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1835358.0, ans=0.2 2023-10-14 22:43:32,549 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1835404.6666666667, ans=0.1 2023-10-14 22:43:33,403 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1835404.6666666667, ans=0.0 2023-10-14 22:43:33,486 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1835404.6666666667, ans=0.1 2023-10-14 22:43:37,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1835451.3333333333, ans=0.125 2023-10-14 22:43:49,126 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1835498.0, ans=0.1 2023-10-14 22:44:06,267 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=8.27 vs. limit=12.0 2023-10-14 22:44:10,719 INFO [train.py:1031] (0/4) Epoch 29, batch 11000, loss[loss=0.2019, simple_loss=0.2908, pruned_loss=0.05651, over 16004.00 frames. ], tot_loss[loss=0.1848, simple_loss=0.2775, pruned_loss=0.04604, over 32669552.16 frames. ], batch size: 296, lr: 1.17e-03, grad_scale: 32.0 2023-10-14 22:44:18,681 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1835591.3333333333, ans=0.125 2023-10-14 22:44:18,809 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1835591.3333333333, ans=0.125 2023-10-14 22:44:26,816 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1835638.0, ans=0.0 2023-10-14 22:44:27,792 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1835638.0, ans=0.09899494936611666 2023-10-14 22:44:40,855 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1835684.6666666667, ans=0.0 2023-10-14 22:44:44,658 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.16 vs. 
limit=15.0 2023-10-14 22:44:55,318 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1835778.0, ans=0.2 2023-10-14 22:44:55,436 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1835778.0, ans=0.0 2023-10-14 22:45:02,310 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.588e+02 1.933e+02 2.132e+02 2.407e+02 3.591e+02, threshold=4.264e+02, percent-clipped=0.0 2023-10-14 22:45:06,771 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1835824.6666666667, ans=0.125 2023-10-14 22:45:24,284 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.72 vs. limit=15.0 2023-10-14 22:45:29,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1835871.3333333333, ans=0.0 2023-10-14 22:45:38,314 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.80 vs. limit=15.0 2023-10-14 22:45:49,424 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.88 vs. limit=15.0 2023-10-14 22:46:21,330 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1836058.0, ans=0.125 2023-10-14 22:46:23,240 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 22:46:34,521 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1836151.3333333333, ans=0.125 2023-10-14 22:46:37,297 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1836151.3333333333, ans=0.125 2023-10-14 22:46:48,926 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1836198.0, ans=0.09899494936611666 2023-10-14 22:46:51,534 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1836198.0, ans=0.0 2023-10-14 22:46:58,567 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1836198.0, ans=0.125 2023-10-14 22:47:08,046 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 1.723e+02 1.894e+02 2.074e+02 3.359e+02, threshold=3.788e+02, percent-clipped=0.0 2023-10-14 22:47:13,070 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1836291.3333333333, ans=0.125 2023-10-14 22:47:33,583 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1836384.6666666667, ans=0.125 2023-10-14 22:47:40,796 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1836384.6666666667, ans=0.0 2023-10-14 22:47:42,872 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.69 vs. 
limit=15.0 2023-10-14 22:48:06,214 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1836524.6666666667, ans=0.1 2023-10-14 22:48:16,963 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1836571.3333333333, ans=0.125 2023-10-14 22:48:36,881 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.61 vs. limit=15.0 2023-10-14 22:48:38,381 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1836618.0, ans=0.1 2023-10-14 22:48:44,635 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.31 vs. limit=15.0 2023-10-14 22:48:46,229 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1836664.6666666667, ans=0.025 2023-10-14 22:48:51,495 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.65 vs. limit=22.5 2023-10-14 22:48:56,704 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.861e+02 2.064e+02 2.307e+02 3.526e+02, threshold=4.128e+02, percent-clipped=0.0 2023-10-14 22:49:25,404 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 22:49:34,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1836851.3333333333, ans=0.125 2023-10-14 22:49:39,503 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1836898.0, ans=0.125 2023-10-14 22:49:55,221 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1836944.6666666667, ans=0.0 2023-10-14 22:49:57,092 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1836944.6666666667, ans=0.125 2023-10-14 22:50:12,662 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.78 vs. 
limit=10.0
2023-10-14 22:50:17,360 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1837038.0, ans=0.125
2023-10-14 22:50:53,749 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.445e+02 1.929e+02 2.185e+02 2.419e+02 3.198e+02, threshold=4.369e+02, percent-clipped=0.0
2023-10-14 22:51:13,100 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-10-14 22:51:32,789 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1837364.6666666667, ans=0.0
2023-10-14 22:51:37,288 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1837364.6666666667, ans=0.125
2023-10-14 22:51:37,375 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1837364.6666666667, ans=0.1
2023-10-14 22:51:43,648 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1837411.3333333333, ans=0.125
2023-10-14 22:52:12,248 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1837551.3333333333, ans=0.5
2023-10-14 22:52:41,142 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1837644.6666666667, ans=0.0
2023-10-14 22:52:45,706 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.592e+02 1.877e+02 2.021e+02 2.337e+02 3.056e+02, threshold=4.043e+02, percent-clipped=0.0
2023-10-14 22:52:55,818 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1837691.3333333333, ans=0.125
2023-10-14 22:53:08,608 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1837738.0, ans=0.2
2023-10-14 22:53:39,694 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1837878.0, ans=0.125
2023-10-14 22:53:42,757 INFO [train.py:1031] (0/4) Epoch 29, batch 11500, loss[loss=0.1929, simple_loss=0.2872, pruned_loss=0.0493, over 16906.00 frames. ], tot_loss[loss=0.1846, simple_loss=0.2772, pruned_loss=0.04601, over 32699144.79 frames. ], batch size: 138, lr: 1.17e-03, grad_scale: 16.0
2023-10-14 22:53:44,997 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1837924.6666666667, ans=0.0
2023-10-14 22:53:52,010 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=1837971.3333333333, ans=0.2
2023-10-14 22:54:01,115 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1837971.3333333333, ans=0.125
2023-10-14 22:54:07,237 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1838018.0, ans=0.0
2023-10-14 22:54:19,448 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1838064.6666666667, ans=0.035
2023-10-14 22:54:34,616 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.483e+02 1.861e+02 2.058e+02 2.262e+02 3.439e+02, threshold=4.116e+02, percent-clipped=0.0
2023-10-14 22:54:50,270 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1838204.6666666667, ans=0.125
2023-10-14 22:54:54,355 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1838204.6666666667, ans=0.1
2023-10-14 22:54:55,462 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1838204.6666666667, ans=0.0
2023-10-14 22:55:05,569 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1838251.3333333333, ans=0.1
2023-10-14 22:55:15,014 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1838298.0, ans=0.125
2023-10-14 22:55:24,423 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1838298.0, ans=0.1
2023-10-14 22:55:39,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1838391.3333333333, ans=0.125
2023-10-14 22:55:51,916 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1838438.0, ans=0.125
2023-10-14 22:56:12,716 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1838531.3333333333, ans=0.125
2023-10-14 22:56:16,270 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1838531.3333333333, ans=0.125
2023-10-14 22:56:16,434 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-14 22:56:22,304 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.55 vs. limit=15.0
2023-10-14 22:56:30,363 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=12.33 vs.
limit=15.0 2023-10-14 22:56:30,741 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.581e+02 1.798e+02 1.920e+02 2.174e+02 3.078e+02, threshold=3.840e+02, percent-clipped=0.0 2023-10-14 22:56:43,438 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn1.whiten.whitening_limit, batch_count=1838624.6666666667, ans=22.5 2023-10-14 22:56:46,794 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1838671.3333333333, ans=0.125 2023-10-14 22:56:58,599 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1838718.0, ans=0.125 2023-10-14 22:57:10,661 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1838764.6666666667, ans=0.0 2023-10-14 22:57:33,796 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1838858.0, ans=0.0 2023-10-14 22:57:38,462 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1838904.6666666667, ans=0.125 2023-10-14 22:57:44,526 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-14 22:57:55,046 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1838951.3333333333, ans=0.125 2023-10-14 22:58:14,250 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1839044.6666666667, ans=0.2 2023-10-14 22:58:16,440 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1839044.6666666667, ans=0.125 2023-10-14 22:58:18,441 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.520e+02 1.843e+02 2.032e+02 2.316e+02 3.006e+02, threshold=4.064e+02, percent-clipped=0.0 2023-10-14 22:58:24,672 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1839091.3333333333, ans=0.0 2023-10-14 22:58:29,043 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.45 vs. 
limit=12.0 2023-10-14 22:59:21,555 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1839278.0, ans=0.04949747468305833 2023-10-14 22:59:49,597 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1839371.3333333333, ans=0.1 2023-10-14 23:00:00,442 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1839418.0, ans=0.1 2023-10-14 23:00:06,465 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1839464.6666666667, ans=0.125 2023-10-14 23:00:15,904 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1839464.6666666667, ans=0.125 2023-10-14 23:00:25,501 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.452e+02 1.823e+02 1.942e+02 2.123e+02 2.607e+02, threshold=3.883e+02, percent-clipped=0.0 2023-10-14 23:00:36,514 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.65 vs. limit=15.0 2023-10-14 23:00:45,719 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1839604.6666666667, ans=0.1 2023-10-14 23:00:47,118 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1839604.6666666667, ans=0.1 2023-10-14 23:00:48,950 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1839604.6666666667, ans=0.125 2023-10-14 23:01:00,857 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1839651.3333333333, ans=0.125 2023-10-14 23:01:07,482 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1839698.0, ans=0.0 2023-10-14 23:01:10,942 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1839698.0, ans=0.1 2023-10-14 23:01:14,151 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1839698.0, ans=0.125 2023-10-14 23:01:21,712 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1839744.6666666667, ans=0.125 2023-10-14 23:01:50,308 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1839884.6666666667, ans=0.125 2023-10-14 23:01:58,262 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1839931.3333333333, ans=0.125 2023-10-14 23:02:06,667 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1839931.3333333333, ans=0.125 2023-10-14 23:02:15,091 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.52 vs. 
limit=15.0 2023-10-14 23:02:18,081 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.467e+02 1.815e+02 1.986e+02 2.186e+02 3.302e+02, threshold=3.972e+02, percent-clipped=0.0 2023-10-14 23:02:19,348 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1839978.0, ans=0.1 2023-10-14 23:02:22,038 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1840024.6666666667, ans=0.125 2023-10-14 23:02:23,229 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1840024.6666666667, ans=0.125 2023-10-14 23:02:42,473 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.87 vs. limit=15.0 2023-10-14 23:02:50,342 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0 2023-10-14 23:03:05,254 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1840211.3333333333, ans=0.125 2023-10-14 23:03:10,623 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1840211.3333333333, ans=10.0 2023-10-14 23:03:14,485 INFO [train.py:1031] (0/4) Epoch 29, batch 12000, loss[loss=0.1735, simple_loss=0.2765, pruned_loss=0.03526, over 16925.00 frames. ], tot_loss[loss=0.1844, simple_loss=0.2773, pruned_loss=0.04573, over 32732157.48 frames. ], batch size: 93, lr: 1.17e-03, grad_scale: 32.0 2023-10-14 23:03:26,118 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1840304.6666666667, ans=0.0 2023-10-14 23:03:43,823 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1840351.3333333333, ans=0.0 2023-10-14 23:03:47,127 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1840398.0, ans=0.0 2023-10-14 23:04:00,476 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1840444.6666666667, ans=0.125 2023-10-14 23:04:07,215 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.701e+02 1.935e+02 2.157e+02 2.475e+02 3.360e+02, threshold=4.313e+02, percent-clipped=0.0 2023-10-14 23:04:09,475 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.33 vs. 
limit=10.0 2023-10-14 23:04:12,918 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1840491.3333333333, ans=0.125 2023-10-14 23:04:12,920 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 23:04:26,749 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1840538.0, ans=0.0 2023-10-14 23:04:32,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1840538.0, ans=0.125 2023-10-14 23:04:38,471 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.23 vs. limit=15.0 2023-10-14 23:04:49,634 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.21 vs. limit=22.5 2023-10-14 23:04:53,433 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.20 vs. limit=6.0 2023-10-14 23:05:02,512 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 23:05:21,324 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1840771.3333333333, ans=0.125 2023-10-14 23:05:26,314 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.40 vs. limit=15.0 2023-10-14 23:05:34,012 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1840818.0, ans=0.2 2023-10-14 23:05:35,102 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1840818.0, ans=0.125 2023-10-14 23:05:43,166 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.58 vs. limit=15.0 2023-10-14 23:05:55,427 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.481e+02 1.765e+02 1.960e+02 2.105e+02 3.597e+02, threshold=3.920e+02, percent-clipped=0.0 2023-10-14 23:06:10,904 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1841004.6666666667, ans=0.1 2023-10-14 23:06:18,969 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=11.47 vs. limit=22.5 2023-10-14 23:06:31,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1841098.0, ans=0.125 2023-10-14 23:06:34,831 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1841098.0, ans=0.125 2023-10-14 23:06:46,492 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.31 vs. 
limit=15.0 2023-10-14 23:07:04,419 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1841238.0, ans=0.125 2023-10-14 23:07:12,182 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1841284.6666666667, ans=0.125 2023-10-14 23:07:17,825 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1841284.6666666667, ans=0.1 2023-10-14 23:07:41,695 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.615e+02 1.871e+02 2.032e+02 2.252e+02 4.868e+02, threshold=4.064e+02, percent-clipped=1.0 2023-10-14 23:07:43,805 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1841424.6666666667, ans=0.125 2023-10-14 23:07:43,848 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1841424.6666666667, ans=0.1 2023-10-14 23:08:04,422 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1841518.0, ans=0.125 2023-10-14 23:08:12,269 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1841518.0, ans=0.0 2023-10-14 23:08:31,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1841611.3333333333, ans=0.2 2023-10-14 23:08:34,745 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1841611.3333333333, ans=0.1 2023-10-14 23:08:44,668 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1841658.0, ans=0.125 2023-10-14 23:08:57,535 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1841704.6666666667, ans=0.1 2023-10-14 23:09:06,292 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1841751.3333333333, ans=0.125 2023-10-14 23:09:13,387 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1841798.0, ans=0.125 2023-10-14 23:09:31,937 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.559e+02 1.877e+02 2.083e+02 2.351e+02 3.267e+02, threshold=4.166e+02, percent-clipped=0.0 2023-10-14 23:09:59,919 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.94 vs. limit=15.0 2023-10-14 23:10:12,616 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1842031.3333333333, ans=0.0 2023-10-14 23:10:25,845 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.98 vs. 
limit=15.0
2023-10-14 23:10:33,783 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1842124.6666666667, ans=0.2
2023-10-14 23:10:42,642 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1842171.3333333333, ans=0.125
2023-10-14 23:10:44,508 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=1842171.3333333333, ans=10.0
2023-10-14 23:10:47,853 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1842171.3333333333, ans=0.0
2023-10-14 23:10:47,958 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1842171.3333333333, ans=0.125
2023-10-14 23:10:49,296 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1842171.3333333333, ans=0.125
2023-10-14 23:10:56,139 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=1842218.0, ans=0.1
2023-10-14 23:11:06,247 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1842264.6666666667, ans=0.125
2023-10-14 23:11:25,802 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.603e+02 1.953e+02 2.145e+02 2.374e+02 3.106e+02, threshold=4.291e+02, percent-clipped=0.0
2023-10-14 23:11:27,508 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.32 vs. limit=15.0
2023-10-14 23:11:27,582 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=1842358.0, ans=22.5
2023-10-14 23:11:41,720 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1842404.6666666667, ans=0.1
2023-10-14 23:11:46,201 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1842404.6666666667, ans=0.1
2023-10-14 23:12:10,967 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1842498.0, ans=0.1
2023-10-14 23:12:15,193 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1842544.6666666667, ans=0.125
2023-10-14 23:12:19,827 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.min_positive, batch_count=1842544.6666666667, ans=0.025
2023-10-14 23:12:23,816 INFO [train.py:1031] (0/4) Epoch 29, batch 12500, loss[loss=0.1877, simple_loss=0.2841, pruned_loss=0.04562, over 16838.00 frames. ], tot_loss[loss=0.1843, simple_loss=0.277, pruned_loss=0.04579, over 32741223.22 frames. ], batch size: 175, lr: 1.17e-03, grad_scale: 32.0
2023-10-14 23:12:25,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1842591.3333333333, ans=0.0
2023-10-14 23:12:30,213 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1842591.3333333333, ans=0.0
2023-10-14 23:12:38,733 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1842638.0, ans=0.2
2023-10-14 23:12:38,838 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1842638.0, ans=0.0
2023-10-14 23:12:42,341 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1842638.0, ans=0.0
2023-10-14 23:12:57,377 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.10 vs. limit=15.0
2023-10-14 23:13:01,337 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1842731.3333333333, ans=0.04949747468305833
2023-10-14 23:13:03,195 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.98 vs. limit=15.0
2023-10-14 23:13:11,543 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1842778.0, ans=0.125
2023-10-14 23:13:14,140 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.575e+02 1.908e+02 2.054e+02 2.262e+02 3.234e+02, threshold=4.108e+02, percent-clipped=0.0
2023-10-14 23:13:24,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1842824.6666666667, ans=0.125
2023-10-14 23:13:54,171 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1842964.6666666667, ans=0.125
2023-10-14 23:14:07,143 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1843011.3333333333, ans=0.0
2023-10-14 23:14:08,911 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1843058.0, ans=0.125
2023-10-14 23:14:13,288 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.57 vs.
limit=15.0 2023-10-14 23:14:19,601 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1843104.6666666667, ans=0.1 2023-10-14 23:14:20,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1843104.6666666667, ans=0.1 2023-10-14 23:14:33,893 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1843151.3333333333, ans=0.0 2023-10-14 23:14:37,872 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1843151.3333333333, ans=0.1 2023-10-14 23:14:40,390 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1843151.3333333333, ans=0.0 2023-10-14 23:14:43,440 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1843198.0, ans=0.0 2023-10-14 23:14:46,175 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.51 vs. limit=12.0 2023-10-14 23:14:48,066 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.51 vs. limit=15.0 2023-10-14 23:14:53,172 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1843244.6666666667, ans=0.0 2023-10-14 23:14:54,213 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1843244.6666666667, ans=0.1 2023-10-14 23:15:02,626 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.816e+02 1.996e+02 2.202e+02 2.932e+02, threshold=3.992e+02, percent-clipped=0.0 2023-10-14 23:16:15,551 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1843571.3333333333, ans=0.125 2023-10-14 23:16:37,334 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1843664.6666666667, ans=0.0 2023-10-14 23:16:50,334 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.537e+02 1.786e+02 1.942e+02 2.228e+02 3.260e+02, threshold=3.884e+02, percent-clipped=0.0 2023-10-14 23:16:59,036 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1843758.0, ans=0.0 2023-10-14 23:17:01,319 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1843758.0, ans=0.125 2023-10-14 23:17:03,421 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.605e-03 2023-10-14 23:17:24,855 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.63 vs. limit=15.0 2023-10-14 23:17:28,774 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.84 vs. 
limit=15.0 2023-10-14 23:17:40,359 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1843944.6666666667, ans=0.125 2023-10-14 23:17:58,547 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.49 vs. limit=22.5 2023-10-14 23:18:36,893 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.657e+02 1.887e+02 2.050e+02 2.182e+02 3.045e+02, threshold=4.099e+02, percent-clipped=0.0 2023-10-14 23:18:44,076 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1844224.6666666667, ans=0.125 2023-10-14 23:18:49,159 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1844224.6666666667, ans=0.125 2023-10-14 23:18:53,647 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1844271.3333333333, ans=0.0 2023-10-14 23:18:54,966 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1844271.3333333333, ans=0.2 2023-10-14 23:18:55,170 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.67 vs. limit=10.0 2023-10-14 23:19:01,291 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1844318.0, ans=0.1 2023-10-14 23:19:07,113 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1844318.0, ans=0.125 2023-10-14 23:19:09,732 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1844318.0, ans=0.2 2023-10-14 23:19:09,814 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1844318.0, ans=0.1 2023-10-14 23:19:41,214 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1844458.0, ans=10.0 2023-10-14 23:20:15,133 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1844598.0, ans=0.0 2023-10-14 23:20:25,006 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.28 vs. limit=6.0 2023-10-14 23:20:25,042 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=3.88 vs. 
limit=15.0 2023-10-14 23:20:27,796 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.555e+02 1.834e+02 1.966e+02 2.169e+02 2.783e+02, threshold=3.933e+02, percent-clipped=0.0 2023-10-14 23:20:28,094 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1844644.6666666667, ans=0.0 2023-10-14 23:20:43,772 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1844738.0, ans=0.125 2023-10-14 23:20:48,394 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1844738.0, ans=15.0 2023-10-14 23:20:49,907 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1844784.6666666667, ans=0.125 2023-10-14 23:20:55,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1844784.6666666667, ans=0.1 2023-10-14 23:20:56,685 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1844784.6666666667, ans=0.2 2023-10-14 23:21:00,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1844831.3333333333, ans=0.125 2023-10-14 23:21:10,776 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1844878.0, ans=0.0 2023-10-14 23:21:18,011 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.26 vs. limit=10.0 2023-10-14 23:21:20,250 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1844878.0, ans=0.125 2023-10-14 23:21:21,802 INFO [train.py:1031] (0/4) Epoch 29, batch 13000, loss[loss=0.1831, simple_loss=0.2699, pruned_loss=0.04812, over 16624.00 frames. ], tot_loss[loss=0.1849, simple_loss=0.2778, pruned_loss=0.046, over 32768911.62 frames. ], batch size: 66, lr: 1.17e-03, grad_scale: 32.0 2023-10-14 23:21:37,210 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1844971.3333333333, ans=0.1 2023-10-14 23:21:43,871 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 23:21:47,488 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1845018.0, ans=0.1 2023-10-14 23:21:55,263 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 23:21:59,265 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1845018.0, ans=0.2 2023-10-14 23:22:21,969 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.598e+02 1.854e+02 2.027e+02 2.210e+02 2.850e+02, threshold=4.055e+02, percent-clipped=0.0 2023-10-14 23:22:27,748 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.53 vs. 
limit=15.0 2023-10-14 23:22:36,409 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1845204.6666666667, ans=0.07 2023-10-14 23:22:42,158 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.51 vs. limit=15.0 2023-10-14 23:22:59,945 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1845251.3333333333, ans=0.1 2023-10-14 23:23:15,035 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1845344.6666666667, ans=0.125 2023-10-14 23:23:33,630 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1845438.0, ans=0.125 2023-10-14 23:23:43,907 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1845484.6666666667, ans=0.125 2023-10-14 23:23:53,993 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1845484.6666666667, ans=0.125 2023-10-14 23:23:59,246 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1845531.3333333333, ans=0.0 2023-10-14 23:24:15,904 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 1.796e+02 1.922e+02 2.112e+02 3.031e+02, threshold=3.844e+02, percent-clipped=0.0 2023-10-14 23:24:25,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1845624.6666666667, ans=0.0 2023-10-14 23:24:28,983 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.31 vs. limit=12.0 2023-10-14 23:24:39,722 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.29 vs. limit=10.0 2023-10-14 23:24:46,366 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.44 vs. limit=15.0 2023-10-14 23:25:15,727 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1845811.3333333333, ans=0.0 2023-10-14 23:25:16,876 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1845858.0, ans=0.2 2023-10-14 23:25:16,964 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.89 vs. 
limit=15.0 2023-10-14 23:25:22,971 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1845858.0, ans=0.0 2023-10-14 23:25:23,804 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1845858.0, ans=0.1 2023-10-14 23:25:24,607 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1845858.0, ans=0.125 2023-10-14 23:25:33,302 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1845904.6666666667, ans=0.125 2023-10-14 23:25:50,566 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1845998.0, ans=0.025 2023-10-14 23:26:03,255 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.09 vs. limit=15.0 2023-10-14 23:26:03,813 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1846044.6666666667, ans=0.125 2023-10-14 23:26:06,337 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1846044.6666666667, ans=0.1 2023-10-14 23:26:08,943 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.773e+02 1.943e+02 2.081e+02 2.778e+02, threshold=3.885e+02, percent-clipped=0.0 2023-10-14 23:26:17,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1846091.3333333333, ans=0.1 2023-10-14 23:26:22,477 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1846138.0, ans=0.1 2023-10-14 23:26:26,934 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1846138.0, ans=0.1 2023-10-14 23:26:41,845 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1846184.6666666667, ans=0.125 2023-10-14 23:26:45,185 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.77 vs. limit=15.0 2023-10-14 23:26:47,160 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1846231.3333333333, ans=0.0 2023-10-14 23:26:48,048 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1846231.3333333333, ans=0.125 2023-10-14 23:27:09,817 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1846324.6666666667, ans=0.0 2023-10-14 23:27:35,418 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1846418.0, ans=0.09899494936611666 2023-10-14 23:27:58,962 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.602e+02 1.875e+02 2.033e+02 2.193e+02 3.109e+02, threshold=4.065e+02, percent-clipped=0.0 2023-10-14 23:28:09,183 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.96 vs. 
limit=15.0 2023-10-14 23:28:27,272 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1846651.3333333333, ans=0.0 2023-10-14 23:29:15,449 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1846838.0, ans=0.1 2023-10-14 23:29:16,482 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1846838.0, ans=0.0 2023-10-14 23:29:24,491 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1846884.6666666667, ans=0.0 2023-10-14 23:29:28,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1846884.6666666667, ans=0.125 2023-10-14 23:29:34,011 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1846931.3333333333, ans=0.125 2023-10-14 23:29:35,058 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1846931.3333333333, ans=0.07 2023-10-14 23:29:52,862 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.447e+02 1.799e+02 1.994e+02 2.284e+02 3.347e+02, threshold=3.989e+02, percent-clipped=0.0 2023-10-14 23:30:15,822 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1847118.0, ans=0.125 2023-10-14 23:30:35,924 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 23:30:49,492 INFO [train.py:1031] (0/4) Epoch 29, batch 13500, loss[loss=0.1802, simple_loss=0.269, pruned_loss=0.04565, over 16069.00 frames. ], tot_loss[loss=0.1841, simple_loss=0.277, pruned_loss=0.04562, over 32795558.67 frames. ], batch size: 296, lr: 1.17e-03, grad_scale: 16.0 2023-10-14 23:31:00,286 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 23:31:35,095 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1847444.6666666667, ans=0.0 2023-10-14 23:31:41,046 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1847444.6666666667, ans=0.125 2023-10-14 23:31:44,410 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=1847444.6666666667, ans=0.2 2023-10-14 23:31:45,011 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.643e+02 1.865e+02 2.062e+02 2.247e+02 3.122e+02, threshold=4.124e+02, percent-clipped=0.0 2023-10-14 23:31:58,469 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1847538.0, ans=0.125 2023-10-14 23:32:00,688 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.27 vs. limit=22.5 2023-10-14 23:32:08,479 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.66 vs. 
limit=15.0 2023-10-14 23:32:28,431 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1847631.3333333333, ans=0.1 2023-10-14 23:32:38,678 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1847678.0, ans=0.2 2023-10-14 23:32:55,729 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1847771.3333333333, ans=0.125 2023-10-14 23:32:59,292 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1847818.0, ans=0.0 2023-10-14 23:33:02,613 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1847818.0, ans=0.2 2023-10-14 23:33:14,572 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1847864.6666666667, ans=0.125 2023-10-14 23:33:25,870 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.578e+02 1.908e+02 2.134e+02 2.335e+02 3.342e+02, threshold=4.268e+02, percent-clipped=0.0 2023-10-14 23:33:32,413 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/epoch-29.pt 2023-10-14 23:34:01,770 INFO [train.py:1031] (0/4) Epoch 30, batch 0, loss[loss=0.1517, simple_loss=0.2486, pruned_loss=0.02734, over 16900.00 frames. ], tot_loss[loss=0.1517, simple_loss=0.2486, pruned_loss=0.02734, over 16900.00 frames. ], batch size: 104, lr: 1.15e-03, grad_scale: 32.0 2023-10-14 23:34:01,771 INFO [train.py:1054] (0/4) Computing validation loss 2023-10-14 23:34:07,658 INFO [zipformer.py:1853] (0/4) name=encoder.encoders.3.encoder.layers.2.self_attn_weights, attn_weights_entropy = tensor([1.7714, 2.0182, 1.9541, 2.2656, 2.1770, 2.2165, 2.3641, 1.7020], device='cuda:0') 2023-10-14 23:34:09,354 INFO [train.py:1063] (0/4) Epoch 30, validation: loss=0.2121, simple_loss=0.2987, pruned_loss=0.06271, over 1020973.00 frames. 2023-10-14 23:34:09,355 INFO [train.py:1064] (0/4) Maximum memory allocated so far is 17165MB 2023-10-14 23:34:28,571 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1848028.0, ans=0.0 2023-10-14 23:34:59,189 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=4.88 vs. limit=15.0 2023-10-14 23:35:31,666 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1848308.0, ans=0.125 2023-10-14 23:35:34,189 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1848308.0, ans=0.125 2023-10-14 23:35:50,300 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1848354.6666666667, ans=0.125 2023-10-14 23:35:56,201 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.576e+02 1.802e+02 1.972e+02 2.221e+02 2.697e+02, threshold=3.944e+02, percent-clipped=0.0 2023-10-14 23:36:00,560 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.67 vs. 
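
The attn_weights_entropy tensor printed at zipformer.py:1853 just before the epoch-30 validation pass above is a per-head diagnostic: each value (1.77 to 2.36 here) is the average entropy of one attention head's weight distribution, so higher numbers indicate a head that spreads its attention over many frames and lower numbers a nearly one-hot head. A minimal sketch of how such a statistic can be computed; the exact tensor layout and reduction order are assumptions.

import torch

def attn_weights_entropy(attn: torch.Tensor) -> torch.Tensor:
    """Mean entropy of attention distributions, one value per head.

    attn: (num_heads, batch, query_len, key_len), where each row sums
    to 1 over the key axis (softmax output).
    """
    eps = 1.0e-20
    ent = -(attn * (attn + eps).log()).sum(dim=-1)  # (heads, batch, query)
    return ent.mean(dim=(1, 2))                     # (heads,)

# Yields a tensor like tensor([1.7714, 2.0182, ...]) as in the record
# above; log(key_len) would be the maximum attainable entropy per head.
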
limit=15.0 2023-10-14 23:36:08,828 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1848448.0, ans=0.1 2023-10-14 23:36:28,018 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-14 23:36:43,956 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1848588.0, ans=0.05 2023-10-14 23:37:21,918 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1848774.6666666667, ans=0.125 2023-10-14 23:37:34,928 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1848821.3333333333, ans=0.125 2023-10-14 23:37:37,630 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1848821.3333333333, ans=0.125 2023-10-14 23:37:45,178 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.856e+02 1.984e+02 2.208e+02 2.663e+02, threshold=3.968e+02, percent-clipped=0.0 2023-10-14 23:37:47,401 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1848868.0, ans=0.125 2023-10-14 23:37:54,269 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1848914.6666666667, ans=0.0 2023-10-14 23:37:55,310 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.39 vs. limit=12.0 2023-10-14 23:38:46,603 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1849101.3333333333, ans=0.125 2023-10-14 23:38:47,828 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.14 vs. limit=12.0 2023-10-14 23:38:51,168 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.57 vs. limit=22.5 2023-10-14 23:39:25,104 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1849288.0, ans=0.0 2023-10-14 23:39:34,110 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1849334.6666666667, ans=0.0 2023-10-14 23:39:38,479 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.569e+02 1.861e+02 2.020e+02 2.185e+02 2.894e+02, threshold=4.041e+02, percent-clipped=0.0 2023-10-14 23:39:51,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1849381.3333333333, ans=0.125 2023-10-14 23:39:53,368 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1849428.0, ans=0.04949747468305833 2023-10-14 23:39:55,632 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.19 vs. 
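
On the [checkpoint.py:75] record above ("Saving checkpoint to zipformer/exp_XL_bpe/epoch-29.pt"): end-of-epoch checkpoints bundle the model together with training state so the run can resume at the next epoch. A hedged sketch of such a save; the exact key set written by icefall's checkpoint.py is not reproduced here.

import torch

def save_checkpoint(filename, model, optimizer, scheduler, params):
    """Sketch of an end-of-epoch checkpoint like epoch-29.pt; keys are
    illustrative, not icefall's actual checkpoint schema."""
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "scheduler": scheduler.state_dict(),
            "params": params,
        },
        filename,
    )

# e.g. save_checkpoint("zipformer/exp_XL_bpe/epoch-29.pt", model, opt, sched, params)
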
limit=15.0 2023-10-14 23:40:08,349 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1849474.6666666667, ans=0.125 2023-10-14 23:40:21,211 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1849521.3333333333, ans=0.0 2023-10-14 23:40:24,724 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1849521.3333333333, ans=0.0 2023-10-14 23:40:30,932 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1849568.0, ans=0.1 2023-10-14 23:40:53,661 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1849661.3333333333, ans=0.1 2023-10-14 23:41:02,603 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1849708.0, ans=0.125 2023-10-14 23:41:11,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1849754.6666666667, ans=0.125 2023-10-14 23:41:24,001 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.574e+02 1.893e+02 2.014e+02 2.285e+02 3.360e+02, threshold=4.028e+02, percent-clipped=0.0 2023-10-14 23:41:25,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1849801.3333333333, ans=0.2 2023-10-14 23:41:42,828 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1849894.6666666667, ans=0.125 2023-10-14 23:41:50,589 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1849894.6666666667, ans=0.2 2023-10-14 23:42:03,956 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1849988.0, ans=0.125 2023-10-14 23:42:04,200 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.75 vs. 
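
For the [scaling.py:979] Whitening records, the logged metric summarizes how far a module output's channel covariance is from a multiple of the identity, and it is compared against a limit that is itself a scheduled value (see the whitening_limit entries below, e.g. ans=12.0 and ans=22.5); presumably the whitening penalty only activates when the metric exceeds the limit. One plausible formulation of such a metric, equal to 1.0 for perfectly white features and growing with the eigenvalue spread of the covariance; this is a hedged reconstruction, not the exact scaling.py code.

import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
    """x: (num_frames, num_channels); channels split into num_groups.

    Returns the mean over groups of trace(C @ C) * d / trace(C)**2,
    where C is the per-group covariance and d the per-group channel
    count. By Cauchy-Schwarz this is >= 1, with equality iff C is a
    multiple of the identity (fully 'white' features).
    """
    n, c = x.shape
    d = c // num_groups
    x = x.reshape(n, num_groups, d).transpose(0, 1)   # (groups, n, d)
    cov = torch.matmul(x.transpose(1, 2), x) / n      # (groups, d, d)
    tr = cov.diagonal(dim1=1, dim2=2).sum(-1)         # trace(C)
    tr_sq = (cov * cov).sum(dim=(1, 2))               # trace(C @ C)
    return (tr_sq * d / (tr * tr + 1.0e-20)).mean()

Under this definition a record like "metric=12.57 vs. limit=22.5" is comfortably inside its budget, while a metric far above the limit would flag a strongly anisotropic module output.
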
limit=10.0 2023-10-14 23:42:06,332 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1849988.0, ans=0.0 2023-10-14 23:42:11,507 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1849988.0, ans=0.0 2023-10-14 23:42:33,821 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1850081.3333333333, ans=0.0 2023-10-14 23:42:33,873 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1850081.3333333333, ans=0.1 2023-10-14 23:42:37,397 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1850128.0, ans=0.0 2023-10-14 23:42:40,837 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1850128.0, ans=0.015 2023-10-14 23:43:04,511 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 23:43:05,599 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1850221.3333333333, ans=0.0 2023-10-14 23:43:07,729 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1850221.3333333333, ans=0.2 2023-10-14 23:43:09,368 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.48 vs. limit=6.0 2023-10-14 23:43:17,970 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.555e+02 1.887e+02 2.049e+02 2.316e+02 3.250e+02, threshold=4.097e+02, percent-clipped=0.0 2023-10-14 23:43:20,790 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1850268.0, ans=0.1 2023-10-14 23:43:23,076 INFO [train.py:1031] (0/4) Epoch 30, batch 500, loss[loss=0.179, simple_loss=0.273, pruned_loss=0.04246, over 16858.00 frames. ], tot_loss[loss=0.1847, simple_loss=0.2776, pruned_loss=0.04588, over 7291367.83 frames. ], batch size: 110, lr: 1.15e-03, grad_scale: 16.0 2023-10-14 23:43:35,930 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.35 vs. limit=15.0 2023-10-14 23:43:45,410 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.whiten.whitening_limit, batch_count=1850408.0, ans=12.0 2023-10-14 23:44:27,043 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.32 vs. limit=12.0 2023-10-14 23:44:50,818 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1850688.0, ans=0.125 2023-10-14 23:45:02,777 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1850734.6666666667, ans=0.125 2023-10-14 23:45:04,775 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1850734.6666666667, ans=0.125 2023-10-14 23:45:05,908 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.40 vs. 
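
Across the train.py:1031 records in this section the batch size swings between 35 and 296 while the per-batch frame counts stay near 16-17k: batches are filled to a roughly constant total duration rather than to a fixed number of utterances, so batches of short cuts hold many utterances and batches of long cuts hold few. A toy sketch of duration-constrained batching (not lhotse's DynamicBucketingSampler, which additionally buckets by length and shuffles); the max_duration default is a parameter of the sketch, not necessarily this run's setting.

def duration_batches(cuts, max_duration: float = 700.0):
    """Toy duration-constrained batching: yield lists of (id, seconds)
    whose summed duration stays under max_duration."""
    batch, total = [], 0.0
    for cut_id, seconds in cuts:
        if batch and total + seconds > max_duration:
            yield batch
            batch, total = [], 0.0
        batch.append((cut_id, seconds))
        total += seconds
    if batch:
        yield batch
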
limit=15.0 2023-10-14 23:45:08,324 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.614e+02 1.887e+02 2.080e+02 2.271e+02 2.941e+02, threshold=4.160e+02, percent-clipped=0.0 2023-10-14 23:45:11,635 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1850781.3333333333, ans=0.1 2023-10-14 23:45:45,063 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.62 vs. limit=6.0 2023-10-14 23:45:54,061 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1850921.3333333333, ans=0.1 2023-10-14 23:46:14,869 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1851014.6666666667, ans=0.125 2023-10-14 23:46:19,736 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.80 vs. limit=15.0 2023-10-14 23:46:20,293 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1851061.3333333333, ans=0.125 2023-10-14 23:46:31,420 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 23:46:32,290 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1851108.0, ans=0.0 2023-10-14 23:46:36,517 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1851108.0, ans=0.1 2023-10-14 23:46:40,284 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=1851154.6666666667, ans=0.2 2023-10-14 23:46:56,297 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.98 vs. limit=12.0 2023-10-14 23:46:56,547 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.595e+02 1.870e+02 2.052e+02 2.234e+02 3.246e+02, threshold=4.104e+02, percent-clipped=0.0 2023-10-14 23:47:00,627 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1851248.0, ans=0.1 2023-10-14 23:47:15,334 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1851294.6666666667, ans=0.2 2023-10-14 23:47:42,396 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.26 vs. limit=15.0 2023-10-14 23:47:57,783 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1851481.3333333333, ans=0.2 2023-10-14 23:47:58,782 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1851481.3333333333, ans=0.0 2023-10-14 23:48:02,033 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1851481.3333333333, ans=10.0 2023-10-14 23:48:06,125 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.57 vs. 
limit=15.0 2023-10-14 23:48:07,233 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1851528.0, ans=0.125 2023-10-14 23:48:14,795 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.44 vs. limit=15.0 2023-10-14 23:48:29,670 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.83 vs. limit=15.0 2023-10-14 23:48:46,245 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.881e+02 2.075e+02 2.337e+02 2.891e+02, threshold=4.151e+02, percent-clipped=0.0 2023-10-14 23:48:52,499 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=1851714.6666666667, ans=0.5 2023-10-14 23:49:15,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1851808.0, ans=0.125 2023-10-14 23:49:45,748 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.25 vs. limit=15.0 2023-10-14 23:49:56,203 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-14 23:50:03,033 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.05 vs. limit=15.0 2023-10-14 23:50:03,885 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1851948.0, ans=0.125 2023-10-14 23:50:32,240 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1852088.0, ans=0.125 2023-10-14 23:50:34,104 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1852088.0, ans=0.2 2023-10-14 23:50:36,399 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1852088.0, ans=0.125 2023-10-14 23:50:39,481 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1852134.6666666667, ans=0.0 2023-10-14 23:50:40,552 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.16 vs. limit=15.0 2023-10-14 23:50:46,601 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.601e+02 1.886e+02 2.080e+02 2.343e+02 2.900e+02, threshold=4.161e+02, percent-clipped=0.0 2023-10-14 23:50:52,933 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1852181.3333333333, ans=0.125 2023-10-14 23:50:53,807 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1852181.3333333333, ans=0.125 2023-10-14 23:50:59,135 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1852181.3333333333, ans=0.1 2023-10-14 23:51:05,865 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=7.17 vs. 
limit=15.0 2023-10-14 23:51:27,145 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1852321.3333333333, ans=0.125 2023-10-14 23:51:30,704 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.42 vs. limit=22.5 2023-10-14 23:52:13,246 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1852508.0, ans=0.0 2023-10-14 23:52:41,839 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.547e+02 1.884e+02 2.035e+02 2.238e+02 3.170e+02, threshold=4.071e+02, percent-clipped=0.0 2023-10-14 23:52:45,250 INFO [train.py:1031] (0/4) Epoch 30, batch 1000, loss[loss=0.1723, simple_loss=0.2733, pruned_loss=0.03571, over 16817.00 frames. ], tot_loss[loss=0.1851, simple_loss=0.278, pruned_loss=0.04608, over 12950295.12 frames. ], batch size: 98, lr: 1.15e-03, grad_scale: 16.0 2023-10-14 23:53:08,659 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1852741.3333333333, ans=0.125 2023-10-14 23:53:13,702 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.65 vs. limit=15.0 2023-10-14 23:53:13,808 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=4.38 vs. limit=15.0 2023-10-14 23:53:15,326 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1852741.3333333333, ans=0.125 2023-10-14 23:53:15,652 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.18 vs. limit=6.0 2023-10-14 23:53:58,913 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.60 vs. limit=6.0 2023-10-14 23:54:27,450 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.545e+02 1.829e+02 2.006e+02 2.205e+02 3.000e+02, threshold=4.011e+02, percent-clipped=0.0 2023-10-14 23:54:42,533 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1853161.3333333333, ans=0.0 2023-10-14 23:54:43,603 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1853161.3333333333, ans=0.125 2023-10-14 23:54:49,892 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1853161.3333333333, ans=0.125 2023-10-14 23:55:10,150 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.46 vs. 
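
The per-batch losses in the train.py:1031 records are internally consistent with a fixed linear combination of the two transducer losses: loss = 0.5 * simple_loss + pruned_loss (e.g. 0.5 * 0.2733 + 0.03571 = 0.17236, matching the reported 0.1723 for epoch 30, batch 1000; the same identity holds for the batch 0, 500, 1500 and 2000 records). A minimal sketch of that combination; the function name and signature are illustrative, not icefall's actual API.

def combine_transducer_losses(
    simple_loss: float,
    pruned_loss: float,
    simple_loss_scale: float = 0.5,
) -> float:
    """Total loss as logged: a down-weighted 'simple' (full-sum)
    transducer loss that trains the pruning bounds, plus the pruned
    RNN-T loss evaluated only inside those bounds."""
    return simple_loss_scale * simple_loss + pruned_loss

# Reproduces the epoch 30, batch 1000 record above:
assert abs(combine_transducer_losses(0.2733, 0.03571) - 0.1723) < 5e-4
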
limit=15.0 2023-10-14 23:55:28,485 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1853301.3333333333, ans=0.2 2023-10-14 23:55:31,250 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1853348.0, ans=0.2 2023-10-14 23:55:33,168 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1853348.0, ans=0.2 2023-10-14 23:55:47,391 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1853394.6666666667, ans=0.125 2023-10-14 23:56:01,689 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.42 vs. limit=15.0 2023-10-14 23:56:01,789 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.50 vs. limit=10.0 2023-10-14 23:56:31,068 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.478e+02 1.802e+02 1.947e+02 2.164e+02 4.554e+02, threshold=3.894e+02, percent-clipped=1.0 2023-10-14 23:57:09,086 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.62 vs. limit=22.5 2023-10-14 23:57:20,533 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.66 vs. limit=15.0 2023-10-14 23:57:22,394 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1853768.0, ans=0.125 2023-10-14 23:57:36,833 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1853814.6666666667, ans=0.015 2023-10-14 23:57:37,075 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1853814.6666666667, ans=0.1 2023-10-14 23:57:46,978 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1853861.3333333333, ans=0.125 2023-10-14 23:58:07,856 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1853954.6666666667, ans=0.0 2023-10-14 23:58:22,568 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1854001.3333333333, ans=0.0 2023-10-14 23:58:23,279 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.749e+02 1.871e+02 2.037e+02 2.473e+02, threshold=3.741e+02, percent-clipped=0.0 2023-10-14 23:58:32,297 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1854048.0, ans=0.0 2023-10-14 23:59:07,324 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1854234.6666666667, ans=0.0 2023-10-14 23:59:25,038 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1854281.3333333333, ans=10.0 2023-10-14 23:59:27,992 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1854281.3333333333, ans=0.0 2023-10-14 23:59:43,661 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1854374.6666666667, ans=0.125 2023-10-14 23:59:43,969 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.00 vs. limit=10.0 2023-10-14 23:59:56,895 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-15 00:00:12,540 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.582e+02 1.783e+02 1.971e+02 2.180e+02 3.057e+02, threshold=3.942e+02, percent-clipped=0.0 2023-10-15 00:00:13,237 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1854514.6666666667, ans=15.0 2023-10-15 00:00:27,434 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1854561.3333333333, ans=0.09899494936611666 2023-10-15 00:00:29,294 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-15 00:00:30,129 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1854561.3333333333, ans=0.125 2023-10-15 00:00:42,547 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1854608.0, ans=0.125 2023-10-15 00:00:44,439 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1854608.0, ans=0.125 2023-10-15 00:00:45,610 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1854654.6666666667, ans=0.2 2023-10-15 00:00:47,748 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1854654.6666666667, ans=0.125 2023-10-15 00:00:47,854 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1854654.6666666667, ans=0.1 2023-10-15 00:00:52,781 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1854654.6666666667, ans=0.125 2023-10-15 00:00:54,980 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1854654.6666666667, ans=0.2 2023-10-15 00:01:06,220 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1854701.3333333333, ans=0.1 2023-10-15 00:01:13,887 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.63 vs. 
limit=6.0 2023-10-15 00:01:32,471 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1854841.3333333333, ans=0.125 2023-10-15 00:01:34,314 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1854841.3333333333, ans=0.0 2023-10-15 00:01:35,063 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1854841.3333333333, ans=0.035 2023-10-15 00:01:39,331 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1854841.3333333333, ans=0.125 2023-10-15 00:01:45,904 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1854888.0, ans=0.04949747468305833 2023-10-15 00:01:54,227 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1854934.6666666667, ans=0.1 2023-10-15 00:02:04,817 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 1.828e+02 2.019e+02 2.227e+02 3.813e+02, threshold=4.037e+02, percent-clipped=0.0 2023-10-15 00:02:05,742 INFO [train.py:1031] (0/4) Epoch 30, batch 1500, loss[loss=0.2154, simple_loss=0.2955, pruned_loss=0.06763, over 15475.00 frames. ], tot_loss[loss=0.1837, simple_loss=0.2764, pruned_loss=0.04552, over 17328025.30 frames. ], batch size: 35, lr: 1.15e-03, grad_scale: 8.0 2023-10-15 00:02:06,782 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1854981.3333333333, ans=0.125 2023-10-15 00:02:12,436 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1854981.3333333333, ans=0.0 2023-10-15 00:02:42,714 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1855121.3333333333, ans=0.125 2023-10-15 00:03:04,959 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1855214.6666666667, ans=0.5 2023-10-15 00:03:07,115 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1855214.6666666667, ans=0.125 2023-10-15 00:03:07,123 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1855214.6666666667, ans=0.125 2023-10-15 00:03:23,666 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1855261.3333333333, ans=0.125 2023-10-15 00:03:37,170 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1855354.6666666667, ans=0.125 2023-10-15 00:03:51,273 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1855401.3333333333, ans=0.125 2023-10-15 00:03:58,735 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.612e+02 1.870e+02 2.076e+02 2.405e+02 3.528e+02, threshold=4.152e+02, percent-clipped=0.0 2023-10-15 00:03:58,977 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1855448.0, ans=0.2 2023-10-15 00:04:08,904 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1855448.0, ans=0.0 2023-10-15 00:04:14,971 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.52 vs. limit=15.0 2023-10-15 00:04:20,889 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.25 vs. limit=22.5 2023-10-15 00:04:21,716 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.90 vs. limit=6.0 2023-10-15 00:04:41,939 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1855588.0, ans=0.2 2023-10-15 00:04:43,170 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1855588.0, ans=0.2 2023-10-15 00:04:54,387 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=1855634.6666666667, ans=0.95 2023-10-15 00:05:04,942 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1855681.3333333333, ans=0.125 2023-10-15 00:05:10,653 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.21 vs. limit=15.0 2023-10-15 00:05:16,370 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1855728.0, ans=0.125 2023-10-15 00:05:24,636 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.10 vs. limit=15.0 2023-10-15 00:05:39,414 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1855821.3333333333, ans=0.125 2023-10-15 00:05:40,486 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-15 00:05:41,629 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.59 vs. limit=6.0 2023-10-15 00:05:53,241 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.619e+02 1.868e+02 2.043e+02 2.325e+02 3.101e+02, threshold=4.085e+02, percent-clipped=0.0 2023-10-15 00:06:04,300 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-15 00:06:24,351 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.09 vs. 
limit=22.5 2023-10-15 00:06:50,243 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1856148.0, ans=0.125 2023-10-15 00:07:01,920 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1856194.6666666667, ans=0.125 2023-10-15 00:07:43,200 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.805e+02 1.963e+02 2.142e+02 3.150e+02, threshold=3.925e+02, percent-clipped=0.0 2023-10-15 00:07:46,030 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1856381.3333333333, ans=0.1 2023-10-15 00:07:51,620 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1856381.3333333333, ans=0.0 2023-10-15 00:07:51,884 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.60 vs. limit=15.0 2023-10-15 00:08:01,465 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.97 vs. limit=10.0 2023-10-15 00:08:34,811 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1856568.0, ans=0.1 2023-10-15 00:08:46,552 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1856614.6666666667, ans=0.125 2023-10-15 00:08:47,399 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1856614.6666666667, ans=0.1 2023-10-15 00:08:47,505 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1856614.6666666667, ans=0.125 2023-10-15 00:08:57,236 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.76 vs. 
limit=15.0 2023-10-15 00:09:10,097 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1856708.0, ans=0.1 2023-10-15 00:09:18,051 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1856754.6666666667, ans=0.0 2023-10-15 00:09:22,107 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1856801.3333333333, ans=0.0 2023-10-15 00:09:23,075 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1856801.3333333333, ans=0.0 2023-10-15 00:09:33,259 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.468e+02 1.844e+02 1.984e+02 2.214e+02 3.420e+02, threshold=3.968e+02, percent-clipped=0.0 2023-10-15 00:09:43,592 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1856894.6666666667, ans=0.1 2023-10-15 00:09:55,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1856941.3333333333, ans=0.125 2023-10-15 00:09:58,017 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-15 00:09:59,043 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1856941.3333333333, ans=0.125 2023-10-15 00:10:24,741 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1857034.6666666667, ans=0.0 2023-10-15 00:10:40,388 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1857081.3333333333, ans=0.2 2023-10-15 00:10:41,561 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1857081.3333333333, ans=0.125 2023-10-15 00:11:02,018 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1857174.6666666667, ans=0.0 2023-10-15 00:11:30,848 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1857268.0, ans=0.07 2023-10-15 00:11:36,633 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1857314.6666666667, ans=0.125 2023-10-15 00:11:37,172 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.591e+02 1.808e+02 1.953e+02 2.178e+02 3.476e+02, threshold=3.906e+02, percent-clipped=0.0 2023-10-15 00:11:37,221 INFO [train.py:1031] (0/4) Epoch 30, batch 2000, loss[loss=0.1826, simple_loss=0.2826, pruned_loss=0.04132, over 16801.00 frames. ], tot_loss[loss=0.1845, simple_loss=0.2773, pruned_loss=0.04582, over 20747674.16 frames. 
], batch size: 175, lr: 1.15e-03, grad_scale: 16.0 2023-10-15 00:11:51,793 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1857361.3333333333, ans=0.0 2023-10-15 00:12:43,942 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1857548.0, ans=0.1 2023-10-15 00:13:13,126 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1857688.0, ans=0.125 2023-10-15 00:13:38,337 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.523e+02 1.779e+02 1.970e+02 2.192e+02 3.720e+02, threshold=3.941e+02, percent-clipped=0.0 2023-10-15 00:14:16,815 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-15 00:14:45,180 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1857968.0, ans=0.0 2023-10-15 00:15:00,654 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=15.0 2023-10-15 00:15:04,590 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.47 vs. limit=12.0 2023-10-15 00:15:16,194 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1858061.3333333333, ans=0.125 2023-10-15 00:15:21,566 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1858108.0, ans=0.125 2023-10-15 00:15:38,125 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1858154.6666666667, ans=0.07 2023-10-15 00:15:53,246 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.631e+02 1.880e+02 2.060e+02 2.209e+02 3.063e+02, threshold=4.120e+02, percent-clipped=0.0 2023-10-15 00:16:09,225 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.75 vs. limit=15.0 2023-10-15 00:16:30,757 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1858388.0, ans=0.125 2023-10-15 00:16:42,603 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1858434.6666666667, ans=0.2 2023-10-15 00:17:00,139 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1858528.0, ans=0.0 2023-10-15 00:17:03,850 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1858528.0, ans=0.0 2023-10-15 00:17:16,944 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.18 vs. 
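
tot_loss in these records is not a plain epoch average: its weight ("over 20747674.16 frames" at batch 2000 above) climbs toward the tens of millions while a single batch carries only ~16k frames, and at batch 0 of the epoch tot_loss coincided exactly with the first batch's loss. That behaviour matches an exponentially decayed accumulator of (loss * frames, frames) pairs. A sketch under that assumption; the decay constant is chosen only to reproduce the observed ~3.3e7-frame plateau and is not taken from the code.

class RunningLoss:
    """Frame-weighted, exponentially decayed loss accumulator.

    With decay = 1 - 1/2000 the effective window is ~2000 batches, so
    the reported frame count saturates near 2000 * frames_per_batch
    (consistent with the ~3.3e7 frames at epoch 29, batch 13500 and
    the still-ramping ~2.1e7 frames at epoch 30, batch 2000), while
    batch 0 of an epoch reports exactly the first batch's loss.
    """

    def __init__(self, decay: float = 1.0 - 1.0 / 2000):
        self.decay = decay
        self.loss_sum = 0.0   # decayed sum of loss * frames
        self.frames = 0.0     # decayed sum of frames

    def update(self, loss: float, num_frames: float) -> None:
        self.loss_sum = self.decay * self.loss_sum + loss * num_frames
        self.frames = self.decay * self.frames + num_frames

    @property
    def value(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)
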
limit=22.5 2023-10-15 00:17:18,281 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-15 00:17:22,613 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1858621.3333333333, ans=0.0 2023-10-15 00:17:40,449 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.838e+02 2.043e+02 2.258e+02 3.292e+02, threshold=4.086e+02, percent-clipped=0.0 2023-10-15 00:18:02,814 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=6.71 vs. limit=15.0 2023-10-15 00:18:05,303 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1858808.0, ans=0.0 2023-10-15 00:18:06,470 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.22 vs. limit=22.5 2023-10-15 00:18:09,643 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1858808.0, ans=0.2 2023-10-15 00:18:18,721 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1858854.6666666667, ans=0.125 2023-10-15 00:18:26,964 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1858901.3333333333, ans=0.125 2023-10-15 00:18:28,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1858901.3333333333, ans=0.125 2023-10-15 00:18:33,393 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1858948.0, ans=0.125 2023-10-15 00:18:47,600 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1858994.6666666667, ans=0.1 2023-10-15 00:19:14,430 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1859088.0, ans=0.125 2023-10-15 00:19:23,683 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1859134.6666666667, ans=0.0 2023-10-15 00:19:28,042 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.603e+02 1.919e+02 2.075e+02 2.253e+02 2.869e+02, threshold=4.149e+02, percent-clipped=0.0 2023-10-15 00:19:29,254 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1859181.3333333333, ans=0.0 2023-10-15 00:19:31,931 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=1859181.3333333333, ans=0.05 2023-10-15 00:19:37,798 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1859181.3333333333, ans=0.1 2023-10-15 00:20:09,441 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1859321.3333333333, ans=0.1 2023-10-15 00:20:14,446 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=1859368.0, ans=22.5 2023-10-15 00:20:26,290 INFO [scaling.py:199] (0/4) 
ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1859414.6666666667, ans=0.125 2023-10-15 00:20:27,465 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.86 vs. limit=15.0 2023-10-15 00:20:42,383 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1859461.3333333333, ans=0.0 2023-10-15 00:20:49,039 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1859508.0, ans=0.125 2023-10-15 00:20:52,589 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1859508.0, ans=0.125 2023-10-15 00:21:15,403 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.533e+02 1.817e+02 1.939e+02 2.098e+02 2.744e+02, threshold=3.877e+02, percent-clipped=0.0 2023-10-15 00:21:15,439 INFO [train.py:1031] (0/4) Epoch 30, batch 2500, loss[loss=0.1786, simple_loss=0.2705, pruned_loss=0.04337, over 17020.00 frames. ], tot_loss[loss=0.1844, simple_loss=0.2773, pruned_loss=0.04578, over 23392410.79 frames. ], batch size: 117, lr: 1.15e-03, grad_scale: 32.0 2023-10-15 00:21:34,847 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1859694.6666666667, ans=0.0 2023-10-15 00:22:21,941 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.09 vs. limit=22.5 2023-10-15 00:22:24,389 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1859928.0, ans=0.125 2023-10-15 00:22:27,823 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.64 vs. limit=10.0 2023-10-15 00:22:39,216 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1860021.3333333333, ans=0.0 2023-10-15 00:22:57,623 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1860068.0, ans=0.125 2023-10-15 00:22:58,036 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=10.84 vs. limit=15.0 2023-10-15 00:22:59,495 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.508e+02 1.847e+02 2.029e+02 2.222e+02 3.312e+02, threshold=4.058e+02, percent-clipped=0.0 2023-10-15 00:23:06,602 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.90 vs. limit=15.0 2023-10-15 00:24:04,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1860348.0, ans=0.0 2023-10-15 00:24:04,621 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.74 vs. limit=6.0 2023-10-15 00:24:09,599 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.21 vs. 
limit=12.0 2023-10-15 00:24:16,706 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1860394.6666666667, ans=0.1 2023-10-15 00:24:18,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1860441.3333333333, ans=0.2 2023-10-15 00:24:18,609 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1860441.3333333333, ans=0.125 2023-10-15 00:24:53,523 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1860581.3333333333, ans=0.0 2023-10-15 00:24:54,419 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.567e+02 1.867e+02 2.050e+02 2.248e+02 3.055e+02, threshold=4.100e+02, percent-clipped=0.0 2023-10-15 00:24:57,028 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0 2023-10-15 00:25:15,019 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1860628.0, ans=0.0 2023-10-15 00:25:23,775 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1860674.6666666667, ans=0.125 2023-10-15 00:25:34,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1860721.3333333333, ans=0.125 2023-10-15 00:25:35,808 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-15 00:25:43,541 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1860768.0, ans=0.125 2023-10-15 00:25:44,855 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten.whitening_limit, batch_count=1860768.0, ans=15.0 2023-10-15 00:25:55,697 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1860814.6666666667, ans=0.2 2023-10-15 00:26:36,747 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.29 vs. limit=22.5 2023-10-15 00:26:37,354 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1860954.6666666667, ans=0.125 2023-10-15 00:26:54,768 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.471e+02 1.817e+02 1.951e+02 2.202e+02 3.582e+02, threshold=3.901e+02, percent-clipped=0.0 2023-10-15 00:27:19,609 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1861141.3333333333, ans=0.1 2023-10-15 00:27:26,449 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1861188.0, ans=0.125 2023-10-15 00:27:33,292 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-15 00:27:41,757 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.68 vs. 
limit=6.0 2023-10-15 00:27:49,128 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1861234.6666666667, ans=0.0 2023-10-15 00:28:20,356 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1861374.6666666667, ans=0.1 2023-10-15 00:28:33,932 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1861421.3333333333, ans=0.125 2023-10-15 00:28:58,567 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.578e+02 1.819e+02 1.967e+02 2.136e+02 2.828e+02, threshold=3.935e+02, percent-clipped=0.0 2023-10-15 00:29:12,318 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1861561.3333333333, ans=0.0 2023-10-15 00:29:59,117 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1861748.0, ans=0.1 2023-10-15 00:30:01,119 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1861794.6666666667, ans=0.05 2023-10-15 00:30:07,520 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1861794.6666666667, ans=0.07 2023-10-15 00:30:17,905 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.55 vs. limit=12.0 2023-10-15 00:30:44,617 INFO [train.py:1031] (0/4) Epoch 30, batch 3000, loss[loss=0.1539, simple_loss=0.2502, pruned_loss=0.0288, over 16948.00 frames. ], tot_loss[loss=0.184, simple_loss=0.2766, pruned_loss=0.04572, over 25470084.06 frames. ], batch size: 93, lr: 1.15e-03, grad_scale: 16.0 2023-10-15 00:30:46,453 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.564e+02 1.798e+02 1.977e+02 2.189e+02 4.054e+02, threshold=3.955e+02, percent-clipped=1.0 2023-10-15 00:30:48,919 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1861981.3333333333, ans=0.09899494936611666 2023-10-15 00:30:50,544 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1861981.3333333333, ans=0.0 2023-10-15 00:31:06,558 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1862074.6666666667, ans=0.125 2023-10-15 00:31:28,259 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1862168.0, ans=0.0 2023-10-15 00:31:28,263 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1862168.0, ans=0.2 2023-10-15 00:31:36,478 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1862168.0, ans=0.2 2023-10-15 00:31:36,586 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.69 vs. limit=15.0 2023-10-15 00:31:36,746 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.36 vs. 
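
The grad_scale field in the train.py:1031 records moves among 8.0, 16.0 and 32.0 across this section; together with the occasional percent-clipped=1.0 spikes in the [optim.py:471] lines, this is the signature of dynamic loss scaling under fp16 training: the scale doubles after a long run of overflow-free steps and halves whenever a step produces inf/nan gradients. A sketch using PyTorch's stock torch.cuda.amp.GradScaler, which implements exactly that doubling/halving rule; the constructor arguments shown are PyTorch defaults, not necessarily this run's schedule.

import torch

model = torch.nn.Linear(80, 500).cuda()
opt = torch.optim.SGD(model.parameters(), lr=1.15e-3)
scaler = torch.cuda.amp.GradScaler(
    init_scale=2.0**16,   # PyTorch defaults; the run's values differ
    growth_factor=2.0,    # double after growth_interval clean steps
    backoff_factor=0.5,   # halve on inf/nan gradients
    growth_interval=2000,
)

for step in range(100):
    x = torch.randn(8, 80, device="cuda")
    with torch.cuda.amp.autocast():
        loss = model(x).square().mean()
    opt.zero_grad()
    scaler.scale(loss).backward()
    scaler.step(opt)   # skipped internally if gradients overflowed
    scaler.update()    # grows or backs off the scale
    # scaler.get_scale() is the 'grad_scale' value seen in the log
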
limit=22.5 2023-10-15 00:31:55,592 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-15 00:32:03,439 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1862308.0, ans=0.125 2023-10-15 00:32:04,579 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.47 vs. limit=15.0 2023-10-15 00:32:05,529 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.75 vs. limit=15.0 2023-10-15 00:32:23,098 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1862401.3333333333, ans=0.125 2023-10-15 00:32:39,100 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.551e+02 1.915e+02 2.077e+02 2.261e+02 4.211e+02, threshold=4.154e+02, percent-clipped=1.0 2023-10-15 00:32:49,940 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1862494.6666666667, ans=0.2 2023-10-15 00:32:57,968 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1862494.6666666667, ans=0.0 2023-10-15 00:33:04,130 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.95 vs. limit=6.0 2023-10-15 00:33:07,037 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1862541.3333333333, ans=0.125 2023-10-15 00:33:08,771 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1862541.3333333333, ans=0.125 2023-10-15 00:33:09,820 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1862541.3333333333, ans=0.125 2023-10-15 00:33:14,036 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-15 00:34:14,359 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1862821.3333333333, ans=0.0 2023-10-15 00:34:29,785 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.512e+02 1.788e+02 1.983e+02 2.184e+02 2.870e+02, threshold=3.966e+02, percent-clipped=0.0 2023-10-15 00:34:31,242 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1862914.6666666667, ans=0.125 2023-10-15 00:34:31,693 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.20 vs. limit=15.0 2023-10-15 00:34:34,256 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.46 vs. 
limit=15.0 2023-10-15 00:34:34,956 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1862914.6666666667, ans=0.0 2023-10-15 00:34:37,250 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1862961.3333333333, ans=0.0 2023-10-15 00:34:58,657 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1863008.0, ans=0.0 2023-10-15 00:34:59,925 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1863008.0, ans=0.2 2023-10-15 00:35:03,262 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.20 vs. limit=15.0 2023-10-15 00:35:18,062 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=1863054.6666666667, ans=0.0 2023-10-15 00:35:39,164 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1863148.0, ans=0.125 2023-10-15 00:35:42,913 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1863148.0, ans=0.125 2023-10-15 00:35:56,952 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1863194.6666666667, ans=0.04949747468305833 2023-10-15 00:36:36,349 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.630e+02 1.829e+02 2.037e+02 2.240e+02 2.885e+02, threshold=4.075e+02, percent-clipped=0.0 2023-10-15 00:36:41,408 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1863381.3333333333, ans=0.0 2023-10-15 00:36:46,148 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1863428.0, ans=0.125 2023-10-15 00:36:57,866 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.90 vs. limit=15.0 2023-10-15 00:37:01,029 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1863474.6666666667, ans=0.2 2023-10-15 00:37:09,591 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=1863521.3333333333, ans=0.125 2023-10-15 00:37:11,598 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=15.45 vs. 
limit=22.5 2023-10-15 00:37:22,774 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1863568.0, ans=0.125 2023-10-15 00:37:24,476 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1863568.0, ans=0.125 2023-10-15 00:37:34,780 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1863661.3333333333, ans=0.125 2023-10-15 00:37:38,222 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1863661.3333333333, ans=0.125 2023-10-15 00:37:55,150 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=1863708.0, ans=0.125 2023-10-15 00:38:00,696 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.15 vs. limit=22.5 2023-10-15 00:38:12,394 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1863754.6666666667, ans=0.125 2023-10-15 00:38:17,156 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=11.31 vs. limit=22.5 2023-10-15 00:38:23,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1863848.0, ans=0.0 2023-10-15 00:38:23,312 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1863848.0, ans=0.2 2023-10-15 00:38:26,629 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1863848.0, ans=0.125 2023-10-15 00:38:27,180 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.578e+02 1.887e+02 2.029e+02 2.203e+02 2.920e+02, threshold=4.057e+02, percent-clipped=0.0 2023-10-15 00:38:30,044 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.94 vs. 
limit=22.5 2023-10-15 00:38:35,441 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1863894.6666666667, ans=0.125 2023-10-15 00:38:39,904 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1863894.6666666667, ans=0.125 2023-10-15 00:38:48,173 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=1863941.3333333333, ans=0.025 2023-10-15 00:39:03,090 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-15 00:39:06,089 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1864034.6666666667, ans=0.125 2023-10-15 00:39:29,746 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1864128.0, ans=0.1 2023-10-15 00:39:32,455 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1864128.0, ans=0.1 2023-10-15 00:39:33,351 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1864128.0, ans=0.2 2023-10-15 00:39:36,250 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1864128.0, ans=0.0 2023-10-15 00:39:36,264 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1864128.0, ans=0.125 2023-10-15 00:39:38,632 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1864174.6666666667, ans=0.125 2023-10-15 00:39:38,715 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1864174.6666666667, ans=0.0 2023-10-15 00:39:40,602 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1864174.6666666667, ans=0.125 2023-10-15 00:40:00,442 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.02 vs. limit=10.0 2023-10-15 00:40:01,302 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1864268.0, ans=0.125 2023-10-15 00:40:10,118 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-15 00:40:13,569 INFO [train.py:1031] (0/4) Epoch 30, batch 3500, loss[loss=0.1898, simple_loss=0.2561, pruned_loss=0.06171, over 12430.00 frames. ], tot_loss[loss=0.1843, simple_loss=0.2767, pruned_loss=0.04592, over 27090974.92 frames. ], batch size: 440, lr: 1.14e-03, grad_scale: 16.0 2023-10-15 00:40:16,194 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.536e+02 1.842e+02 1.992e+02 2.110e+02 3.078e+02, threshold=3.984e+02, percent-clipped=0.0 2023-10-15 00:40:24,116 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1864361.3333333333, ans=0.2 2023-10-15 00:40:37,634 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.30 vs. 
limit=22.5 2023-10-15 00:40:48,353 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1864454.6666666667, ans=0.1 2023-10-15 00:41:09,425 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1864548.0, ans=0.125 2023-10-15 00:41:22,373 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.17 vs. limit=15.0 2023-10-15 00:41:45,161 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1864641.3333333333, ans=0.0 2023-10-15 00:41:50,722 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1864688.0, ans=0.1 2023-10-15 00:41:55,085 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1864688.0, ans=0.2 2023-10-15 00:41:59,441 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1864688.0, ans=0.1 2023-10-15 00:42:17,192 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.895e+02 2.070e+02 2.280e+02 3.076e+02, threshold=4.140e+02, percent-clipped=0.0 2023-10-15 00:42:21,656 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1864781.3333333333, ans=6.0 2023-10-15 00:42:24,048 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-10-15 00:42:24,206 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.04 vs. limit=22.5 2023-10-15 00:42:35,089 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1864828.0, ans=0.125 2023-10-15 00:42:43,612 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1864874.6666666667, ans=0.125 2023-10-15 00:42:47,803 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1864874.6666666667, ans=0.125 2023-10-15 00:42:52,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1864921.3333333333, ans=0.0 2023-10-15 00:42:52,247 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1864921.3333333333, ans=0.2 2023-10-15 00:43:00,039 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1864921.3333333333, ans=0.125 2023-10-15 00:43:08,019 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.96 vs. 
limit=15.0 2023-10-15 00:43:14,098 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1865014.6666666667, ans=0.125 2023-10-15 00:43:15,124 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-15 00:43:17,321 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1865014.6666666667, ans=0.07 2023-10-15 00:43:19,939 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1865014.6666666667, ans=0.125 2023-10-15 00:43:24,487 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1865014.6666666667, ans=0.1 2023-10-15 00:43:44,762 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1865108.0, ans=0.0 2023-10-15 00:44:07,514 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.83 vs. limit=15.0 2023-10-15 00:44:07,689 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.60 vs. limit=22.5 2023-10-15 00:44:10,076 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1865248.0, ans=0.0 2023-10-15 00:44:14,629 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.520e+02 1.800e+02 1.894e+02 2.079e+02 3.098e+02, threshold=3.788e+02, percent-clipped=0.0 2023-10-15 00:44:16,756 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1865248.0, ans=0.125 2023-10-15 00:44:21,584 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.97 vs. 
limit=15.0 2023-10-15 00:44:23,159 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1865294.6666666667, ans=0.2 2023-10-15 00:44:39,688 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1865341.3333333333, ans=0.2 2023-10-15 00:44:58,809 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-15 00:45:06,245 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1865434.6666666667, ans=0.0 2023-10-15 00:45:32,384 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1865528.0, ans=0.125 2023-10-15 00:45:40,608 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1865574.6666666667, ans=0.125 2023-10-15 00:45:47,429 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1865574.6666666667, ans=0.0 2023-10-15 00:45:59,420 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1865668.0, ans=0.125 2023-10-15 00:46:09,985 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1865714.6666666667, ans=0.125 2023-10-15 00:46:12,110 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1865714.6666666667, ans=0.125 2023-10-15 00:46:14,282 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.23 vs. limit=15.0 2023-10-15 00:46:14,662 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.574e+02 1.904e+02 2.034e+02 2.215e+02 3.907e+02, threshold=4.067e+02, percent-clipped=2.0 2023-10-15 00:46:24,660 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1865761.3333333333, ans=0.2 2023-10-15 00:46:32,757 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.34 vs. limit=15.0 2023-10-15 00:46:34,004 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.40 vs. limit=22.5 2023-10-15 00:46:42,251 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.72 vs. limit=22.5 2023-10-15 00:46:42,289 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.79 vs. 
limit=15.0 2023-10-15 00:46:54,124 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1865854.6666666667, ans=0.0 2023-10-15 00:47:32,325 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1866041.3333333333, ans=0.0 2023-10-15 00:47:53,375 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1866134.6666666667, ans=0.1 2023-10-15 00:48:01,580 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1866181.3333333333, ans=0.125 2023-10-15 00:48:06,281 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.508e+02 1.798e+02 1.964e+02 2.087e+02 2.752e+02, threshold=3.928e+02, percent-clipped=0.0 2023-10-15 00:48:17,971 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1866228.0, ans=0.0 2023-10-15 00:48:20,500 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1866228.0, ans=0.125 2023-10-15 00:48:40,622 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.06 vs. limit=22.5 2023-10-15 00:49:17,127 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1866508.0, ans=0.025 2023-10-15 00:49:22,006 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1866508.0, ans=0.125 2023-10-15 00:49:50,498 INFO [train.py:1031] (0/4) Epoch 30, batch 4000, loss[loss=0.2116, simple_loss=0.3102, pruned_loss=0.05652, over 16675.00 frames. ], tot_loss[loss=0.184, simple_loss=0.2763, pruned_loss=0.04587, over 28362354.36 frames. ], batch size: 202, lr: 1.14e-03, grad_scale: 32.0 2023-10-15 00:49:50,980 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1866648.0, ans=0.125 2023-10-15 00:49:55,158 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-400000.pt 2023-10-15 00:49:59,556 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.516e+02 1.849e+02 1.997e+02 2.108e+02 3.017e+02, threshold=3.994e+02, percent-clipped=0.0 2023-10-15 00:50:08,816 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1866694.6666666667, ans=0.2 2023-10-15 00:50:41,242 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1866834.6666666667, ans=0.125 2023-10-15 00:50:47,572 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1866834.6666666667, ans=0.125 2023-10-15 00:50:57,614 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.71 vs. limit=15.0 2023-10-15 00:51:05,839 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1866928.0, ans=0.125 2023-10-15 00:51:30,274 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.20 vs. 
limit=15.0 2023-10-15 00:51:36,388 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.01 vs. limit=15.0 2023-10-15 00:51:38,315 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1867068.0, ans=0.1 2023-10-15 00:51:40,490 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.44 vs. limit=15.0 2023-10-15 00:51:46,263 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1867114.6666666667, ans=0.1 2023-10-15 00:51:48,137 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1867114.6666666667, ans=0.1 2023-10-15 00:51:48,343 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.60 vs. limit=15.0 2023-10-15 00:51:51,200 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.642e+02 1.959e+02 2.186e+02 2.389e+02 3.914e+02, threshold=4.373e+02, percent-clipped=0.0 2023-10-15 00:51:53,505 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.61 vs. limit=15.0 2023-10-15 00:51:54,108 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1867114.6666666667, ans=0.0 2023-10-15 00:51:54,907 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1867114.6666666667, ans=0.125 2023-10-15 00:52:48,970 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1867348.0, ans=0.125 2023-10-15 00:52:56,443 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1867394.6666666667, ans=0.0 2023-10-15 00:53:08,515 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1867394.6666666667, ans=0.07 2023-10-15 00:53:10,887 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1867441.3333333333, ans=0.1 2023-10-15 00:53:44,856 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1867534.6666666667, ans=0.125 2023-10-15 00:53:48,359 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1867534.6666666667, ans=0.125 2023-10-15 00:53:54,270 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1867581.3333333333, ans=0.125 2023-10-15 00:53:54,802 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.542e+02 1.849e+02 1.983e+02 2.128e+02 3.216e+02, threshold=3.966e+02, percent-clipped=0.0 2023-10-15 00:53:56,210 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1867581.3333333333, ans=0.1 2023-10-15 00:54:39,831 INFO [scaling.py:199] (0/4) ScheduledFloat: 
name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1867768.0, ans=0.125 2023-10-15 00:54:49,816 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1867814.6666666667, ans=0.2 2023-10-15 00:55:09,756 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1867908.0, ans=0.125 2023-10-15 00:55:18,137 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1867954.6666666667, ans=0.125 2023-10-15 00:55:18,655 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1867954.6666666667, ans=6.0 2023-10-15 00:55:22,192 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.58 vs. limit=15.0 2023-10-15 00:55:43,798 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.559e+02 1.871e+02 2.144e+02 2.430e+02 3.741e+02, threshold=4.289e+02, percent-clipped=0.0 2023-10-15 00:56:02,444 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1868141.3333333333, ans=0.0 2023-10-15 00:56:35,185 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1868281.3333333333, ans=0.1 2023-10-15 00:56:38,072 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1868281.3333333333, ans=0.125 2023-10-15 00:56:42,397 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=1868328.0, ans=0.0 2023-10-15 00:56:48,152 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1868328.0, ans=0.125 2023-10-15 00:57:04,434 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1868374.6666666667, ans=0.125 2023-10-15 00:57:10,109 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=16.70 vs. limit=22.5 2023-10-15 00:57:12,033 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=7.83 vs. 
limit=22.5 2023-10-15 00:57:27,482 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1868468.0, ans=0.125 2023-10-15 00:57:27,527 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1868468.0, ans=0.125 2023-10-15 00:57:36,963 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.559e+02 1.867e+02 1.995e+02 2.226e+02 3.194e+02, threshold=3.991e+02, percent-clipped=0.0 2023-10-15 00:58:34,564 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1868701.3333333333, ans=0.125 2023-10-15 00:58:51,034 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1868794.6666666667, ans=0.0 2023-10-15 00:59:08,172 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1868841.3333333333, ans=0.1 2023-10-15 00:59:37,998 INFO [train.py:1031] (0/4) Epoch 30, batch 4500, loss[loss=0.1642, simple_loss=0.2596, pruned_loss=0.03443, over 16989.00 frames. ], tot_loss[loss=0.1839, simple_loss=0.2765, pruned_loss=0.04567, over 29330444.52 frames. ], batch size: 77, lr: 1.14e-03, grad_scale: 16.0 2023-10-15 00:59:42,414 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1868981.3333333333, ans=0.0 2023-10-15 00:59:44,956 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.891e+02 2.120e+02 2.274e+02 3.174e+02, threshold=4.239e+02, percent-clipped=0.0 2023-10-15 01:00:16,037 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1869121.3333333333, ans=0.125 2023-10-15 01:00:25,289 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1869168.0, ans=0.125 2023-10-15 01:00:28,715 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.79 vs. limit=15.0 2023-10-15 01:00:31,146 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1869168.0, ans=0.0 2023-10-15 01:00:41,135 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1869214.6666666667, ans=0.125 2023-10-15 01:01:21,492 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.36 vs. 
limit=22.5 2023-10-15 01:01:28,985 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1869448.0, ans=0.0 2023-10-15 01:01:33,476 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1869448.0, ans=0.0 2023-10-15 01:01:34,067 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.580e+02 1.904e+02 2.030e+02 2.206e+02 2.881e+02, threshold=4.060e+02, percent-clipped=0.0 2023-10-15 01:02:07,847 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1869634.6666666667, ans=0.125 2023-10-15 01:02:35,182 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1869728.0, ans=0.2 2023-10-15 01:02:44,518 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1869774.6666666667, ans=0.125 2023-10-15 01:02:47,437 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1869774.6666666667, ans=0.125 2023-10-15 01:02:58,147 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1869821.3333333333, ans=0.125 2023-10-15 01:03:00,728 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1869821.3333333333, ans=0.2 2023-10-15 01:03:21,293 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1869914.6666666667, ans=0.0 2023-10-15 01:03:22,428 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.586e+02 1.997e+02 2.196e+02 2.444e+02 3.253e+02, threshold=4.392e+02, percent-clipped=0.0 2023-10-15 01:03:29,264 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1869961.3333333333, ans=0.125 2023-10-15 01:03:36,231 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1869961.3333333333, ans=0.125 2023-10-15 01:03:38,175 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1870008.0, ans=0.0 2023-10-15 01:03:53,514 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1870054.6666666667, ans=0.0 2023-10-15 01:03:59,195 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1870101.3333333333, ans=0.0 2023-10-15 01:04:13,097 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1870148.0, ans=0.2 2023-10-15 01:04:24,994 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=1870194.6666666667, ans=0.5 2023-10-15 01:04:26,201 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=1870194.6666666667, ans=0.0 2023-10-15 01:04:26,366 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1870194.6666666667, ans=15.0 2023-10-15 01:04:27,080 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_na.min_abs, 
batch_count=1870194.6666666667, ans=0.02 2023-10-15 01:04:37,626 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.32 vs. limit=15.0 2023-10-15 01:04:38,035 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1870241.3333333333, ans=10.0 2023-10-15 01:04:38,101 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1870241.3333333333, ans=0.125 2023-10-15 01:04:45,118 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1870288.0, ans=0.125 2023-10-15 01:04:52,175 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1870288.0, ans=0.0 2023-10-15 01:05:12,105 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.549e+02 1.827e+02 1.988e+02 2.163e+02 3.045e+02, threshold=3.977e+02, percent-clipped=0.0 2023-10-15 01:05:14,510 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1870381.3333333333, ans=0.125 2023-10-15 01:05:24,694 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=6.19 vs. limit=15.0 2023-10-15 01:05:35,196 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.66 vs. limit=15.0 2023-10-15 01:05:53,139 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1870521.3333333333, ans=0.1 2023-10-15 01:05:55,658 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1870521.3333333333, ans=0.1 2023-10-15 01:06:09,358 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1870614.6666666667, ans=0.0 2023-10-15 01:06:56,850 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.98 vs. limit=15.0 2023-10-15 01:06:57,003 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.38 vs. 
limit=15.0 2023-10-15 01:07:11,080 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.522e+02 1.852e+02 2.029e+02 2.177e+02 2.771e+02, threshold=4.059e+02, percent-clipped=0.0 2023-10-15 01:07:17,486 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1870894.6666666667, ans=0.0 2023-10-15 01:07:29,500 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1870941.3333333333, ans=0.0 2023-10-15 01:07:51,000 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1871034.6666666667, ans=0.125 2023-10-15 01:07:51,027 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1871034.6666666667, ans=0.125 2023-10-15 01:07:54,958 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.16 vs. limit=6.0 2023-10-15 01:07:55,401 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1871034.6666666667, ans=0.2 2023-10-15 01:07:56,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1871034.6666666667, ans=0.0 2023-10-15 01:08:01,962 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=1871081.3333333333, ans=0.0 2023-10-15 01:08:02,165 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.50 vs. limit=22.5 2023-10-15 01:08:18,509 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1871128.0, ans=0.125 2023-10-15 01:08:20,111 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1871128.0, ans=0.125 2023-10-15 01:08:55,081 INFO [train.py:1031] (0/4) Epoch 30, batch 5000, loss[loss=0.1715, simple_loss=0.2725, pruned_loss=0.03525, over 16826.00 frames. ], tot_loss[loss=0.1837, simple_loss=0.276, pruned_loss=0.04565, over 30116570.82 frames. ], batch size: 87, lr: 1.14e-03, grad_scale: 32.0 2023-10-15 01:08:57,325 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-10-15 01:09:01,485 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.576e+02 1.886e+02 2.036e+02 2.222e+02 3.544e+02, threshold=4.073e+02, percent-clipped=0.0 2023-10-15 01:09:07,392 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=1871361.3333333333, ans=0.125 2023-10-15 01:09:31,367 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1871454.6666666667, ans=0.125 2023-10-15 01:09:52,050 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.33 vs. 
limit=15.0 2023-10-15 01:10:10,351 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1871594.6666666667, ans=0.2 2023-10-15 01:10:20,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1871641.3333333333, ans=0.125 2023-10-15 01:10:27,143 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1871688.0, ans=0.1 2023-10-15 01:10:39,231 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1871734.6666666667, ans=0.125 2023-10-15 01:10:54,370 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.28 vs. limit=15.0 2023-10-15 01:10:56,008 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1871781.3333333333, ans=10.0 2023-10-15 01:10:57,066 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.573e+02 1.965e+02 2.173e+02 2.444e+02 4.363e+02, threshold=4.346e+02, percent-clipped=1.0 2023-10-15 01:11:15,815 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1871874.6666666667, ans=0.0 2023-10-15 01:11:36,909 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1871968.0, ans=0.125 2023-10-15 01:11:49,116 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=1872014.6666666667, ans=0.05 2023-10-15 01:11:49,516 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.09 vs. limit=15.0 2023-10-15 01:12:07,370 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1872108.0, ans=0.0 2023-10-15 01:12:07,648 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.63 vs. 
limit=12.0 2023-10-15 01:12:27,568 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1872201.3333333333, ans=0.025 2023-10-15 01:12:38,920 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1872248.0, ans=0.2 2023-10-15 01:12:45,193 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.558e+02 1.873e+02 1.998e+02 2.198e+02 2.761e+02, threshold=3.997e+02, percent-clipped=0.0 2023-10-15 01:12:48,016 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1872294.6666666667, ans=0.1 2023-10-15 01:13:02,224 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1872341.3333333333, ans=0.1 2023-10-15 01:13:09,468 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1872388.0, ans=0.0 2023-10-15 01:13:12,036 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1872388.0, ans=0.1 2023-10-15 01:13:22,758 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1872434.6666666667, ans=0.0 2023-10-15 01:13:29,428 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1872434.6666666667, ans=0.125 2023-10-15 01:13:35,257 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0 2023-10-15 01:13:44,498 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.96 vs. limit=10.0 2023-10-15 01:13:55,289 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1872528.0, ans=0.0 2023-10-15 01:13:56,804 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1872574.6666666667, ans=0.09899494936611666 2023-10-15 01:14:06,434 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1872574.6666666667, ans=0.0 2023-10-15 01:14:24,011 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1872668.0, ans=0.0 2023-10-15 01:14:25,070 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1872668.0, ans=0.0 2023-10-15 01:14:27,511 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.15 vs. limit=15.0 2023-10-15 01:14:46,765 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.599e+02 1.856e+02 1.995e+02 2.179e+02 3.417e+02, threshold=3.990e+02, percent-clipped=0.0 2023-10-15 01:15:11,075 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1872808.0, ans=0.1 2023-10-15 01:15:11,501 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.54 vs. 
limit=12.0 2023-10-15 01:15:36,102 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.68 vs. limit=15.0 2023-10-15 01:16:52,204 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1873181.3333333333, ans=0.125 2023-10-15 01:17:02,704 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.434e+02 1.797e+02 1.984e+02 2.156e+02 2.674e+02, threshold=3.968e+02, percent-clipped=0.0 2023-10-15 01:17:20,835 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1873274.6666666667, ans=0.125 2023-10-15 01:17:22,563 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1873274.6666666667, ans=0.0 2023-10-15 01:17:35,999 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1873321.3333333333, ans=0.125 2023-10-15 01:17:51,992 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1873414.6666666667, ans=0.1 2023-10-15 01:17:54,231 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1873414.6666666667, ans=0.1 2023-10-15 01:18:02,316 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.11 vs. limit=22.5 2023-10-15 01:18:26,028 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1873508.0, ans=0.0 2023-10-15 01:18:30,125 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.22 vs. limit=12.0 2023-10-15 01:18:36,766 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=1873554.6666666667, ans=0.125 2023-10-15 01:18:39,746 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1873601.3333333333, ans=0.125 2023-10-15 01:18:51,461 INFO [train.py:1031] (0/4) Epoch 30, batch 5500, loss[loss=0.1823, simple_loss=0.277, pruned_loss=0.04379, over 16874.00 frames. ], tot_loss[loss=0.1835, simple_loss=0.2759, pruned_loss=0.04555, over 30717622.41 frames. 
], batch size: 110, lr: 1.14e-03, grad_scale: 16.0 2023-10-15 01:18:56,989 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1873648.0, ans=0.125 2023-10-15 01:19:00,818 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.856e+02 2.021e+02 2.243e+02 2.879e+02, threshold=4.043e+02, percent-clipped=0.0 2023-10-15 01:19:05,296 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1873694.6666666667, ans=0.2 2023-10-15 01:19:33,819 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1873788.0, ans=0.1 2023-10-15 01:19:35,785 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1873788.0, ans=0.0 2023-10-15 01:19:54,930 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1873881.3333333333, ans=0.125 2023-10-15 01:20:27,066 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1874021.3333333333, ans=0.125 2023-10-15 01:20:27,218 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1874021.3333333333, ans=0.125 2023-10-15 01:21:00,202 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.519e+02 1.817e+02 1.985e+02 2.232e+02 3.003e+02, threshold=3.969e+02, percent-clipped=0.0 2023-10-15 01:21:02,642 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1874161.3333333333, ans=0.125 2023-10-15 01:21:09,171 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1874161.3333333333, ans=0.125 2023-10-15 01:21:10,222 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1874161.3333333333, ans=0.0 2023-10-15 01:21:29,405 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1874254.6666666667, ans=0.125 2023-10-15 01:21:29,429 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=5.927e-02 2023-10-15 01:21:48,324 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1874301.3333333333, ans=0.0 2023-10-15 01:22:18,845 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1874441.3333333333, ans=0.0 2023-10-15 01:22:25,413 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1874441.3333333333, ans=0.0 2023-10-15 01:22:33,365 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1874488.0, ans=0.125 2023-10-15 01:22:39,106 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.27 vs. 
limit=10.0 2023-10-15 01:22:55,500 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1874581.3333333333, ans=0.0 2023-10-15 01:23:05,873 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.541e+02 1.889e+02 2.052e+02 2.253e+02 2.995e+02, threshold=4.104e+02, percent-clipped=0.0 2023-10-15 01:24:12,289 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.73 vs. limit=15.0 2023-10-15 01:24:15,429 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1874861.3333333333, ans=0.04949747468305833 2023-10-15 01:24:22,534 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1874908.0, ans=0.1 2023-10-15 01:24:31,764 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1874908.0, ans=0.0 2023-10-15 01:24:33,440 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1874954.6666666667, ans=0.1 2023-10-15 01:24:38,556 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=1874954.6666666667, ans=10.0 2023-10-15 01:24:43,702 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1874954.6666666667, ans=0.0 2023-10-15 01:24:49,695 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1875001.3333333333, ans=0.2 2023-10-15 01:24:57,809 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1875001.3333333333, ans=0.5 2023-10-15 01:25:09,555 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1875048.0, ans=0.125 2023-10-15 01:25:09,958 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.78 vs. 
limit=15.0 2023-10-15 01:25:11,464 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.590e+02 1.860e+02 1.996e+02 2.141e+02 2.689e+02, threshold=3.991e+02, percent-clipped=0.0 2023-10-15 01:25:22,681 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1875094.6666666667, ans=0.07 2023-10-15 01:25:23,531 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-15 01:25:24,637 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1875141.3333333333, ans=0.0 2023-10-15 01:25:30,784 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1875141.3333333333, ans=15.0 2023-10-15 01:26:05,470 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1875234.6666666667, ans=0.1 2023-10-15 01:26:38,113 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1875374.6666666667, ans=0.125 2023-10-15 01:26:55,887 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1875468.0, ans=0.1 2023-10-15 01:27:04,007 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1875468.0, ans=0.0 2023-10-15 01:27:06,310 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1875468.0, ans=0.125 2023-10-15 01:27:19,779 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.791e+02 1.995e+02 2.136e+02 2.610e+02, threshold=3.990e+02, percent-clipped=0.0 2023-10-15 01:27:20,259 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1875561.3333333333, ans=0.2 2023-10-15 01:27:39,608 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1875608.0, ans=0.05 2023-10-15 01:27:59,888 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1875701.3333333333, ans=0.125 2023-10-15 01:28:06,286 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1875701.3333333333, ans=0.0 2023-10-15 01:28:19,597 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.05 vs. limit=15.0 2023-10-15 01:28:50,487 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1875888.0, ans=0.125 2023-10-15 01:29:00,648 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1875934.6666666667, ans=0.125 2023-10-15 01:29:11,024 INFO [train.py:1031] (0/4) Epoch 30, batch 6000, loss[loss=0.1877, simple_loss=0.2795, pruned_loss=0.04798, over 16322.00 frames. ], tot_loss[loss=0.1842, simple_loss=0.2764, pruned_loss=0.04598, over 31187494.26 frames. 
2023-10-15 01:29:24,285 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 1.911e+02 2.122e+02 2.327e+02 3.056e+02, threshold=4.245e+02, percent-clipped=0.0
2023-10-15 01:30:03,983 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1876168.0, ans=0.125
2023-10-15 01:30:12,841 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.29 vs. limit=15.0
2023-10-15 01:30:28,998 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1876261.3333333333, ans=0.125
2023-10-15 01:31:22,808 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1876448.0, ans=0.2
2023-10-15 01:31:25,455 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.622e+02 1.915e+02 2.131e+02 2.443e+02 3.249e+02, threshold=4.263e+02, percent-clipped=0.0
2023-10-15 01:31:27,272 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1876494.6666666667, ans=0.125
2023-10-15 01:31:35,096 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1876494.6666666667, ans=0.125
2023-10-15 01:31:44,255 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1876541.3333333333, ans=0.0
2023-10-15 01:31:50,007 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1876588.0, ans=0.2
2023-10-15 01:32:07,869 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1876634.6666666667, ans=0.125
2023-10-15 01:32:14,387 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1876681.3333333333, ans=0.2
2023-10-15 01:32:48,277 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-15 01:33:20,141 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1876914.6666666667, ans=0.1
2023-10-15 01:33:23,425 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.65 vs. limit=22.5
2023-10-15 01:33:26,818 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.556e+02 1.896e+02 2.061e+02 2.219e+02 4.062e+02, threshold=4.122e+02, percent-clipped=0.0
2023-10-15 01:34:24,764 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.14 vs. limit=22.5
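Aside on the optim.py:471 entries above: each one prints five grad-norm quantiles and a threshold that sits at roughly Clipping_scale times the logged median (e.g. 2.0 * 2.122e+02 = 4.245e+02 in the entry at 01:29:24). A minimal Python sketch of median-keyed clipping consistent with those numbers follows; the function and buffer names are hypothetical, not icefall's actual optim.py API.

import torch

def clip_gradients(params, norm_history, clipping_scale=2.0, window=200):
    # Hypothetical sketch (not icefall's optim.py code): clip the global grad
    # norm against clipping_scale times the median of recently seen norms,
    # the relationship the quartile lines in this log exhibit.
    grads = [p.grad for p in params if p.grad is not None]
    total_norm = torch.norm(torch.stack([g.norm() for g in grads]))
    norm_history.append(total_norm.item())
    del norm_history[:-window]  # keep only a sliding window of recent norms
    hist = torch.tensor(norm_history)
    quantiles = hist.quantile(torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * quantiles[2]  # 2.0 * median
    if total_norm > threshold:
        for g in grads:
            g.mul_(threshold / total_norm)  # scale all gradients down in place
    return quantiles, threshold

With percent-clipped=0.0 almost everywhere in this stretch, the observed norms stay under that 2x-median threshold, which is what a well-settled run at epoch 30 would look like under this scheme.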
2023-10-15 01:34:36,384 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1877194.6666666667, ans=0.0
2023-10-15 01:34:56,022 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1877288.0, ans=0.0
2023-10-15 01:35:17,020 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1877334.6666666667, ans=0.09899494936611666
2023-10-15 01:35:26,857 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1877381.3333333333, ans=0.125
2023-10-15 01:35:34,016 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.594e+02 1.867e+02 2.051e+02 2.298e+02 3.159e+02, threshold=4.102e+02, percent-clipped=0.0
2023-10-15 01:35:37,119 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1877428.0, ans=0.125
2023-10-15 01:36:03,244 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1877521.3333333333, ans=0.1
2023-10-15 01:36:21,005 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1877568.0, ans=0.0
2023-10-15 01:36:27,674 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1877614.6666666667, ans=0.2
2023-10-15 01:36:32,482 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.05 vs. limit=15.0
2023-10-15 01:36:53,538 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1877661.3333333333, ans=0.125
2023-10-15 01:37:43,764 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.96 vs. limit=6.0
2023-10-15 01:37:50,632 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.831e+02 1.986e+02 2.130e+02 3.000e+02, threshold=3.972e+02, percent-clipped=0.0
2023-10-15 01:38:23,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1877988.0, ans=0.125
2023-10-15 01:38:28,478 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1878034.6666666667, ans=0.04949747468305833
2023-10-15 01:38:29,708 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.13 vs. limit=15.0
2023-10-15 01:38:37,989 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1878034.6666666667, ans=0.125
2023-10-15 01:38:47,231 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1878081.3333333333, ans=0.125
2023-10-15 01:38:59,675 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1878128.0, ans=0.2
2023-10-15 01:39:02,047 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-15 01:39:08,570 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1878174.6666666667, ans=0.125
2023-10-15 01:39:27,188 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-15 01:39:47,745 INFO [train.py:1031] (0/4) Epoch 30, batch 6500, loss[loss=0.1932, simple_loss=0.2824, pruned_loss=0.05198, over 16005.00 frames. ], tot_loss[loss=0.1846, simple_loss=0.277, pruned_loss=0.04614, over 31548954.66 frames. ], batch size: 296, lr: 1.14e-03, grad_scale: 16.0
2023-10-15 01:39:52,428 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.35 vs. limit=22.5
2023-10-15 01:39:54,182 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1878314.6666666667, ans=0.1
2023-10-15 01:40:02,094 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1878314.6666666667, ans=0.1
2023-10-15 01:40:06,174 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.589e+02 1.962e+02 2.214e+02 2.471e+02 3.623e+02, threshold=4.428e+02, percent-clipped=0.0
2023-10-15 01:41:18,362 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1878594.6666666667, ans=0.125
2023-10-15 01:41:34,844 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1878641.3333333333, ans=0.0
2023-10-15 01:42:20,052 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.21 vs. limit=22.5
2023-10-15 01:42:20,266 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.533e+02 1.864e+02 1.983e+02 2.198e+02 3.063e+02, threshold=3.967e+02, percent-clipped=0.0
2023-10-15 01:42:24,778 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1878828.0, ans=0.0
2023-10-15 01:42:44,335 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.75 vs. limit=10.0
2023-10-15 01:42:48,037 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1878921.3333333333, ans=0.125
2023-10-15 01:43:30,655 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1879108.0, ans=0.5
2023-10-15 01:43:31,841 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.53 vs. limit=15.0
2023-10-15 01:44:08,994 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1879248.0, ans=0.0
2023-10-15 01:44:23,331 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.636e+02 1.866e+02 2.013e+02 2.338e+02 3.143e+02, threshold=4.026e+02, percent-clipped=0.0
2023-10-15 01:44:34,384 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.27 vs. limit=22.5
2023-10-15 01:44:38,102 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1879341.3333333333, ans=0.125
2023-10-15 01:44:43,872 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1879341.3333333333, ans=0.0
2023-10-15 01:44:47,727 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1879388.0, ans=0.0
2023-10-15 01:44:51,055 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.48 vs. limit=15.0
2023-10-15 01:45:26,986 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1879528.0, ans=0.125
2023-10-15 01:45:31,383 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1879528.0, ans=10.0
2023-10-15 01:45:36,335 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1879528.0, ans=0.125
2023-10-15 01:45:47,321 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1879574.6666666667, ans=0.0
2023-10-15 01:45:50,946 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1879621.3333333333, ans=0.125
2023-10-15 01:45:50,952 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1879621.3333333333, ans=0.125
2023-10-15 01:46:06,317 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.12 vs. limit=22.5
2023-10-15 01:46:32,753 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1879714.6666666667, ans=0.0
2023-10-15 01:46:38,733 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.838e+02 2.004e+02 2.365e+02 4.217e+02, threshold=4.008e+02, percent-clipped=1.0
2023-10-15 01:46:44,055 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.63 vs. limit=6.0
2023-10-15 01:46:57,506 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.66 vs. limit=15.0
2023-10-15 01:47:23,944 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1879901.3333333333, ans=0.125
2023-10-15 01:47:48,501 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.98 vs. limit=15.0
2023-10-15 01:47:51,312 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_ff2.min_abs, batch_count=1879994.6666666667, ans=0.1
2023-10-15 01:47:54,229 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1879994.6666666667, ans=0.1
2023-10-15 01:48:05,026 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.40 vs. limit=10.0
2023-10-15 01:48:10,741 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1880041.3333333333, ans=0.0
2023-10-15 01:48:22,492 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1880088.0, ans=0.125
2023-10-15 01:48:30,196 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1880134.6666666667, ans=0.0
2023-10-15 01:48:36,793 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1880134.6666666667, ans=0.0
2023-10-15 01:48:48,402 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.26 vs. limit=15.0
2023-10-15 01:48:56,603 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1880228.0, ans=0.125
2023-10-15 01:48:58,578 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.543e+02 1.751e+02 1.872e+02 2.083e+02 3.635e+02, threshold=3.744e+02, percent-clipped=0.0
2023-10-15 01:48:59,398 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.25 vs. limit=15.0
2023-10-15 01:48:59,993 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1880228.0, ans=0.035
2023-10-15 01:49:17,046 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1880274.6666666667, ans=0.125
2023-10-15 01:49:27,067 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.35 vs. limit=22.5
2023-10-15 01:49:34,830 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1880321.3333333333, ans=0.0
2023-10-15 01:49:40,825 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1880368.0, ans=0.125
2023-10-15 01:49:45,687 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1880368.0, ans=0.125
2023-10-15 01:49:45,695 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1880368.0, ans=0.0
2023-10-15 01:49:55,594 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1880414.6666666667, ans=0.125
2023-10-15 01:50:29,062 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.04 vs. limit=15.0
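Aside on the scaling.py:199 entries: each prints a ScheduledFloat's current value (ans=...) for the given batch_count. Assuming the value is interpolated piecewise-linearly between (batch_count, value) breakpoints, a plausible reading of how these numbers evolve over a run, a small sketch might look like the following; the class name and breakpoints are illustrative, not the actual scaling.py implementation.

class ScheduledFloatSketch:
    # Illustrative stand-in for a scheduled hyperparameter: a float that
    # interpolates linearly between (batch_count, value) breakpoints and
    # stays constant past the last one.
    def __init__(self, *points):
        self.points = sorted(points)

    def value(self, batch_count):
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if batch_count <= x1:
                # linear interpolation inside the segment [x0, x1]
                return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)
        return pts[-1][1]  # constant after the final breakpoint

skip_rate = ScheduledFloatSketch((0.0, 0.2), (4000.0, 0.0))  # hypothetical breakpoints
print(skip_rate.value(1880134.67))  # 0.0, matching the long-since-decayed skip-rate entries above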
2023-10-15 01:50:41,597 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.83 vs. limit=15.0
2023-10-15 01:50:49,706 INFO [train.py:1031] (0/4) Epoch 30, batch 7000, loss[loss=0.1798, simple_loss=0.272, pruned_loss=0.04377, over 16917.00 frames. ], tot_loss[loss=0.1848, simple_loss=0.2774, pruned_loss=0.04611, over 31834817.03 frames. ], batch size: 138, lr: 1.14e-03, grad_scale: 16.0
2023-10-15 01:51:01,596 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1880694.6666666667, ans=0.0
2023-10-15 01:51:06,306 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.545e+02 1.909e+02 2.083e+02 2.272e+02 2.871e+02, threshold=4.166e+02, percent-clipped=0.0
2023-10-15 01:51:12,111 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1880694.6666666667, ans=0.125
2023-10-15 01:51:27,041 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1880741.3333333333, ans=0.0
2023-10-15 01:51:45,037 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1880834.6666666667, ans=0.2
2023-10-15 01:51:57,834 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1880881.3333333333, ans=0.0
2023-10-15 01:52:09,599 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.74 vs. limit=6.0
2023-10-15 01:52:20,914 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1880974.6666666667, ans=0.0
2023-10-15 01:52:29,430 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.36 vs. limit=15.0
2023-10-15 01:52:51,170 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1881114.6666666667, ans=0.125
2023-10-15 01:52:57,080 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.62 vs. limit=15.0
2023-10-15 01:53:02,681 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.564e+02 1.864e+02 1.989e+02 2.255e+02 2.619e+02, threshold=3.977e+02, percent-clipped=0.0
2023-10-15 01:53:10,028 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1881161.3333333333, ans=0.2
2023-10-15 01:53:26,111 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1881208.0, ans=0.125
2023-10-15 01:53:35,925 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.61 vs. limit=15.0
2023-10-15 01:53:42,273 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1881301.3333333333, ans=0.125
2023-10-15 01:53:50,656 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1881301.3333333333, ans=0.0
2023-10-15 01:54:14,390 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1881394.6666666667, ans=0.1
2023-10-15 01:54:35,441 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1881488.0, ans=0.125
2023-10-15 01:54:57,101 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1881534.6666666667, ans=0.2
2023-10-15 01:55:23,878 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.565e+02 1.878e+02 2.034e+02 2.237e+02 2.721e+02, threshold=4.069e+02, percent-clipped=0.0
2023-10-15 01:55:59,563 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1881721.3333333333, ans=0.0
2023-10-15 01:56:10,537 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1881721.3333333333, ans=0.125
2023-10-15 01:56:35,530 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1881814.6666666667, ans=0.0
2023-10-15 01:56:37,055 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.47 vs. limit=5.0
2023-10-15 01:56:41,031 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1881861.3333333333, ans=0.125
2023-10-15 01:56:55,650 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1881908.0, ans=0.125
2023-10-15 01:56:56,502 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1881908.0, ans=0.05
2023-10-15 01:57:11,805 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1881954.6666666667, ans=0.125
2023-10-15 01:57:22,919 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1882001.3333333333, ans=0.125
2023-10-15 01:57:23,810 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1882001.3333333333, ans=0.125
2023-10-15 01:57:34,188 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1882048.0, ans=0.025
2023-10-15 01:57:43,603 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.576e+02 1.755e+02 2.001e+02 2.179e+02 2.866e+02, threshold=4.003e+02, percent-clipped=0.0
2023-10-15 01:57:44,798 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1882094.6666666667, ans=0.1
2023-10-15 01:57:56,884 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1882141.3333333333, ans=0.0
2023-10-15 01:57:57,222 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.20 vs. limit=15.0
2023-10-15 01:57:58,701 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1882141.3333333333, ans=0.0
2023-10-15 01:58:15,992 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0
2023-10-15 01:58:21,843 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1882234.6666666667, ans=0.07
2023-10-15 01:58:34,928 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.46 vs. limit=6.0
2023-10-15 01:58:41,472 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1882281.3333333333, ans=0.1
2023-10-15 01:59:17,537 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1882421.3333333333, ans=0.125
2023-10-15 01:59:26,743 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1882421.3333333333, ans=0.1
2023-10-15 01:59:46,168 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1882514.6666666667, ans=0.125
2023-10-15 02:00:03,924 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.422e+02 1.910e+02 2.113e+02 2.462e+02 3.424e+02, threshold=4.227e+02, percent-clipped=0.0
2023-10-15 02:00:20,059 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1882608.0, ans=0.125
2023-10-15 02:00:40,758 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1882701.3333333333, ans=0.1
2023-10-15 02:00:48,057 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1882748.0, ans=0.125
2023-10-15 02:00:52,503 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=1882748.0, ans=0.0
2023-10-15 02:01:00,325 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.03 vs. limit=15.0
2023-10-15 02:01:02,150 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1882794.6666666667, ans=0.2
2023-10-15 02:01:04,221 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1882794.6666666667, ans=0.2
2023-10-15 02:01:10,394 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1882841.3333333333, ans=0.125
2023-10-15 02:01:15,045 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1882841.3333333333, ans=0.07
2023-10-15 02:01:15,880 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1882841.3333333333, ans=0.1
2023-10-15 02:01:23,281 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1882888.0, ans=0.125
2023-10-15 02:01:24,182 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1882888.0, ans=0.0
2023-10-15 02:01:47,561 INFO [train.py:1031] (0/4) Epoch 30, batch 7500, loss[loss=0.1978, simple_loss=0.2902, pruned_loss=0.05272, over 16855.00 frames. ], tot_loss[loss=0.1846, simple_loss=0.2772, pruned_loss=0.04606, over 32037097.31 frames. ], batch size: 188, lr: 1.14e-03, grad_scale: 16.0
2023-10-15 02:01:54,766 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1882981.3333333333, ans=0.2
2023-10-15 02:02:02,284 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.507e+02 1.922e+02 2.086e+02 2.342e+02 3.697e+02, threshold=4.171e+02, percent-clipped=0.0
2023-10-15 02:02:28,418 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1883121.3333333333, ans=0.2
2023-10-15 02:02:33,189 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1883121.3333333333, ans=0.125
2023-10-15 02:02:39,365 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1883168.0, ans=0.125
2023-10-15 02:02:39,442 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1883168.0, ans=0.1
2023-10-15 02:03:28,307 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1883354.6666666667, ans=0.2
2023-10-15 02:03:30,540 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.17 vs. limit=10.0
2023-10-15 02:03:35,094 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1883401.3333333333, ans=0.0
2023-10-15 02:03:43,123 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1883401.3333333333, ans=0.0
2023-10-15 02:03:45,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1883401.3333333333, ans=0.0
2023-10-15 02:03:51,668 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=1883448.0, ans=0.125
2023-10-15 02:03:55,813 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1883448.0, ans=0.2
2023-10-15 02:04:04,078 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1883494.6666666667, ans=15.0
2023-10-15 02:04:07,141 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.483e+02 1.851e+02 2.004e+02 2.113e+02 3.812e+02, threshold=4.007e+02, percent-clipped=0.0
2023-10-15 02:04:33,988 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=11.87 vs. limit=22.5
2023-10-15 02:04:42,916 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1883588.0, ans=0.125
2023-10-15 02:04:50,968 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1883588.0, ans=0.1
2023-10-15 02:05:24,020 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1883681.3333333333, ans=0.125
2023-10-15 02:05:35,812 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1883728.0, ans=0.125
2023-10-15 02:05:42,965 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.39 vs. limit=22.5
2023-10-15 02:05:50,393 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1883774.6666666667, ans=0.0
2023-10-15 02:05:55,827 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1883821.3333333333, ans=0.125
2023-10-15 02:05:59,530 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.71 vs. limit=12.0
2023-10-15 02:06:08,155 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=1883868.0, ans=0.125
2023-10-15 02:06:08,186 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=1883868.0, ans=0.05
2023-10-15 02:06:11,669 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1883868.0, ans=0.125
2023-10-15 02:06:19,944 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.11 vs. limit=22.5
2023-10-15 02:06:24,959 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.38 vs. limit=12.0
2023-10-15 02:06:26,166 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.18 vs. limit=15.0
2023-10-15 02:06:33,948 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1883961.3333333333, ans=0.125
2023-10-15 02:06:34,576 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.471e+02 1.871e+02 2.111e+02 2.358e+02 3.300e+02, threshold=4.222e+02, percent-clipped=0.0
2023-10-15 02:06:40,997 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.14 vs. limit=12.0
2023-10-15 02:07:00,985 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1884054.6666666667, ans=0.125
2023-10-15 02:07:04,269 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1884054.6666666667, ans=0.0
2023-10-15 02:07:06,240 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1884101.3333333333, ans=0.035
2023-10-15 02:07:17,581 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.55 vs. limit=22.5
2023-10-15 02:07:21,936 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1884148.0, ans=0.1
2023-10-15 02:07:48,053 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1884241.3333333333, ans=0.1
2023-10-15 02:07:49,131 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1884241.3333333333, ans=0.125
2023-10-15 02:07:53,127 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1884288.0, ans=0.125
2023-10-15 02:07:58,844 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1884288.0, ans=0.125
2023-10-15 02:08:11,813 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1884334.6666666667, ans=0.0
2023-10-15 02:08:33,732 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.627e+02 1.966e+02 2.143e+02 2.384e+02 3.213e+02, threshold=4.286e+02, percent-clipped=0.0
2023-10-15 02:08:35,948 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1884428.0, ans=0.125
2023-10-15 02:08:43,570 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1884474.6666666667, ans=0.1
2023-10-15 02:08:43,695 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1884474.6666666667, ans=0.125
2023-10-15 02:09:13,202 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1884568.0, ans=0.2
2023-10-15 02:09:13,325 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1884568.0, ans=0.1
2023-10-15 02:09:16,683 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1884568.0, ans=0.0
2023-10-15 02:09:26,821 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1884614.6666666667, ans=0.125
2023-10-15 02:10:54,636 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.571e+02 1.840e+02 1.982e+02 2.150e+02 2.940e+02, threshold=3.963e+02, percent-clipped=0.0
2023-10-15 02:11:02,286 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=1884894.6666666667, ans=0.125
2023-10-15 02:11:05,158 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1884941.3333333333, ans=0.0
2023-10-15 02:11:21,901 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-15 02:11:31,653 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1885034.6666666667, ans=0.0
2023-10-15 02:11:33,920 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1885034.6666666667, ans=0.035
2023-10-15 02:11:35,123 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1885034.6666666667, ans=0.09899494936611666
2023-10-15 02:11:40,291 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1885034.6666666667, ans=0.1
2023-10-15 02:11:42,727 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1885081.3333333333, ans=0.0
2023-10-15 02:12:14,755 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1885174.6666666667, ans=0.0
2023-10-15 02:12:21,907 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=4.81 vs. limit=15.0
2023-10-15 02:12:45,449 INFO [train.py:1031] (0/4) Epoch 30, batch 8000, loss[loss=0.1621, simple_loss=0.2655, pruned_loss=0.02934, over 16815.00 frames. ], tot_loss[loss=0.184, simple_loss=0.2767, pruned_loss=0.04568, over 32200468.28 frames. ], batch size: 98, lr: 1.14e-03, grad_scale: 32.0
2023-10-15 02:12:58,661 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1885314.6666666667, ans=0.125
2023-10-15 02:13:00,387 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1885314.6666666667, ans=0.125
2023-10-15 02:13:07,459 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.542e+02 1.740e+02 1.939e+02 2.204e+02 3.324e+02, threshold=3.878e+02, percent-clipped=0.0
2023-10-15 02:13:39,104 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.36 vs. limit=15.0
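Aside on reading the trends: the train.py:1031 entries carry the running averages (tot_loss) and the optim.py:471 entries the clipping thresholds, so both can be pulled out of the raw log with two regexes. A minimal sketch follows; it assumes only the line layouts visible in this log, and the path name is hypothetical.

import re

tot_loss_re = re.compile(r"tot_loss\[loss=([\d.]+), simple_loss=([\d.]+), pruned_loss=([\d.]+)")
thresh_re = re.compile(r"threshold=([\d.]+e\+\d+)")

def scan(log_path="zipformer/exp_XL_bpe/train.log"):  # hypothetical path
    losses, thresholds = [], []
    with open(log_path) as f:
        for line in f:
            m = tot_loss_re.search(line)
            if m:
                # (loss, simple_loss, pruned_loss) running averages
                losses.append(tuple(float(x) for x in m.groups()))
            m = thresh_re.search(line)
            if m:
                thresholds.append(float(m.group(1)))
    return losses, thresholds

Applied to the stretch above, this would show tot_loss easing from 0.1842 at batch 6000 to 0.184 at batch 8000 while the clipping threshold hovers around 4e+02, i.e. a run that has essentially converged.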
2023-10-15 02:13:49,155 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1885501.3333333333, ans=0.125
2023-10-15 02:14:20,495 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1885641.3333333333, ans=0.1
2023-10-15 02:14:36,755 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.56 vs. limit=22.5
2023-10-15 02:14:56,181 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1885781.3333333333, ans=0.125
2023-10-15 02:15:11,886 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.538e+02 1.733e+02 1.945e+02 2.110e+02 3.210e+02, threshold=3.890e+02, percent-clipped=0.0
2023-10-15 02:15:23,365 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1885874.6666666667, ans=0.125
2023-10-15 02:15:35,714 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1885921.3333333333, ans=0.1
2023-10-15 02:15:52,924 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1885968.0, ans=0.125
2023-10-15 02:15:55,388 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.38 vs. limit=22.5
2023-10-15 02:16:05,196 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.70 vs. limit=15.0
2023-10-15 02:16:14,903 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1886014.6666666667, ans=0.2
2023-10-15 02:16:31,189 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1886061.3333333333, ans=0.125
2023-10-15 02:16:49,394 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1886108.0, ans=0.0
2023-10-15 02:16:56,308 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1886154.6666666667, ans=0.09899494936611666
2023-10-15 02:16:58,068 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1886154.6666666667, ans=0.125
2023-10-15 02:17:05,474 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1886154.6666666667, ans=0.125
2023-10-15 02:17:09,859 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1886201.3333333333, ans=0.125
2023-10-15 02:17:20,474 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1886248.0, ans=0.0
2023-10-15 02:17:36,664 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.614e+02 1.837e+02 1.970e+02 2.156e+02 2.956e+02, threshold=3.941e+02, percent-clipped=0.0
2023-10-15 02:17:57,708 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=1886388.0, ans=0.02
2023-10-15 02:18:02,465 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.42 vs. limit=10.0
2023-10-15 02:18:14,139 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1886434.6666666667, ans=0.0
2023-10-15 02:18:17,296 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.95 vs. limit=15.0
2023-10-15 02:18:38,705 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.51 vs. limit=22.5
2023-10-15 02:18:59,542 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1886621.3333333333, ans=0.125
2023-10-15 02:19:04,937 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1886621.3333333333, ans=0.1
2023-10-15 02:19:09,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1886668.0, ans=0.125
2023-10-15 02:19:33,162 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1886761.3333333333, ans=0.125
2023-10-15 02:19:38,419 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.566e+02 1.821e+02 1.980e+02 2.148e+02 2.727e+02, threshold=3.959e+02, percent-clipped=0.0
2023-10-15 02:19:56,616 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=1886808.0, ans=0.125
2023-10-15 02:20:01,310 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1886854.6666666667, ans=0.035
2023-10-15 02:20:13,239 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1886901.3333333333, ans=0.125
2023-10-15 02:20:23,969 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.82 vs. limit=6.0
2023-10-15 02:20:30,803 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-15 02:20:44,972 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.31 vs. limit=15.0
2023-10-15 02:20:49,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1886994.6666666667, ans=0.2
2023-10-15 02:21:43,437 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1887181.3333333333, ans=0.2
2023-10-15 02:21:43,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1887181.3333333333, ans=0.2
2023-10-15 02:21:43,485 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.min_positive, batch_count=1887181.3333333333, ans=0.025
2023-10-15 02:21:58,357 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1887181.3333333333, ans=0.125
2023-10-15 02:22:01,589 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1887228.0, ans=0.125
2023-10-15 02:22:03,739 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.524e+02 1.862e+02 2.020e+02 2.213e+02 2.968e+02, threshold=4.041e+02, percent-clipped=0.0
2023-10-15 02:22:04,171 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1887228.0, ans=0.1
2023-10-15 02:22:04,313 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1887228.0, ans=0.125
2023-10-15 02:22:05,714 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=12.0
2023-10-15 02:22:16,869 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-15 02:22:31,099 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1887321.3333333333, ans=0.0
2023-10-15 02:22:46,983 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.18 vs. limit=15.0
2023-10-15 02:22:47,798 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1887368.0, ans=0.1
2023-10-15 02:22:52,624 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1887414.6666666667, ans=0.125
2023-10-15 02:23:11,970 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1887508.0, ans=0.125
2023-10-15 02:23:11,970 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1887508.0, ans=0.1
2023-10-15 02:23:14,178 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.63 vs. limit=15.0
2023-10-15 02:23:37,357 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.79 vs. limit=15.0
2023-10-15 02:23:40,083 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1887601.3333333333, ans=0.125
2023-10-15 02:23:51,245 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.19 vs. limit=15.0
2023-10-15 02:23:54,978 INFO [train.py:1031] (0/4) Epoch 30, batch 8500, loss[loss=0.1873, simple_loss=0.2736, pruned_loss=0.05054, over 16527.00 frames. ], tot_loss[loss=0.184, simple_loss=0.277, pruned_loss=0.04553, over 32346744.08 frames. ], batch size: 56, lr: 1.14e-03, grad_scale: 32.0
2023-10-15 02:24:09,130 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.565e+02 1.846e+02 2.000e+02 2.257e+02 2.817e+02, threshold=4.000e+02, percent-clipped=0.0
2023-10-15 02:24:31,205 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=4.88 vs. limit=15.0
2023-10-15 02:24:37,815 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1887788.0, ans=0.125
2023-10-15 02:24:43,812 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1887834.6666666667, ans=0.125
2023-10-15 02:25:32,494 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1888021.3333333333, ans=0.0
2023-10-15 02:25:36,756 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1888021.3333333333, ans=0.0
2023-10-15 02:26:25,898 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1888161.3333333333, ans=0.0
2023-10-15 02:26:27,552 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.654e+02 1.900e+02 2.108e+02 2.381e+02 3.088e+02, threshold=4.216e+02, percent-clipped=0.0
2023-10-15 02:26:30,315 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1888161.3333333333, ans=0.0
2023-10-15 02:26:35,106 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1888208.0, ans=0.1
2023-10-15 02:26:38,726 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1888208.0, ans=0.125
2023-10-15 02:26:45,253 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1888208.0, ans=0.125
2023-10-15 02:27:27,984 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.10 vs. limit=15.0
2023-10-15 02:27:34,769 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1888348.0, ans=0.0
2023-10-15 02:27:37,376 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1888394.6666666667, ans=0.125
2023-10-15 02:27:38,533 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1888394.6666666667, ans=0.125
2023-10-15 02:27:39,826 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1888394.6666666667, ans=0.125
2023-10-15 02:28:10,603 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1888488.0, ans=0.125
2023-10-15 02:28:13,934 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1888488.0, ans=0.2
2023-10-15 02:28:22,182 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1888534.6666666667, ans=0.125
2023-10-15 02:28:29,877 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1888534.6666666667, ans=0.04949747468305833
2023-10-15 02:28:32,999 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=1888534.6666666667, ans=0.05
2023-10-15 02:28:51,727 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.512e+02 1.807e+02 1.950e+02 2.229e+02 2.917e+02, threshold=3.900e+02, percent-clipped=0.0
2023-10-15 02:28:57,734 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1888674.6666666667, ans=0.125
2023-10-15 02:29:12,490 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1888721.3333333333, ans=0.125
2023-10-15 02:29:19,596 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.21 vs. limit=12.0
2023-10-15 02:29:43,846 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1888814.6666666667, ans=0.125
2023-10-15 02:29:53,701 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1888861.3333333333, ans=0.0
2023-10-15 02:30:10,736 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1888908.0, ans=0.0
2023-10-15 02:30:11,153 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.50 vs. limit=8.0
2023-10-15 02:30:11,644 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1888908.0, ans=0.0
2023-10-15 02:30:27,323 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1888954.6666666667, ans=0.0
2023-10-15 02:30:50,611 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1889048.0, ans=0.1
2023-10-15 02:30:52,535 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1889048.0, ans=0.0
2023-10-15 02:30:58,777 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1889094.6666666667, ans=0.125
2023-10-15 02:30:58,848 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1889094.6666666667, ans=0.125
2023-10-15 02:31:00,151 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.540e+02 1.771e+02 1.898e+02 2.168e+02 3.539e+02, threshold=3.796e+02, percent-clipped=0.0
2023-10-15 02:31:00,523 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1889094.6666666667, ans=0.2
2023-10-15 02:31:19,453 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1889141.3333333333, ans=0.2
2023-10-15 02:32:00,014 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1889281.3333333333, ans=0.09899494936611666
2023-10-15 02:32:09,320 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1889328.0, ans=0.125
2023-10-15 02:32:36,605 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.77 vs. limit=15.0
2023-10-15 02:33:04,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1889514.6666666667, ans=0.0
2023-10-15 02:33:22,243 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.624e+02 1.849e+02 1.997e+02 2.248e+02 2.743e+02, threshold=3.994e+02, percent-clipped=0.0
2023-10-15 02:33:49,390 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1889654.6666666667, ans=0.125
2023-10-15 02:33:54,751 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1889701.3333333333, ans=0.0
2023-10-15 02:33:55,881 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1889701.3333333333, ans=0.125
2023-10-15 02:34:13,124 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1889748.0, ans=0.1
2023-10-15 02:34:15,259 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1889748.0, ans=0.125
2023-10-15 02:35:00,869 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1889934.6666666667, ans=0.125
2023-10-15 02:35:06,113 INFO [train.py:1031] (0/4) Epoch 30, batch 9000, loss[loss=0.2137, simple_loss=0.3134, pruned_loss=0.057, over 16840.00 frames. ], tot_loss[loss=0.1835, simple_loss=0.2765, pruned_loss=0.04528, over 32476329.64 frames. ], batch size: 188, lr: 1.14e-03, grad_scale: 16.0
], tot_loss[loss=0.1835, simple_loss=0.2765, pruned_loss=0.04528, over 32476329.64 frames. ], batch size: 188, lr: 1.14e-03, grad_scale: 16.0 2023-10-15 02:35:24,826 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.603e+02 1.885e+02 2.044e+02 2.275e+02 2.810e+02, threshold=4.087e+02, percent-clipped=0.0 2023-10-15 02:35:36,244 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.72 vs. limit=22.5 2023-10-15 02:36:00,465 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.73 vs. limit=15.0 2023-10-15 02:36:18,180 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.73 vs. limit=15.0 2023-10-15 02:36:30,694 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1890308.0, ans=0.0 2023-10-15 02:37:16,718 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-15 02:37:21,850 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.525e+02 1.801e+02 1.952e+02 2.112e+02 2.853e+02, threshold=3.903e+02, percent-clipped=0.0 2023-10-15 02:37:34,207 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.83 vs. limit=15.0 2023-10-15 02:37:54,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1890634.6666666667, ans=0.125 2023-10-15 02:37:56,254 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1890634.6666666667, ans=0.0 2023-10-15 02:38:05,741 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.49 vs. limit=12.0 2023-10-15 02:38:07,334 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1890681.3333333333, ans=0.0 2023-10-15 02:38:11,752 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1890681.3333333333, ans=0.0 2023-10-15 02:38:18,487 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1890728.0, ans=0.1 2023-10-15 02:38:30,704 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1890774.6666666667, ans=0.0 2023-10-15 02:38:46,320 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1890821.3333333333, ans=0.0 2023-10-15 02:38:47,493 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.80 vs. limit=6.0 2023-10-15 02:38:48,424 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1890821.3333333333, ans=0.125 2023-10-15 02:38:51,879 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.26 vs. 
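[annotation] In the train.py:1031 lines, loss[...] describes the current batch while tot_loss[...] is a running aggregate over a large, fractional frame count (32476329.64 frames at batch 9000), which suggests a frame-weighted average with some exponential decay on the accumulated statistics. A hedged sketch of that bookkeeping; the decay constant is an assumption.

    class RunningLoss:
        """Frame-weighted running average with an (assumed) exponential decay."""
        def __init__(self, decay: float = 0.999):
            self.decay = decay
            self.weighted_loss = 0.0
            self.frames = 0.0

        def update(self, batch_loss: float, batch_frames: float) -> float:
            self.weighted_loss = self.decay * self.weighted_loss + batch_loss * batch_frames
            self.frames = self.decay * self.frames + batch_frames
            return self.weighted_loss / self.frames  # printed as tot_loss

    tracker = RunningLoss()
    print(tracker.update(0.2137, 16840.0))  # cf. the batch-9000 entry above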
limit=6.0 2023-10-15 02:39:24,704 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1890961.3333333333, ans=0.2 2023-10-15 02:39:29,321 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.532e+02 1.899e+02 2.058e+02 2.341e+02 3.080e+02, threshold=4.117e+02, percent-clipped=0.0 2023-10-15 02:39:36,894 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.96 vs. limit=12.0 2023-10-15 02:39:45,880 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.32 vs. limit=15.0 2023-10-15 02:39:49,729 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1891054.6666666667, ans=0.125 2023-10-15 02:40:08,199 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.48 vs. limit=12.0 2023-10-15 02:40:38,388 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=1891241.3333333333, ans=0.2 2023-10-15 02:40:44,681 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1891241.3333333333, ans=0.1 2023-10-15 02:40:51,293 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1891288.0, ans=0.1 2023-10-15 02:41:04,576 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=1891334.6666666667, ans=0.125 2023-10-15 02:41:05,513 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1891334.6666666667, ans=0.125 2023-10-15 02:41:12,525 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1891381.3333333333, ans=0.2 2023-10-15 02:41:26,388 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.626e+02 1.951e+02 2.110e+02 2.368e+02 3.117e+02, threshold=4.220e+02, percent-clipped=0.0 2023-10-15 02:41:30,973 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1891474.6666666667, ans=0.2 2023-10-15 02:41:52,620 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.min_positive, batch_count=1891521.3333333333, ans=0.025 2023-10-15 02:42:15,771 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1891614.6666666667, ans=0.0 2023-10-15 02:42:41,349 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1891708.0, ans=0.1 2023-10-15 02:42:47,906 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.61 vs. limit=15.0 2023-10-15 02:43:06,509 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.42 vs. 
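[annotation] The *_skip_rate values threaded through these entries (attention_skip_rate, conv_skip_rate, ff2_skip_rate, bypass.skip_rate) are scheduled probabilities of bypassing an entire sublayer for a batch, a stochastic-depth-style regularizer; by batch ~1.89M most have annealed to 0.0. A minimal sketch, under the assumption that "skipping" means returning the residual unchanged.

    import torch

    def residual_with_skip(residual: torch.Tensor, sublayer, skip_rate: float,
                           training: bool) -> torch.Tensor:
        # With probability skip_rate, drop the sublayer for this batch entirely.
        if training and float(torch.rand(())) < skip_rate:
            return residual
        return residual + sublayer(residual)  # normal residual connection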
limit=12.0 2023-10-15 02:43:11,204 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1891801.3333333333, ans=0.125 2023-10-15 02:43:44,942 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.462e+02 1.861e+02 2.112e+02 2.458e+02 2.977e+02, threshold=4.224e+02, percent-clipped=0.0 2023-10-15 02:43:50,051 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.20 vs. limit=15.0 2023-10-15 02:44:09,361 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1891988.0, ans=0.2 2023-10-15 02:44:25,430 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1892034.6666666667, ans=0.0 2023-10-15 02:44:28,440 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1892034.6666666667, ans=0.09899494936611666 2023-10-15 02:44:32,622 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1892034.6666666667, ans=0.125 2023-10-15 02:44:34,643 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1892081.3333333333, ans=0.125 2023-10-15 02:44:47,802 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1892128.0, ans=0.0 2023-10-15 02:45:03,598 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1892174.6666666667, ans=0.125 2023-10-15 02:45:04,973 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1892174.6666666667, ans=0.125 2023-10-15 02:45:05,037 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1892174.6666666667, ans=0.1 2023-10-15 02:45:40,713 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1892314.6666666667, ans=0.0 2023-10-15 02:45:41,598 INFO [train.py:1031] (0/4) Epoch 30, batch 9500, loss[loss=0.1955, simple_loss=0.294, pruned_loss=0.04848, over 16647.00 frames. ], tot_loss[loss=0.1841, simple_loss=0.2772, pruned_loss=0.04555, over 32545502.35 frames. 
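[annotation] The loss decomposition in the train.py:1031 entries is characteristic of a pruned-transducer objective: a cheap "simple" alignment loss plus the pruned RNN-T loss, mixed with fixed weights. The weighting below reproduces the logged totals (0.5 x 0.2772 + 0.04555 = 0.18415 ~ 0.1841; 0.5 x 0.2765 + 0.04528 = 0.18353 ~ 0.1835); the 0.5 weight is inferred from those numbers rather than quoted from the recipe.

    def combined_loss(simple_loss: float, pruned_loss: float,
                      simple_loss_scale: float = 0.5) -> float:
        # total = 0.5 * simple_loss + pruned_loss, matching the tot_loss entries above
        return simple_loss_scale * simple_loss + pruned_loss

    print(combined_loss(0.2772, 0.04555))  # 0.18415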
], batch size: 241, lr: 1.14e-03, grad_scale: 16.0 2023-10-15 02:45:59,068 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1892361.3333333333, ans=0.125 2023-10-15 02:46:00,726 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.605e+02 1.892e+02 2.036e+02 2.285e+02 3.400e+02, threshold=4.072e+02, percent-clipped=0.0 2023-10-15 02:46:01,132 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1892361.3333333333, ans=0.05 2023-10-15 02:46:10,993 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1892408.0, ans=0.0 2023-10-15 02:46:11,975 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1892408.0, ans=0.0 2023-10-15 02:46:31,334 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.91 vs. limit=22.5 2023-10-15 02:46:47,077 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.69 vs. limit=15.0 2023-10-15 02:46:56,946 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.96 vs. limit=22.5 2023-10-15 02:47:01,487 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.47 vs. limit=15.0 2023-10-15 02:47:09,486 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1892641.3333333333, ans=0.05 2023-10-15 02:47:09,802 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.97 vs. limit=12.0 2023-10-15 02:47:20,361 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.19 vs. 
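[annotation] Each scaling.py:979 "Whitening" line compares a scalar whiteness metric of some activation's per-group covariance against a scheduled limit (e.g. the metric=18.19 entry immediately above, against its limit of 22.5); the module only pushes back on the activations once the metric crosses the limit. One natural metric with the logged behaviour, equal to 1.0 for an identity-like covariance and growing when a few directions dominate, is sketched below; it is an illustration, not necessarily icefall's exact formula.

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int) -> float:
        # x: (num_frames, num_channels); channels are split into num_groups groups
        num_frames, num_channels = x.shape
        per_group = num_channels // num_groups
        xg = x.reshape(num_frames, num_groups, per_group)
        metrics = []
        for g in range(num_groups):
            cov = xg[:, g, :].t() @ xg[:, g, :] / num_frames
            eigs = torch.linalg.eigvalsh(cov)  # eigenvalues of the symmetric covariance
            metrics.append(float((eigs ** 2).mean() / eigs.mean() ** 2))
        return sum(metrics) / num_groups  # 1.0 when all directions carry equal energy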
limit=22.5 2023-10-15 02:47:34,265 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1892734.6666666667, ans=0.1 2023-10-15 02:47:49,162 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1892781.3333333333, ans=0.1 2023-10-15 02:48:02,056 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.589e+02 1.922e+02 2.150e+02 2.355e+02 3.542e+02, threshold=4.300e+02, percent-clipped=0.0 2023-10-15 02:48:19,608 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1892921.3333333333, ans=0.125 2023-10-15 02:48:20,516 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1892921.3333333333, ans=0.125 2023-10-15 02:48:25,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1892921.3333333333, ans=0.2 2023-10-15 02:48:28,304 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1892921.3333333333, ans=0.125 2023-10-15 02:48:29,704 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.61 vs. limit=22.5 2023-10-15 02:49:02,100 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.52 vs. limit=6.0 2023-10-15 02:49:10,405 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1893108.0, ans=0.125 2023-10-15 02:49:10,462 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1893108.0, ans=0.2 2023-10-15 02:49:14,254 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1893108.0, ans=0.2 2023-10-15 02:49:43,642 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1893201.3333333333, ans=0.125 2023-10-15 02:50:06,655 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1893294.6666666667, ans=0.2 2023-10-15 02:50:11,132 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.533e+02 1.827e+02 1.965e+02 2.130e+02 3.478e+02, threshold=3.930e+02, percent-clipped=0.0 2023-10-15 02:50:17,279 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1893341.3333333333, ans=0.2 2023-10-15 02:50:39,683 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1893388.0, ans=0.125 2023-10-15 02:51:13,619 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1893528.0, ans=0.0 2023-10-15 02:51:19,687 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1893574.6666666667, ans=0.125 2023-10-15 02:51:37,236 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1893621.3333333333, ans=0.0 2023-10-15 02:52:19,749 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.421e+02 1.794e+02 
1.956e+02 2.062e+02 2.960e+02, threshold=3.913e+02, percent-clipped=0.0 2023-10-15 02:52:21,092 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0 2023-10-15 02:52:49,097 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.07 vs. limit=15.0 2023-10-15 02:52:52,965 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1893901.3333333333, ans=0.125 2023-10-15 02:53:15,206 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1893994.6666666667, ans=0.0 2023-10-15 02:53:33,147 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1894088.0, ans=0.125 2023-10-15 02:53:40,170 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1894088.0, ans=0.125 2023-10-15 02:53:41,725 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1894088.0, ans=0.1 2023-10-15 02:53:43,293 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1894088.0, ans=10.0 2023-10-15 02:54:16,159 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1894228.0, ans=0.0 2023-10-15 02:54:18,363 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.601e+02 1.800e+02 1.923e+02 2.098e+02 2.819e+02, threshold=3.847e+02, percent-clipped=0.0 2023-10-15 02:54:20,700 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1894228.0, ans=0.125 2023-10-15 02:54:27,819 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.35 vs. limit=22.5 2023-10-15 02:54:45,664 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1894368.0, ans=0.2 2023-10-15 02:55:26,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=1894508.0, ans=0.125 2023-10-15 02:55:31,533 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.54 vs. limit=15.0 2023-10-15 02:55:36,687 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1894554.6666666667, ans=0.0 2023-10-15 02:55:54,591 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.60 vs. limit=15.0 2023-10-15 02:56:02,002 INFO [train.py:1031] (0/4) Epoch 30, batch 10000, loss[loss=0.1905, simple_loss=0.2732, pruned_loss=0.05391, over 16512.00 frames. ], tot_loss[loss=0.1833, simple_loss=0.2764, pruned_loss=0.04514, over 32614165.25 frames. 
], batch size: 266, lr: 1.14e-03, grad_scale: 32.0 2023-10-15 02:56:03,562 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=1894648.0, ans=0.125 2023-10-15 02:56:21,764 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.877e+02 2.028e+02 2.256e+02 3.121e+02, threshold=4.055e+02, percent-clipped=0.0 2023-10-15 02:56:23,172 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1894694.6666666667, ans=0.1 2023-10-15 02:56:23,486 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.72 vs. limit=6.0 2023-10-15 02:56:55,664 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1894834.6666666667, ans=0.0 2023-10-15 02:57:05,137 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1894834.6666666667, ans=0.125 2023-10-15 02:57:48,425 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1895021.3333333333, ans=0.2 2023-10-15 02:57:49,588 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1895021.3333333333, ans=0.2 2023-10-15 02:58:05,522 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1895068.0, ans=0.0 2023-10-15 02:58:05,747 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.66 vs. limit=15.0 2023-10-15 02:58:10,484 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.07 vs. limit=15.0 2023-10-15 02:58:23,781 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1895114.6666666667, ans=0.0 2023-10-15 02:58:33,769 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.23 vs. limit=15.0 2023-10-15 02:58:38,087 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.908e+02 2.115e+02 2.307e+02 3.069e+02, threshold=4.230e+02, percent-clipped=0.0 2023-10-15 02:58:38,449 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1895161.3333333333, ans=0.0 2023-10-15 02:58:59,814 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-10-15 02:59:20,307 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1895348.0, ans=15.0 2023-10-15 02:59:21,798 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1895348.0, ans=10.0 2023-10-15 02:59:23,662 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1895348.0, ans=0.1 2023-10-15 02:59:40,255 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.34 vs. 
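[annotation] The grad_scale values in the train.py:1031 entries (16.0 earlier, 32.0 here) are the dynamic fp16 loss scale, which the scaler grows after runs of overflow-free steps and halves on inf/nan gradients. A minimal mixed-precision step using the standard torch.cuda.amp API; the model, optimizer and batch handling are assumed.

    import torch

    scaler = torch.cuda.amp.GradScaler(init_scale=16.0)  # initial scale: an assumption

    def fp16_step(model, optimizer, batch):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast(dtype=torch.float16):
            loss = model(batch)
        scaler.scale(loss).backward()  # backward through the scaled loss
        scaler.step(optimizer)         # unscales grads; skips the step on inf/nan
        scaler.update()                # grows/shrinks the scale dynamically
        return loss.detach(), scaler.get_scale()  # e.g. 32.0, as logged above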
limit=15.0 2023-10-15 03:00:02,141 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-15 03:00:09,112 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1895534.6666666667, ans=0.125 2023-10-15 03:00:37,989 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.850e+02 2.085e+02 2.326e+02 2.900e+02, threshold=4.171e+02, percent-clipped=0.0 2023-10-15 03:00:40,341 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1895628.0, ans=0.1 2023-10-15 03:01:12,484 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1895768.0, ans=0.1 2023-10-15 03:01:32,889 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1895814.6666666667, ans=0.035 2023-10-15 03:01:56,176 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.66 vs. limit=10.0 2023-10-15 03:02:04,207 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1895908.0, ans=0.125 2023-10-15 03:02:05,685 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1895908.0, ans=0.95 2023-10-15 03:02:08,353 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1895954.6666666667, ans=0.2 2023-10-15 03:02:10,391 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1895954.6666666667, ans=0.125 2023-10-15 03:02:15,288 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.34 vs. limit=12.0 2023-10-15 03:02:50,091 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1896048.0, ans=0.125 2023-10-15 03:02:51,690 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.60 vs. limit=15.0 2023-10-15 03:02:56,655 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1896094.6666666667, ans=0.125 2023-10-15 03:03:02,878 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.606e+02 1.904e+02 2.059e+02 2.288e+02 3.525e+02, threshold=4.118e+02, percent-clipped=0.0 2023-10-15 03:03:10,727 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=1896141.3333333333, ans=0.0 2023-10-15 03:03:11,137 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.45 vs. limit=5.0 2023-10-15 03:03:53,272 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1896281.3333333333, ans=0.1 2023-10-15 03:03:55,480 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.14 vs. 
limit=12.0 2023-10-15 03:03:58,545 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_na.min_abs, batch_count=1896281.3333333333, ans=0.02 2023-10-15 03:04:20,617 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.92 vs. limit=15.0 2023-10-15 03:04:23,916 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1896421.3333333333, ans=0.125 2023-10-15 03:04:29,761 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1896421.3333333333, ans=0.125 2023-10-15 03:04:35,736 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1896468.0, ans=0.125 2023-10-15 03:05:04,847 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1896561.3333333333, ans=0.0 2023-10-15 03:05:11,145 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.868e+02 2.044e+02 2.228e+02 3.238e+02, threshold=4.087e+02, percent-clipped=0.0 2023-10-15 03:05:18,133 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.51 vs. limit=15.0 2023-10-15 03:05:44,494 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1896701.3333333333, ans=0.125 2023-10-15 03:06:12,988 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1896794.6666666667, ans=0.0 2023-10-15 03:06:21,192 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=1896841.3333333333, ans=0.125 2023-10-15 03:06:21,437 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.44 vs. limit=15.0 2023-10-15 03:06:23,548 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1896841.3333333333, ans=0.2 2023-10-15 03:06:51,490 INFO [train.py:1031] (0/4) Epoch 30, batch 10500, loss[loss=0.1889, simple_loss=0.2801, pruned_loss=0.04888, over 17015.00 frames. ], tot_loss[loss=0.1839, simple_loss=0.277, pruned_loss=0.04544, over 32656399.82 frames. 
], batch size: 117, lr: 1.13e-03, grad_scale: 16.0 2023-10-15 03:07:12,093 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1897028.0, ans=0.05 2023-10-15 03:07:12,748 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.648e+02 1.831e+02 1.990e+02 2.167e+02 2.921e+02, threshold=3.980e+02, percent-clipped=0.0 2023-10-15 03:07:14,944 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1897074.6666666667, ans=0.125 2023-10-15 03:07:22,111 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1897074.6666666667, ans=0.0 2023-10-15 03:07:24,284 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1897074.6666666667, ans=0.125 2023-10-15 03:07:39,305 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1897168.0, ans=0.125 2023-10-15 03:07:41,477 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1897168.0, ans=0.125 2023-10-15 03:07:54,802 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1897214.6666666667, ans=0.2 2023-10-15 03:07:55,954 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.56 vs. limit=15.0 2023-10-15 03:08:02,427 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1897214.6666666667, ans=0.125 2023-10-15 03:08:19,869 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.19 vs. limit=22.5 2023-10-15 03:08:42,342 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1897354.6666666667, ans=0.125 2023-10-15 03:09:09,458 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.87 vs. limit=22.5 2023-10-15 03:09:11,550 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1897448.0, ans=0.0 2023-10-15 03:09:12,881 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.10 vs. 
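[annotation] The balancer parameters reported throughout (min_positive=0.05 and 0.025, max_positive=0.95, min_abs=0.02, max_abs=10.0) bound per-channel activation statistics: the fraction of positive values and the mean magnitude. A diagnostic sketch of what a balancer measures; the real module enforces the bounds through gradient corrections rather than reporting them.

    import torch

    def balancer_report(x: torch.Tensor, min_positive=0.05, max_positive=0.95,
                        min_abs=0.02, max_abs=10.0) -> torch.Tensor:
        # x: (num_frames, num_channels) activations at the balancer's position
        pos_frac = (x > 0).float().mean(dim=0)  # per-channel fraction positive
        mean_abs = x.abs().mean(dim=0)          # per-channel mean magnitude
        out_of_range = ((pos_frac < min_positive) | (pos_frac > max_positive)
                        | (mean_abs < min_abs) | (mean_abs > max_abs))
        return out_of_range  # channels the balancer would nudge back into range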
limit=12.0 2023-10-15 03:09:15,701 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1897448.0, ans=0.125 2023-10-15 03:09:41,302 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.580e+02 1.973e+02 2.109e+02 2.347e+02 3.033e+02, threshold=4.218e+02, percent-clipped=0.0 2023-10-15 03:09:52,466 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1897541.3333333333, ans=0.125 2023-10-15 03:10:11,004 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1897588.0, ans=0.1 2023-10-15 03:10:13,237 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1897588.0, ans=0.2 2023-10-15 03:10:37,345 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1897681.3333333333, ans=0.1 2023-10-15 03:10:51,713 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.54 vs. limit=12.0 2023-10-15 03:11:07,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=1897774.6666666667, ans=0.2 2023-10-15 03:11:22,298 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.95 vs. limit=22.5 2023-10-15 03:11:48,822 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1897914.6666666667, ans=0.0 2023-10-15 03:12:01,223 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.557e+02 1.896e+02 2.026e+02 2.214e+02 3.195e+02, threshold=4.052e+02, percent-clipped=0.0 2023-10-15 03:12:05,120 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.81 vs. limit=22.5 2023-10-15 03:12:58,349 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1898194.6666666667, ans=0.0 2023-10-15 03:13:05,709 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1898194.6666666667, ans=0.125 2023-10-15 03:13:18,073 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1898241.3333333333, ans=0.0 2023-10-15 03:13:34,517 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1898334.6666666667, ans=0.0 2023-10-15 03:13:36,422 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1898334.6666666667, ans=0.0 2023-10-15 03:14:11,427 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.627e+02 1.913e+02 2.120e+02 2.407e+02 3.969e+02, threshold=4.240e+02, percent-clipped=0.0 2023-10-15 03:14:27,244 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.03 vs. 
limit=22.5 2023-10-15 03:14:29,401 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1898521.3333333333, ans=0.1 2023-10-15 03:15:06,821 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1898614.6666666667, ans=0.0 2023-10-15 03:15:42,692 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1898708.0, ans=0.125 2023-10-15 03:15:57,509 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1898754.6666666667, ans=0.0 2023-10-15 03:16:18,153 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1898848.0, ans=0.1 2023-10-15 03:16:30,176 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.62 vs. limit=6.0 2023-10-15 03:16:32,883 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.73 vs. limit=15.0 2023-10-15 03:16:45,322 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.471e+02 1.759e+02 1.949e+02 2.162e+02 3.759e+02, threshold=3.899e+02, percent-clipped=0.0 2023-10-15 03:16:55,974 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.04 vs. limit=15.0 2023-10-15 03:17:11,556 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.21 vs. limit=22.5 2023-10-15 03:17:23,036 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1899034.6666666667, ans=0.125 2023-10-15 03:17:26,717 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1899034.6666666667, ans=0.1 2023-10-15 03:17:30,503 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1899034.6666666667, ans=0.1 2023-10-15 03:17:43,268 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1899081.3333333333, ans=0.0 2023-10-15 03:18:18,270 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1899174.6666666667, ans=0.125 2023-10-15 03:18:28,826 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1899221.3333333333, ans=0.07 2023-10-15 03:18:43,922 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1899268.0, ans=0.0 2023-10-15 03:18:46,980 INFO [train.py:1031] (0/4) Epoch 30, batch 11000, loss[loss=0.1786, simple_loss=0.2483, pruned_loss=0.05448, over 12718.00 frames. ], tot_loss[loss=0.1839, simple_loss=0.2768, pruned_loss=0.04544, over 32682596.27 frames. 
], batch size: 440, lr: 1.13e-03, grad_scale: 16.0 2023-10-15 03:18:55,427 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1899314.6666666667, ans=0.0 2023-10-15 03:19:10,471 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1899361.3333333333, ans=0.0 2023-10-15 03:19:11,471 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1899361.3333333333, ans=0.1 2023-10-15 03:19:14,243 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.547e+02 1.888e+02 1.999e+02 2.288e+02 2.873e+02, threshold=3.997e+02, percent-clipped=0.0 2023-10-15 03:19:21,396 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1899408.0, ans=0.0 2023-10-15 03:19:22,368 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-15 03:19:48,637 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.56 vs. limit=15.0 2023-10-15 03:20:13,171 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=10.08 vs. limit=22.5 2023-10-15 03:20:21,484 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1899641.3333333333, ans=0.0 2023-10-15 03:20:44,681 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=1899688.0, ans=0.125 2023-10-15 03:20:49,012 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1899688.0, ans=0.0 2023-10-15 03:21:00,982 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1899734.6666666667, ans=0.035 2023-10-15 03:21:10,041 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.70 vs. 
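[annotation] The learning rate in these entries decays very slowly (1.14e-03 down to 1.13e-03 across epoch 30), consistent with an Eden-style schedule that shrinks with both batch index and epoch. The rule below, with base_lr = 0.045, lr_batches = 7500 and lr_epochs = 1.0 taken as assumed typical recipe values, lands in the same range.

    def eden_lr(base_lr: float, batch: int, epoch: float,
                lr_batches: float = 7500.0, lr_epochs: float = 1.0) -> float:
        batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
        epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
        return base_lr * batch_factor * epoch_factor

    print(eden_lr(0.045, batch=400000, epoch=30.0))  # ~1.1e-03, cf. the lr logged above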
limit=22.5 2023-10-15 03:21:21,512 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1899781.3333333333, ans=0.2 2023-10-15 03:21:46,327 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.901e+02 2.059e+02 2.325e+02 3.331e+02, threshold=4.118e+02, percent-clipped=0.0 2023-10-15 03:22:22,883 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1899968.0, ans=0.0 2023-10-15 03:22:28,908 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1899968.0, ans=0.0 2023-10-15 03:22:33,159 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1899968.0, ans=10.0 2023-10-15 03:22:42,573 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-15 03:23:02,104 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1900061.3333333333, ans=0.0 2023-10-15 03:24:21,075 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1900294.6666666667, ans=0.2 2023-10-15 03:24:25,008 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.50 vs. limit=22.5 2023-10-15 03:24:28,774 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1900294.6666666667, ans=0.09899494936611666 2023-10-15 03:24:33,825 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.534e+02 1.743e+02 1.887e+02 2.069e+02 2.808e+02, threshold=3.773e+02, percent-clipped=0.0 2023-10-15 03:25:09,196 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.58 vs. limit=12.0 2023-10-15 03:25:35,693 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1900528.0, ans=0.0 2023-10-15 03:25:40,858 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1900528.0, ans=0.125 2023-10-15 03:25:46,544 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1900528.0, ans=0.125 2023-10-15 03:26:23,171 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1900668.0, ans=0.1 2023-10-15 03:26:25,317 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1900668.0, ans=0.1 2023-10-15 03:26:27,205 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.29 vs. 
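[annotation] The scaling.py:1069 "WithLoss" entries accumulate an auxiliary penalty attached to the self-attention weights and report its sum over the logging interval; loss-sum=0.000e+00 means the penalty never activated. A sketch of the attach-and-report pattern; the wrapper name and the specific penalty are illustrative, not icefall's actual mechanism.

    import torch

    class WithAuxLoss(torch.nn.Module):
        def __init__(self, inner: torch.nn.Module):
            super().__init__()
            self.inner = inner
            self.loss_sum = 0.0  # what the log line reports, then resets
            self.aux = None      # to be added to the main loss by the caller

        def forward(self, *args, **kwargs):
            out = self.inner(*args, **kwargs)
            if self.training:
                # Illustrative penalty: nonzero only for unusually large weights.
                self.aux = torch.relu(out.abs().mean() - 1.0)
                self.loss_sum += float(self.aux.detach())
            return out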
limit=15.0 2023-10-15 03:26:31,966 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1900714.6666666667, ans=0.0 2023-10-15 03:26:40,152 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1900714.6666666667, ans=0.125 2023-10-15 03:27:00,684 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.26 vs. limit=15.0 2023-10-15 03:27:04,292 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.583e+02 1.846e+02 2.049e+02 2.325e+02 2.789e+02, threshold=4.098e+02, percent-clipped=0.0 2023-10-15 03:27:13,418 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1900808.0, ans=0.0 2023-10-15 03:27:16,261 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1900808.0, ans=0.04949747468305833 2023-10-15 03:27:36,445 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1900901.3333333333, ans=0.0 2023-10-15 03:28:03,483 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1900994.6666666667, ans=0.125 2023-10-15 03:28:25,208 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1901041.3333333333, ans=0.2 2023-10-15 03:28:39,865 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.90 vs. limit=22.5 2023-10-15 03:28:44,240 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1901088.0, ans=0.125 2023-10-15 03:28:45,557 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1901088.0, ans=0.0 2023-10-15 03:29:43,375 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1901228.0, ans=0.125 2023-10-15 03:29:45,003 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.607e+02 1.880e+02 2.006e+02 2.240e+02 3.317e+02, threshold=4.011e+02, percent-clipped=0.0 2023-10-15 03:30:11,235 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1901321.3333333333, ans=0.125 2023-10-15 03:30:47,086 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1901414.6666666667, ans=0.125 2023-10-15 03:30:55,631 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1901461.3333333333, ans=0.2 2023-10-15 03:31:02,805 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1901461.3333333333, ans=0.1 2023-10-15 03:31:06,980 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1901461.3333333333, ans=0.0 2023-10-15 03:31:57,130 INFO [train.py:1031] (0/4) Epoch 30, batch 11500, loss[loss=0.1876, simple_loss=0.2818, pruned_loss=0.04666, over 16591.00 frames. ], tot_loss[loss=0.1835, simple_loss=0.2765, pruned_loss=0.04528, over 32717545.56 frames. 
], batch size: 56, lr: 1.13e-03, grad_scale: 16.0 2023-10-15 03:32:14,353 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1901694.6666666667, ans=0.1 2023-10-15 03:32:19,348 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1901694.6666666667, ans=0.2 2023-10-15 03:32:30,338 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.732e+02 1.992e+02 2.212e+02 2.432e+02 3.211e+02, threshold=4.424e+02, percent-clipped=0.0 2023-10-15 03:32:43,078 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1901788.0, ans=0.125 2023-10-15 03:32:52,384 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1901788.0, ans=0.125 2023-10-15 03:33:40,279 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-10-15 03:33:48,684 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1901974.6666666667, ans=0.1 2023-10-15 03:33:52,418 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.97 vs. limit=10.0 2023-10-15 03:33:54,728 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1901974.6666666667, ans=0.035 2023-10-15 03:33:56,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1901974.6666666667, ans=0.125 2023-10-15 03:34:03,265 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1902021.3333333333, ans=0.125 2023-10-15 03:34:20,207 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1902068.0, ans=0.5 2023-10-15 03:34:27,802 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1902068.0, ans=0.0 2023-10-15 03:34:30,894 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1902114.6666666667, ans=0.0 2023-10-15 03:34:46,110 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1902161.3333333333, ans=0.125 2023-10-15 03:34:48,799 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=1902161.3333333333, ans=0.2 2023-10-15 03:34:53,661 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1902161.3333333333, ans=0.125 2023-10-15 03:34:54,487 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1902161.3333333333, ans=0.0 2023-10-15 03:35:03,042 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.560e+02 1.839e+02 2.030e+02 2.267e+02 3.028e+02, threshold=4.060e+02, percent-clipped=0.0 2023-10-15 03:35:04,596 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.93 vs. 
limit=22.5 2023-10-15 03:35:42,558 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1902301.3333333333, ans=0.1 2023-10-15 03:35:46,426 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1902301.3333333333, ans=0.125 2023-10-15 03:36:00,174 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1902348.0, ans=0.125 2023-10-15 03:36:04,523 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.99 vs. limit=10.0 2023-10-15 03:36:11,501 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1902348.0, ans=0.0 2023-10-15 03:36:11,963 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.87 vs. limit=22.5 2023-10-15 03:36:22,776 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1902394.6666666667, ans=0.1 2023-10-15 03:36:30,072 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.23 vs. limit=15.0 2023-10-15 03:36:30,937 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1902441.3333333333, ans=0.125 2023-10-15 03:36:46,124 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1902441.3333333333, ans=0.0 2023-10-15 03:36:53,100 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1902488.0, ans=0.125 2023-10-15 03:37:22,009 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1902534.6666666667, ans=0.0 2023-10-15 03:37:38,438 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1902581.3333333333, ans=0.125 2023-10-15 03:37:43,643 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.81 vs. 
limit=15.0 2023-10-15 03:37:54,526 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1902628.0, ans=0.0 2023-10-15 03:38:04,801 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.558e+02 1.804e+02 1.942e+02 2.177e+02 3.091e+02, threshold=3.884e+02, percent-clipped=0.0 2023-10-15 03:38:12,101 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1902674.6666666667, ans=0.1 2023-10-15 03:38:30,748 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1902721.3333333333, ans=0.0 2023-10-15 03:38:33,688 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1902721.3333333333, ans=0.1 2023-10-15 03:39:35,060 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1902861.3333333333, ans=0.125 2023-10-15 03:39:50,434 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1902861.3333333333, ans=0.1 2023-10-15 03:39:55,802 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1902908.0, ans=0.125 2023-10-15 03:40:35,060 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1902954.6666666667, ans=0.125 2023-10-15 03:41:37,698 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1903048.0, ans=0.0 2023-10-15 03:41:46,638 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.93 vs. limit=12.0 2023-10-15 03:41:50,372 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1903048.0, ans=0.125 2023-10-15 03:41:57,082 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.61 vs. limit=6.0 2023-10-15 03:42:10,345 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1903094.6666666667, ans=0.125 2023-10-15 03:42:28,161 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=1903141.3333333333, ans=0.0 2023-10-15 03:42:30,286 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.611e+02 1.876e+02 2.081e+02 2.298e+02 3.208e+02, threshold=4.162e+02, percent-clipped=0.0 2023-10-15 03:42:41,645 INFO [scaling.py:979] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.48 vs. 
limit=8.0 2023-10-15 03:43:25,645 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1903188.0, ans=0.125 2023-10-15 03:43:33,578 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1903234.6666666667, ans=0.125 2023-10-15 03:43:51,624 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1903281.3333333333, ans=0.1 2023-10-15 03:44:04,294 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1903281.3333333333, ans=0.125 2023-10-15 03:44:07,101 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.40 vs. limit=22.5 2023-10-15 03:45:10,469 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.74 vs. limit=15.0 2023-10-15 03:45:24,749 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1903421.3333333333, ans=0.0 2023-10-15 03:46:15,541 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1903514.6666666667, ans=0.0 2023-10-15 03:46:20,503 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1903514.6666666667, ans=0.125 2023-10-15 03:46:20,507 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1903514.6666666667, ans=0.0 2023-10-15 03:47:01,141 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.551e+02 1.893e+02 2.098e+02 2.406e+02 4.118e+02, threshold=4.196e+02, percent-clipped=0.0 2023-10-15 03:47:03,380 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1903608.0, ans=0.1 2023-10-15 03:47:28,688 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1903654.6666666667, ans=0.125 2023-10-15 03:47:35,092 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1903654.6666666667, ans=0.0 2023-10-15 03:47:49,689 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1903701.3333333333, ans=0.125 2023-10-15 03:47:59,081 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.85 vs. limit=10.0 2023-10-15 03:48:03,374 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=1903701.3333333333, ans=0.125 2023-10-15 03:48:54,287 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=1903794.6666666667, ans=0.125 2023-10-15 03:49:01,869 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.38 vs. 
2023-10-15 03:49:07,409 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1903841.3333333333, ans=0.09899494936611666
2023-10-15 03:49:18,897 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.78 vs. limit=12.0
2023-10-15 03:50:12,200 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1903934.6666666667, ans=0.0
2023-10-15 03:50:25,090 INFO [train.py:1031] (0/4) Epoch 30, batch 12000, loss[loss=0.1994, simple_loss=0.2928, pruned_loss=0.05302, over 16960.00 frames. ], tot_loss[loss=0.1834, simple_loss=0.2767, pruned_loss=0.04512, over 32750189.63 frames. ], batch size: 77, lr: 1.13e-03, grad_scale: 32.0
2023-10-15 03:50:30,568 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/checkpoint-408000.pt
2023-10-15 03:50:40,361 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-15 03:51:13,157 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1904028.0, ans=0.2
2023-10-15 03:51:19,550 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.626e+02 1.815e+02 2.019e+02 2.264e+02 3.625e+02, threshold=4.038e+02, percent-clipped=0.0
2023-10-15 03:51:58,496 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1904121.3333333333, ans=0.125
2023-10-15 03:52:24,196 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1904168.0, ans=0.125
2023-10-15 03:52:36,961 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.96 vs. limit=15.0
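The [train.py:1031] line above reports three loss components per batch: the combined loss plus the simple and pruned transducer terms. The logged numbers are consistent with the total being a weighted sum, loss = 0.5 * simple_loss + pruned_loss, where the 0.5 presumably corresponds to the run's simple-loss scale; this relation is inferred from the numbers below rather than read out of train.py.

```python
# Checking the batch-12000 line: loss[loss=0.1994, simple_loss=0.2928,
# pruned_loss=0.05302]. 0.5 * 0.2928 + 0.05302 = 0.19942, matching the
# logged loss to display precision.
simple_loss, pruned_loss = 0.2928, 0.05302
assert abs(0.5 * simple_loss + pruned_loss - 0.1994) < 1e-3

# The running tot_loss[...] values on the same line obey the same relation:
# 0.5 * 0.2767 + 0.04512 = 0.18347, logged as 0.1834.
assert abs(0.5 * 0.2767 + 0.04512 - 0.1834) < 1e-3
```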
2023-10-15 03:52:38,896 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1904168.0, ans=0.0
2023-10-15 03:52:56,558 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1904214.6666666667, ans=0.0
2023-10-15 03:53:48,891 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1904354.6666666667, ans=0.0
2023-10-15 03:53:56,932 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1904354.6666666667, ans=0.0
2023-10-15 03:54:37,634 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1904494.6666666667, ans=0.0
2023-10-15 03:54:45,627 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.502e+02 1.793e+02 1.980e+02 2.129e+02 2.949e+02, threshold=3.961e+02, percent-clipped=0.0
2023-10-15 03:54:54,019 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1904541.3333333333, ans=0.0
2023-10-15 03:55:02,901 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1904588.0, ans=0.0
2023-10-15 03:55:23,636 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1904681.3333333333, ans=0.125
2023-10-15 03:55:23,674 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1904681.3333333333, ans=0.1
2023-10-15 03:55:30,996 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1904728.0, ans=0.125
2023-10-15 03:55:36,285 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-15 03:55:41,541 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.18 vs. limit=15.0
2023-10-15 03:56:17,515 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1904914.6666666667, ans=0.0
2023-10-15 03:56:17,573 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1904914.6666666667, ans=0.125
2023-10-15 03:56:41,616 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.625e+02 1.921e+02 2.102e+02 2.274e+02 3.124e+02, threshold=4.204e+02, percent-clipped=0.0
2023-10-15 03:56:50,149 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1905054.6666666667, ans=0.0
2023-10-15 03:56:51,062 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1905054.6666666667, ans=0.125
2023-10-15 03:56:52,410 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.13 vs. limit=15.0
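The [scaling.py:979] Whitening lines compare a per-module statistic against a scheduled limit (itself a ScheduledFloat, as the whitening_limit entries elsewhere in this log show). One plausible reading of the metric is a measure of how anisotropic the activation covariance is within each channel group: 1.0 when all covariance eigenvalues are equal (fully "white"), larger as energy concentrates in fewer directions. The sketch below is written under that assumption and mirrors the intent of the Whiten modules rather than copying scaling.py.

```python
# Hedged sketch of a whiteness metric like the ones logged above: for
# features x of shape (frames, channels), compute the covariance C per
# channel group; with eigenvalues lambda_i, the quantity
#   d * sum(lambda_i^2) / (sum(lambda_i))^2
# equals 1.0 exactly when all eigenvalues are equal and grows as the
# covariance becomes low-rank.
import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    metrics = []
    for g in x.chunk(num_groups, dim=-1):        # split channels into groups
        g = g - g.mean(dim=0, keepdim=True)
        cov = (g.T @ g) / g.shape[0]             # (d, d) covariance
        eigs = torch.linalg.eigvalsh(cov)
        d = eigs.numel()
        metrics.append(d * (eigs ** 2).sum() / eigs.sum() ** 2)
    return torch.stack(metrics).mean().item()

white = torch.randn(10000, 384)
assert whitening_metric(white) < 1.2             # close to 1.0 for white noise
```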
2023-10-15 03:56:55,592 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1905054.6666666667, ans=0.125
2023-10-15 03:56:55,693 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1905054.6666666667, ans=0.125
2023-10-15 03:56:56,846 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.50 vs. limit=15.0
2023-10-15 03:57:02,788 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1905101.3333333333, ans=0.0
2023-10-15 03:57:18,473 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1905148.0, ans=10.0
2023-10-15 03:57:21,952 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1905148.0, ans=0.0
2023-10-15 03:57:22,248 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.64 vs. limit=22.5
2023-10-15 03:57:46,018 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-10-15 03:57:48,060 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=1905288.0, ans=0.0
2023-10-15 03:57:49,208 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1905288.0, ans=0.2
2023-10-15 03:57:50,475 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1905288.0, ans=0.07
2023-10-15 03:58:01,114 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1905334.6666666667, ans=0.125
2023-10-15 03:58:06,148 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=1905334.6666666667, ans=0.125
2023-10-15 03:58:24,432 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1905428.0, ans=0.0
2023-10-15 03:58:27,106 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1905428.0, ans=0.125
2023-10-15 03:58:30,944 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1905428.0, ans=0.0
2023-10-15 03:58:34,621 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.636e+02 1.929e+02 2.061e+02 2.264e+02 5.478e+02, threshold=4.122e+02, percent-clipped=1.0
2023-10-15 03:58:42,422 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1905474.6666666667, ans=0.1
2023-10-15 03:58:47,663 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.38 vs. limit=6.0
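Note the percent-clipped=1.0 in the [optim.py:471] entry above: its max quartile (5.478e+02) exceeds the threshold (4.122e+02), so at least one recent step was actually clipped, unlike the surrounding intervals. The [scaling.py:1069] WithLoss lines, meanwhile, report an auxiliary loss accumulated on a module's attention weights; loss-sum=0.000e+00 means the penalty contributed nothing over the logging interval. The following is a hypothetical wrapper showing the general pattern of attaching a loggable penalty to activations; icefall wires this up differently inside scaling.py.

```python
# Hedged sketch of the "WithLoss" idea: add a small auxiliary penalty on a
# module's output and accumulate it so it can be logged as a loss-sum.
# All names and the specific penalty are illustrative.
import torch
import torch.nn as nn

class WithAuxLoss(nn.Module):
    def __init__(self, module: nn.Module, weight: float = 1e-4,
                 limit: float = 10.0):
        super().__init__()
        self.module, self.weight, self.limit = module, weight, limit
        self.loss_sum = 0.0  # periodically logged, then reset

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.module(x)
        if self.training:
            excess = (y.abs() - self.limit).clamp(min=0.0)
            aux = self.weight * excess.pow(2).mean()
            self.loss_sum += aux.item()      # stays 0.0 while y is in range,
                                             # matching "loss-sum=0.000e+00"
            y = y + (aux - aux.detach())     # inject gradient, keep value
        return y
```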
2023-10-15 03:58:55,582 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1905568.0, ans=0.04949747468305833
2023-10-15 03:59:02,820 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.90 vs. limit=12.0
2023-10-15 03:59:06,805 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1905568.0, ans=0.0
2023-10-15 03:59:25,163 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.46 vs. limit=10.0
2023-10-15 03:59:38,130 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1905708.0, ans=0.07
2023-10-15 03:59:39,000 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1905708.0, ans=0.125
2023-10-15 03:59:47,524 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=1905708.0, ans=0.125
2023-10-15 03:59:49,446 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-10-15 03:59:51,756 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1905754.6666666667, ans=0.0
2023-10-15 04:00:25,532 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1905848.0, ans=0.0
2023-10-15 04:00:25,537 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1905848.0, ans=0.125
2023-10-15 04:00:36,334 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1905894.6666666667, ans=0.125
2023-10-15 04:00:44,911 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.572e+02 1.888e+02 2.057e+02 2.197e+02 2.884e+02, threshold=4.114e+02, percent-clipped=0.0
2023-10-15 04:00:59,250 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1905988.0, ans=0.1
2023-10-15 04:01:03,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1905988.0, ans=0.0
2023-10-15 04:01:34,532 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1906128.0, ans=0.2
2023-10-15 04:01:35,826 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1906128.0, ans=0.1
2023-10-15 04:01:45,810 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1906174.6666666667, ans=0.0
2023-10-15 04:01:50,703 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1906174.6666666667, ans=0.0
2023-10-15 04:02:22,048 INFO [train.py:1031] (0/4) Epoch 30, batch 12500, loss[loss=0.1677, simple_loss=0.2704, pruned_loss=0.03245, over 16632.00 frames. ], tot_loss[loss=0.1833, simple_loss=0.2764, pruned_loss=0.04512, over 32776983.66 frames. ], batch size: 219, lr: 1.13e-03, grad_scale: 32.0
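In the [train.py:1031] lines, loss[...] is the current batch and tot_loss[...] a frame-weighted running aggregate; the fractional totals such as "over 32776983.66 frames" suggest older batches are geometrically down-weighted rather than simply summed. A sketch of that kind of tracker, with a hypothetical decay constant; icefall's actual aggregation lives in its MetricsTracker and may differ in detail.

```python
# Hedged sketch of the tot_loss[...] aggregation: keep frame-weighted running
# sums of each loss component, decay them slightly before each new batch, and
# normalize by the (decayed) total frame count when reporting.
class LossTracker:
    def __init__(self, decay: float = 0.999):
        self.decay = decay
        self.frames = 0.0
        self.sums: dict[str, float] = {}

    def update(self, frames: float, **losses: float):
        self.frames = self.decay * self.frames + frames
        for k, v in losses.items():
            self.sums[k] = self.decay * self.sums.get(k, 0.0) + v * frames

    def report(self) -> dict[str, float]:
        return {k: s / self.frames for k, s in self.sums.items()}

# feeding in the batch-12500 values from the line above
t = LossTracker()
t.update(16632.0, loss=0.1677, simple_loss=0.2704, pruned_loss=0.03245)
print(t.report())  # per-frame averages over the decayed frame total
```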
2023-10-15 04:02:22,272 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1906314.6666666667, ans=0.125
2023-10-15 04:02:30,839 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=9.08 vs. limit=22.5
2023-10-15 04:02:37,513 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1906361.3333333333, ans=0.125
2023-10-15 04:02:39,237 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1906361.3333333333, ans=0.125
2023-10-15 04:02:41,594 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.18 vs. limit=12.0
2023-10-15 04:02:45,826 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.472e+02 1.864e+02 1.980e+02 2.123e+02 2.771e+02, threshold=3.960e+02, percent-clipped=0.0
2023-10-15 04:02:47,106 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1906408.0, ans=0.125
2023-10-15 04:02:51,655 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1906408.0, ans=0.0
2023-10-15 04:03:05,122 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1906501.3333333333, ans=0.0
2023-10-15 04:03:12,527 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1906501.3333333333, ans=0.125
2023-10-15 04:03:26,283 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1906594.6666666667, ans=0.0
2023-10-15 04:03:33,175 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1906594.6666666667, ans=0.0
2023-10-15 04:04:04,854 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1906734.6666666667, ans=0.2
2023-10-15 04:04:05,022 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1906734.6666666667, ans=0.125
2023-10-15 04:04:38,811 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.580e+02 1.822e+02 2.011e+02 2.232e+02 2.990e+02, threshold=4.023e+02, percent-clipped=0.0
2023-10-15 04:04:47,369 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.54 vs. limit=12.0
2023-10-15 04:05:04,464 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.50 vs. limit=10.0
2023-10-15 04:05:16,267 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1907014.6666666667, ans=0.125
2023-10-15 04:05:35,590 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1907108.0, ans=0.0
2023-10-15 04:05:41,755 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1907108.0, ans=0.0
2023-10-15 04:05:48,683 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1907154.6666666667, ans=0.0
2023-10-15 04:05:49,777 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1907154.6666666667, ans=0.125
2023-10-15 04:05:52,846 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1907154.6666666667, ans=0.125
2023-10-15 04:06:19,570 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1907248.0, ans=0.0
2023-10-15 04:06:26,484 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.37 vs. limit=22.5
2023-10-15 04:06:29,034 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1907294.6666666667, ans=0.125
2023-10-15 04:06:36,426 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1907341.3333333333, ans=0.1
2023-10-15 04:06:38,713 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.629e+02 1.845e+02 1.981e+02 2.125e+02 2.952e+02, threshold=3.963e+02, percent-clipped=0.0
2023-10-15 04:06:42,889 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1907341.3333333333, ans=0.125
2023-10-15 04:07:00,044 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=1907434.6666666667, ans=0.05
2023-10-15 04:07:11,167 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.61 vs. limit=6.0
2023-10-15 04:07:14,605 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1907481.3333333333, ans=0.1
2023-10-15 04:07:16,222 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1907481.3333333333, ans=0.0
2023-10-15 04:08:04,230 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1907668.0, ans=0.015
2023-10-15 04:08:08,555 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1907668.0, ans=10.0
2023-10-15 04:08:09,943 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.63 vs. limit=15.0
2023-10-15 04:08:38,287 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.552e+02 1.853e+02 2.074e+02 2.290e+02 3.022e+02, threshold=4.147e+02, percent-clipped=0.0
2023-10-15 04:08:45,315 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1907854.6666666667, ans=0.125
2023-10-15 04:08:51,790 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1907854.6666666667, ans=0.125
2023-10-15 04:08:53,752 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=1907854.6666666667, ans=0.09899494936611666
2023-10-15 04:09:01,988 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1907901.3333333333, ans=0.125
2023-10-15 04:09:04,725 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.39 vs. limit=22.5
2023-10-15 04:09:04,730 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=1907901.3333333333, ans=22.5
2023-10-15 04:09:08,262 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-15 04:09:13,815 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1907948.0, ans=0.0
2023-10-15 04:09:16,493 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1907948.0, ans=0.2
2023-10-15 04:09:52,691 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1908088.0, ans=0.2
2023-10-15 04:09:52,781 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1908088.0, ans=0.125
2023-10-15 04:10:02,950 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.43 vs. limit=15.0
2023-10-15 04:10:13,057 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1908181.3333333333, ans=0.2
2023-10-15 04:10:20,406 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1908181.3333333333, ans=0.125
2023-10-15 04:10:33,026 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.45 vs. limit=15.0
2023-10-15 04:10:39,553 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1908274.6666666667, ans=0.09899494936611666
2023-10-15 04:10:40,266 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.512e+02 1.822e+02 2.003e+02 2.172e+02 3.224e+02, threshold=4.006e+02, percent-clipped=0.0
2023-10-15 04:10:44,720 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=1908274.6666666667, ans=0.0
2023-10-15 04:11:00,498 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=1908368.0, ans=0.125
2023-10-15 04:11:02,456 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1908368.0, ans=0.2
2023-10-15 04:11:22,190 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.11 vs. limit=12.0
2023-10-15 04:11:22,336 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.58 vs. limit=6.0
2023-10-15 04:11:22,902 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1908461.3333333333, ans=0.2
2023-10-15 04:11:37,254 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1908508.0, ans=0.125
2023-10-15 04:12:08,863 INFO [train.py:1031] (0/4) Epoch 30, batch 13000, loss[loss=0.1937, simple_loss=0.2793, pruned_loss=0.05408, over 16924.00 frames. ], tot_loss[loss=0.1839, simple_loss=0.2772, pruned_loss=0.04531, over 32791712.93 frames. ], batch size: 72, lr: 1.13e-03, grad_scale: 16.0
2023-10-15 04:12:09,228 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1908648.0, ans=0.0
2023-10-15 04:12:10,289 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=15.17 vs. limit=15.0
2023-10-15 04:12:37,765 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1908741.3333333333, ans=0.125
2023-10-15 04:12:41,535 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.588e+02 1.881e+02 2.009e+02 2.323e+02 3.405e+02, threshold=4.018e+02, percent-clipped=0.0
2023-10-15 04:12:43,551 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.52 vs. limit=15.0
2023-10-15 04:13:26,678 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1908881.3333333333, ans=0.0
2023-10-15 04:13:53,493 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.86 vs. limit=15.0
2023-10-15 04:14:37,482 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1909161.3333333333, ans=0.0
2023-10-15 04:14:42,392 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.30 vs. limit=22.5
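grad_scale in the [train.py:1031] lines is the fp16 loss-scaling factor: it halved from 32.0 at batch 12500 to 16.0 at batch 13000 above, which is the signature of the AMP scaler backing off after detecting inf/nan gradients in a step. A generic PyTorch AMP step showing where such a value comes from; this is standard torch.cuda.amp usage, not train.py itself.

```python
# Hedged sketch of fp16 loss scaling: GradScaler multiplies the loss before
# backward, skips optimizer steps whose gradients overflow, halves the scale
# when that happens, and grows it back after a run of clean steps.
import torch

scaler = torch.cuda.amp.GradScaler(init_scale=32.0, growth_interval=2000)

def training_step(model, optimizer, batch, criterion):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = criterion(model(batch["inputs"]), batch["targets"])
    scaler.scale(loss).backward()
    scaler.step(optimizer)     # skipped (and scale halved) on inf/nan grads
    scaler.update()
    return scaler.get_scale()  # the value logged as "grad_scale"
```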
2023-10-15 04:14:56,478 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.587e+02 1.904e+02 2.113e+02 2.350e+02 3.214e+02, threshold=4.226e+02, percent-clipped=0.0
2023-10-15 04:15:06,744 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=1909254.6666666667, ans=22.5
2023-10-15 04:15:12,178 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1909254.6666666667, ans=0.1
2023-10-15 04:15:28,616 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.28 vs. limit=15.0
2023-10-15 04:15:30,885 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.21 vs. limit=22.5
2023-10-15 04:15:54,423 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1909441.3333333333, ans=0.1
2023-10-15 04:16:02,636 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1909441.3333333333, ans=0.0
2023-10-15 04:16:15,264 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1909488.0, ans=0.1
2023-10-15 04:16:18,705 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1909488.0, ans=0.125
2023-10-15 04:16:19,584 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=1909488.0, ans=0.2
2023-10-15 04:16:20,917 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=3.80 vs. limit=10.0
2023-10-15 04:16:22,518 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1909534.6666666667, ans=0.2
2023-10-15 04:16:59,311 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.95 vs. limit=22.5
2023-10-15 04:17:01,879 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.478e+02 1.822e+02 1.959e+02 2.153e+02 3.011e+02, threshold=3.917e+02, percent-clipped=0.0
2023-10-15 04:17:30,013 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1909768.0, ans=0.0
2023-10-15 04:17:31,096 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1909768.0, ans=0.125
2023-10-15 04:17:58,169 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1909908.0, ans=0.0
2023-10-15 04:18:02,137 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.47 vs. limit=22.5
2023-10-15 04:18:14,485 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1909954.6666666667, ans=0.0
2023-10-15 04:18:18,086 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.47 vs. limit=15.0
2023-10-15 04:18:26,101 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1910001.3333333333, ans=0.0
2023-10-15 04:18:29,090 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1910001.3333333333, ans=0.0
2023-10-15 04:18:38,779 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1910048.0, ans=0.035
2023-10-15 04:18:39,931 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=1910048.0, ans=0.07
2023-10-15 04:18:51,712 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.47 vs. limit=15.0
2023-10-15 04:19:05,916 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1910141.3333333333, ans=0.0
2023-10-15 04:19:06,585 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.641e+02 1.916e+02 2.061e+02 2.277e+02 3.457e+02, threshold=4.122e+02, percent-clipped=0.0
2023-10-15 04:19:21,254 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=1910188.0, ans=0.09899494936611666
2023-10-15 04:19:35,715 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1910281.3333333333, ans=0.0
2023-10-15 04:19:37,030 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1910281.3333333333, ans=0.0
2023-10-15 04:20:06,480 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1910374.6666666667, ans=0.125
2023-10-15 04:20:08,913 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-15 04:20:15,781 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1910421.3333333333, ans=0.125
2023-10-15 04:20:15,949 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-10-15 04:20:24,790 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1910421.3333333333, ans=0.0
2023-10-15 04:20:43,654 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1910514.6666666667, ans=0.1
2023-10-15 04:20:53,237 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1910561.3333333333, ans=0.1
2023-10-15 04:20:55,883 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1910561.3333333333, ans=0.1
2023-10-15 04:20:59,502 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1910561.3333333333, ans=10.0
2023-10-15 04:21:04,184 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1910608.0, ans=0.0
2023-10-15 04:21:05,168 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1910608.0, ans=0.0
2023-10-15 04:21:11,528 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.536e+02 1.854e+02 2.015e+02 2.235e+02 2.698e+02, threshold=4.029e+02, percent-clipped=0.0
2023-10-15 04:21:35,926 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1910701.3333333333, ans=0.0
2023-10-15 04:22:06,992 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.74 vs. limit=15.0
2023-10-15 04:22:29,841 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff3.min_abs, batch_count=1910934.6666666667, ans=0.2
2023-10-15 04:22:41,595 INFO [train.py:1031] (0/4) Epoch 30, batch 13500, loss[loss=0.1858, simple_loss=0.2848, pruned_loss=0.04335, over 16898.00 frames. ], tot_loss[loss=0.1836, simple_loss=0.2765, pruned_loss=0.0453, over 32782882.76 frames. ], batch size: 138, lr: 1.13e-03, grad_scale: 16.0
2023-10-15 04:22:41,955 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1910981.3333333333, ans=0.125
2023-10-15 04:22:46,714 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=1910981.3333333333, ans=0.0
2023-10-15 04:22:47,629 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1910981.3333333333, ans=0.0
2023-10-15 04:22:53,595 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.61 vs. limit=12.0
2023-10-15 04:23:04,136 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1911028.0, ans=0.1
2023-10-15 04:23:11,639 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.569e+02 1.864e+02 2.000e+02 2.170e+02 3.073e+02, threshold=4.000e+02, percent-clipped=0.0
2023-10-15 04:23:38,952 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-10-15 04:23:42,844 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=1911214.6666666667, ans=0.0
2023-10-15 04:23:48,649 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.34 vs. limit=22.5
2023-10-15 04:23:55,409 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1911261.3333333333, ans=0.125
2023-10-15 04:24:03,417 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1911308.0, ans=0.125
2023-10-15 04:24:04,662 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.85 vs. limit=6.0
2023-10-15 04:24:07,983 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-10-15 04:24:08,112 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1911308.0, ans=0.2
2023-10-15 04:24:08,344 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.76 vs. limit=6.0
2023-10-15 04:24:11,540 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1911308.0, ans=0.2
2023-10-15 04:24:13,777 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1911308.0, ans=0.1
2023-10-15 04:24:13,805 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1911308.0, ans=0.125
2023-10-15 04:24:20,270 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1911354.6666666667, ans=0.0
2023-10-15 04:24:28,674 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1911401.3333333333, ans=0.1
2023-10-15 04:24:32,268 INFO [scaling.py:1069] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-10-15 04:24:35,478 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1911401.3333333333, ans=0.1
2023-10-15 04:24:35,808 INFO [scaling.py:979] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.62 vs. limit=15.0
2023-10-15 04:24:39,947 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1911448.0, ans=0.1
2023-10-15 04:25:03,218 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1911541.3333333333, ans=0.2
2023-10-15 04:25:07,255 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.552e+02 1.849e+02 2.027e+02 2.303e+02 3.665e+02, threshold=4.053e+02, percent-clipped=0.0
2023-10-15 04:25:14,263 INFO [scaling.py:199] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1911588.0, ans=0.125
2023-10-15 04:25:40,271 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_XL_bpe/epoch-30.pt
2023-10-15 04:25:49,846 INFO [train.py:1246] (0/4) Done!
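The run ends by writing an epoch-level checkpoint (epoch-30.pt) shortly after the batch-indexed checkpoint-408000.pt seen earlier, before [train.py:1246] reports Done!. A sketch of that two-tier policy follows, with hypothetical helper names and an illustrative keep-last-k pruning rule; icefall's checkpoint.py stores more state (optimizer, sampler, scaler) than this minimal version.

```python
# Hedged sketch of the checkpoint policy visible in this log: a batch-indexed
# "checkpoint-<N>.pt" every fixed number of training batches, plus an
# "epoch-<E>.pt" at each epoch boundary, pruning old batch checkpoints so
# only the most recent k remain.
from pathlib import Path
from typing import Optional
import torch

def save_checkpoint(model, exp_dir: Path,
                    batch_idx_train: Optional[int] = None,
                    epoch: Optional[int] = None,
                    keep_last_k: int = 30) -> None:
    name = (f"checkpoint-{batch_idx_train}.pt" if batch_idx_train is not None
            else f"epoch-{epoch}.pt")
    torch.save({"model": model.state_dict()}, exp_dir / name)
    if batch_idx_train is not None:
        # prune older batch checkpoints, keeping the newest keep_last_k
        old = sorted(exp_dir.glob("checkpoint-*.pt"),
                     key=lambda p: int(p.stem.split("-")[1]))
        for p in old[:-keep_last_k]:
            p.unlink()
```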